What Are Skiplists Good For?
This article covers skiplists, an often-overlooked randomized data structure, and shows how an adaptation called a "skiptree" proved indispensable at Antithesis. It describes a clever, if complex, way to query hierarchical data efficiently in analytical databases such as Google BigQuery. The story will appeal to engineers who enjoy seeing fundamental computer science applied to real-world, high-performance problems.
The Lowdown
The author recounts his journey from dismissing skiplists as a niche data structure to discovering their profound utility in a generalized form, the "skiptree," at his company, Antithesis. This technical deep dive explains both the problem faced and the innovative solution derived from adapting an obscure data structure.
- Skiplists Explained: A skiplist is a probabilistic data structure that acts as a drop-in replacement for binary search trees, offering `O(log n)` performance. It functions as a linked list with multiple "express lanes" at progressively higher levels, allowing faster traversal, and is known for relatively simple concurrent implementations.
- The Antithesis Problem: Antithesis needed to analyze branching timelines generated by their fuzzer, which required frequent ancestor lookups in a tree-like data structure. Storing this tree in Google BigQuery, an analytical database optimized for scans, made each ancestor query inefficient: it took `O(depth)` point lookups, one for each step up the tree.
- Shortcomings of Traditional Solutions: Using an OLTP database for the tree structure alongside BigQuery for bulk data would introduce complex two-phase commit consistency issues, which the author wanted to avoid. BigQuery's loose consistency further complicated such an approach.
- Introducing Skiptrees: The solution was a novel data structure called a "skiptree," essentially a generalization of skiplists for trees. It involves a hierarchy of trees, where each path from root to leaf in the original tree forms a skiplist structure across these levels.
- Implementation & Benefits: Skiptrees were stored across multiple SQL tables (one for each level), using `next_level_ancestor` and `ancestors_between` columns. This allowed ancestor lookups to be performed using a fixed number of `JOIN` operations rather than recursive point lookups. While the resulting SQL queries were large, this approach leveraged BigQuery's pricing model (data scanned, not compute) and significantly reduced costs and improved performance for Antithesis for six years.
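
To make the per-level lookup concrete, here is a minimal in-memory sketch in Python. It is not the author's implementation: plain dictionaries stand in for the per-level SQL tables, and the names `Row`, `build_skiptree`, `ancestors`, `PARENT`, and `num_levels` are hypothetical. It also assumes that `ancestors_between` holds every original-tree node strictly between a row's node and its `next_level_ancestor`, and that nodes are promoted to higher levels by independent coin flips, as in an ordinary skiplist. Each iteration of the loop in `ancestors` plays the role of one `JOIN`.

```python
import random
from typing import Dict, List, NamedTuple, Optional


class Row(NamedTuple):
    # Nearest ancestor that also appears in the next level up (None if there is none).
    next_level_ancestor: Optional[str]
    # Assumed meaning: all original-tree nodes strictly between this node
    # and next_level_ancestor, nearest first.
    ancestors_between: List[str]


def build_skiptree(parent: Dict[str, Optional[str]],
                   num_levels: int = 4,
                   seed: int = 0) -> List[Dict[str, Row]]:
    """Build one table (dict) per level from a child -> parent mapping."""
    rng = random.Random(seed)
    # height[n] is the highest level that node n is promoted to.
    height = {}
    for node in parent:
        h = 0
        while h < num_levels - 1 and rng.random() < 0.5:
            h += 1
        height[node] = h

    tables = []
    for level in range(num_levels):
        table = {}
        for node in parent:
            if height[node] < level:
                continue  # node does not appear at this level
            between = []
            cur = parent[node]
            # Walk up until reaching a node that exists in the next level.
            while cur is not None and height[cur] <= level:
                between.append(cur)
                cur = parent[cur]
            table[node] = Row(cur, between)
        tables.append(table)
    return tables


def ancestors(tables: List[Dict[str, Row]], node: str) -> List[str]:
    """Collect all ancestors with at most one table lookup per level."""
    result = []
    cur: Optional[str] = node
    for table in tables:
        if cur is None:
            break
        row = table[cur]
        result.extend(row.ancestors_between)
        if row.next_level_ancestor is not None:
            result.append(row.next_level_ancestor)
        cur = row.next_level_ancestor
    return result


# A small branching timeline: "c" and "d" both branch off "b".
PARENT = {"root": None, "a": "root", "b": "a", "c": "b",
          "d": "b", "e": "d", "f": "e"}
TABLES = build_skiptree(PARENT)
print(ancestors(TABLES, "f"))
```

The number of lookups is bounded by `num_levels` rather than by the depth of the tree, which is the property the summary above attributes to the fixed number of `JOIN` operations. The cost is that the top-level rows can carry long `ancestors_between` lists.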
Ultimately, the author acknowledges that his "skiptree" shared similarities with existing "skip graphs," reinforcing the idea that innovative solutions often echo prior work. The core message is the unexpected value of obscure data structures and how even a somewhat naive implementation of a skiplist concept can offer significant performance gains in challenging scenarios.
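
For reference, the plain skiplist that the skiptree generalizes can be sketched in a few dozen lines of Python. This is an illustrative, single-threaded version and not code from the article: the class name `SkipList`, the `MAX_HEIGHT` cap of 16, and the promotion probability of 0.5 are assumptions chosen for the example.

```python
import random


class _Node:
    def __init__(self, key, height):
        self.key = key
        # forward[i] is the next node on express lane i (lane 0 is the full list).
        self.forward = [None] * height


class SkipList:
    MAX_HEIGHT = 16  # assumed cap on the number of lanes

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._head = _Node(None, self.MAX_HEIGHT)  # sentinel, holds no key
        self._height = 1  # number of lanes currently in use

    def _random_height(self):
        # Flip coins: each extra lane is granted with probability 1/2.
        h = 1
        while h < self.MAX_HEIGHT and self._rng.random() < 0.5:
            h += 1
        return h

    def _find_predecessors(self, key):
        # For every lane, find the last node whose key is < key.
        preds = [self._head] * self.MAX_HEIGHT
        node = self._head
        for lane in reversed(range(self._height)):
            while node.forward[lane] is not None and node.forward[lane].key < key:
                node = node.forward[lane]
            preds[lane] = node
        return preds

    def insert(self, key):
        preds = self._find_predecessors(key)
        nxt = preds[0].forward[0]
        if nxt is not None and nxt.key == key:
            return  # key already present
        h = self._random_height()
        self._height = max(self._height, h)
        new = _Node(key, h)
        # Splice the new node into each lane it participates in.
        for lane in range(h):
            new.forward[lane] = preds[lane].forward[lane]
            preds[lane].forward[lane] = new

    def __contains__(self, key):
        nxt = self._find_predecessors(key)[0].forward[0]
        return nxt is not None and nxt.key == key

    def __iter__(self):
        # Lane 0 visits every key in sorted order.
        node = self._head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]
```

A search starts on the highest lane in use and drops down one lane each time the next key would overshoot, which is where the expected logarithmic cost comes from.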