Skiplist discussions that don't talk about heap effects are probably incomplete. The big limitation with skiplists for anything that isn't a "mostly-write-once" workload (something that, it should be noted, doesn't get much benefit from the atomic implementation discussed) is that the metadata record is variable sized, which does bad things to heap fragmentation and cache locality (two things you probably care deeply about if you're going so far as to junk your AVL tree for a lockless data structure) and isn't amenable to the slab-style optimizations used elsewhere.
I love skiplists for their simplicity, especially compared with the balanced trees against which they compete. But the truth is they're very one-trick-pony data structures and don't bring a lot to the table as a general choice. Their home is pretty limited.
Skiplists can be very effective in disk structures for a few reasons.
For one, a sorted sequence of keys can be written to file in a single pass, and it's easy to interleave with the data itself. For example, Lucene uses multi-level skip lists in compressed posting list files in order to quickly jump to a term and then jump to the lowest matching document ID for that term. Since these files are immutable, the tree is only built once and all the data can be stored in sorted order, which has the added benefit that these files can also be merged in a single pass.
Append-only is a very convenient corner case for skiplists to optimize. I call it "skiplogs" because it looks really different from the general case. An average node can be 3 bytes, for example.
Doesn't the author mention using arenas?
Arenas are just heaps, and subject to all the same problems. It's true that in the LSM use case in question (and I need to point out that databases are not my area!) there is a clear "flush" operation that will put a limit on how much heap churn and fragmentation you need to tolerate. And maybe that's an answer that makes skiplists more desirable in this context. But things like long-lived app data and language runtimes don't have that, and skiplists turn into headaches.
A simpler consequence of the same problem: I mostly work in embedded stuff and skiplists are pretty much a non-starter as they can't easily be made intrusive without a bunch of waste.
Different sizes would be allocated in different arenas.
It's my understanding that in the context of append-only skip lists, like what RocksDB uses for memtables, you can see wasted space at the end of a block within an arena but no fragmentation between records within a block. SSTables are another story.
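For what it's worth, here's a minimal sketch of the kind of bump/arena allocator being described (in Go for illustration; RocksDB's real arena is C++ and more sophisticated, and all names here are my own). Allocations within a block are laid out back-to-back, so there's no fragmentation between records, but whatever doesn't fit at the end of the current block is simply stranded when a new block is started:

    package arena

    const blockSize = 4096

    // Arena hands out variable-sized allocations by bumping an offset within
    // fixed-size blocks. Nothing is freed individually; the whole arena is
    // dropped at once (e.g. when a memtable is flushed to an SSTable).
    type Arena struct {
        blocks [][]byte // blocks the bump pointer walks through
        large  [][]byte // oversized allocations, one dedicated block each
        off    int      // bump offset into the last element of blocks
        Wasted int      // bytes stranded at the ends of full blocks
    }

    // Alloc returns n contiguous bytes. A request that doesn't fit in the
    // current block strands that block's tail and starts a fresh block; those
    // tails are the only "fragmentation" this scheme has.
    func (a *Arena) Alloc(n int) []byte {
        if n > blockSize {
            // Oversized records get a dedicated block, kept out of the bump path.
            blk := make([]byte, n)
            a.large = append(a.large, blk)
            return blk
        }
        if len(a.blocks) == 0 || a.off+n > blockSize {
            if len(a.blocks) > 0 {
                a.Wasted += blockSize - a.off
            }
            a.blocks = append(a.blocks, make([]byte, blockSize))
            a.off = 0
        }
        blk := a.blocks[len(a.blocks)-1]
        p := blk[a.off : a.off+n]
        a.off += n
        return p
    }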
> It's generally my view that from a bird's eye view LSMs are simpler than on-disk Btrees, largely because they are more composable and a lot more of their data is immutable.
I dunno about this. On-disk B-trees can actually be pretty dang simple! Both conceptually and in implementation.
In concept, for an on-disk B-tree "table", you've got:
1. one growable file, or a sequence of fixed-size split files, abstracted over into a "vector of pages" type that you can (a) find the base pointer of, (b) overwrite a full page of, or (c) extend;
2. a "metapage" type, that works like the world's most trivial filesystem: it owns a "freelist" of page numbers, and keeps a pointer to the live root of the B-tree; and it embeds its own hash, and a version number.
3. a mechanism for discovering/choosing the newest live+valid metapage, essentially making this trivial "filesystem" into a trivial "journalling" filesystem. (in e.g. LMDB, you just have two metapages in pages 0 and 1, that are alternately written to, sort of like double-buffering; on DB open, the newer of the two by version is chosen if it's valid by hash; otherwise the older is used.)
4. a readers-writer lock (which doesn't need to be persisted to disk, because the on-disk file is never in a volatile/dirty state. You can just OS-advisory-lock the file — those locks disappear when the creator process dies, which is what you want here);
5. read-only transactions, that take a reader lock, discover the newest live metapage, dereference the B-tree root, and pass that pointer off to the user, letting them have at it, reading the B-tree arbitrarily as if it were an in-memory data structure;
6. read-write transactions, that take a writer lock, discover the newest live metapage, dereference the B-tree root, set up an in-memory dirtied pages map, and then let you read arbitrarily + write arbitrarily (with reads indirected through the dirtied-pages map, acting as an overlay);
7. Copy-on-Write updates to "clean" pages during rw txs, by cloning the clean page contents onto free[listed] pages, dirtying the clone with the update, and adding the dirty version of the page to the dirtied pages map, to overlay the original page;
8. a read-write tx commit op that "propagates" the changes from dirty pages, by Copy-on-Write rewriting the B-tree ancestors of the parent pages to point to the new dirty page numbers, until you derive a new B-tree root page — which you then create a new metapage for, fsync the file, and then write the metapage.
(Note that this last bit means that a read-write transaction will effectively implicitly "roll back" on crash/power cut — the data is written before the "journal entry" for it, so if the "journal" doesn't make it to disk, then the data updates — including the updated freelist — are "lost". All a failed tx has done is update free pages.)
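To make items 2, 3, and 8 a bit more concrete, here's a minimal sketch in Go (the language of the article's snippets) of a double-buffered metapage and the commit ordering. The field layout, the CRC32 checksum, and every name here are assumptions for illustration; this is not LMDB's actual format.

    package btreemeta

    import (
        "encoding/binary"
        "hash/crc32"
        "os"
    )

    const pageSize = 4096

    // Metapage is the trivial "filesystem" record: a version counter, the page
    // number of the live B-tree root, the head of the free-page list, and a
    // checksum over the rest. Two copies live in pages 0 and 1.
    type Metapage struct {
        Version  uint64
        RootPage uint64
        FreeHead uint64
        Checksum uint32 // crc32 over the three fields above
    }

    func (m *Metapage) encode() []byte {
        buf := make([]byte, pageSize)
        binary.LittleEndian.PutUint64(buf[0:], m.Version)
        binary.LittleEndian.PutUint64(buf[8:], m.RootPage)
        binary.LittleEndian.PutUint64(buf[16:], m.FreeHead)
        binary.LittleEndian.PutUint32(buf[24:], crc32.ChecksumIEEE(buf[:24]))
        return buf
    }

    func decodeMeta(buf []byte) (Metapage, bool) {
        m := Metapage{
            Version:  binary.LittleEndian.Uint64(buf[0:]),
            RootPage: binary.LittleEndian.Uint64(buf[8:]),
            FreeHead: binary.LittleEndian.Uint64(buf[16:]),
            Checksum: binary.LittleEndian.Uint32(buf[24:]),
        }
        return m, crc32.ChecksumIEEE(buf[:24]) == m.Checksum
    }

    // liveMeta picks the newer metapage (by version) that passes its checksum,
    // falling back to the other slot if one is corrupt: the "trivial
    // journalling filesystem" part of item 3.
    func liveMeta(f *os.File) (Metapage, int, error) {
        var best Metapage
        slot := -1
        buf := make([]byte, pageSize)
        for i := 0; i < 2; i++ {
            if _, err := f.ReadAt(buf, int64(i)*pageSize); err != nil {
                return Metapage{}, -1, err
            }
            if m, ok := decodeMeta(buf); ok && (slot < 0 || m.Version > best.Version) {
                best, slot = m, i
            }
        }
        return best, slot, nil
    }

    // commit follows the ordering from item 8: the CoW'd data pages already
    // sit in free pages of the file, so fsync them first, then publish the new
    // root by overwriting the older metapage slot, then fsync again. A crash
    // before the metapage write leaves the old metapage (and old tree) intact.
    func commit(f *os.File, prev Metapage, prevSlot int, newRoot, newFreeHead uint64) error {
        if err := f.Sync(); err != nil { // data pages reach disk before the "journal entry"
            return err
        }
        next := Metapage{Version: prev.Version + 1, RootPage: newRoot, FreeHead: newFreeHead}
        if _, err := f.WriteAt(next.encode(), int64(1-prevSlot)*pageSize); err != nil {
            return err
        }
        return f.Sync()
    }

The point is how little machinery the "journalling" amounts to: two fixed slots, a version number, and a checksum.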
Sure, that's eight things you need to implement. But none of them are complicated or unintuitive, the way that a "lock free concurrent skiplist" is. You could whiteboard any one of those abstractions above with a junior programmer, and they could probably figure out a basically-correct implementation just from the descriptions above.
And that means that the implementations of on-disk B-trees are usually pretty short-and-sweet.
LMDB, for example, is one C file (https://github.com/LMDB/lmdb/blob/mdb.master/libraries/liblm...) with ~8k lines of actual code (according to https://github.com/AlDanial/cloc).
Postgres's durable B-tree implementation (https://github.com/postgres/postgres/tree/master/src/backend...) is 22kloc, and ~10kloc of actual code.
LSMs and SSTables involve compaction in all write patterns that involve updates or deletes.
I don't know how RocksDB handles this, but this is a very, very non-trivial problem in Cassandra.
Let's say your update spans a row of data that is stored across multiple SSTables (the row of data, that is a set of column values, was formed by multiple updates spanning multiple flushes of the memtable to disk/SSTables).
So either, as part of your update, you will be compacting (that is, rewriting) multiple SSTables of unknown and possibly very large size into a new SSTable with the new updates, or you must employ some means of timestamping the individual values and, in the case of deletes, something like tombstones, delete markers, or timestamps.
Then your read engine needs to be able to read multiple SSTables and apply update ordering in order to find the most recent actual value.
This results in I/O stress and reduced performance in the read path.
LSMs exist to scale writes to large indexes. Essentially what you're doing is queuing up the updates for a future process (compaction) to clean up.
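As a rough illustration of the read-path merge being described, here's a sketch in Go with a drastically simplified cell model (per-value timestamps plus tombstone flags); the real Cassandra and RocksDB formats are far more involved, and every name here is made up for the example:

    package lsmread

    // Cell is a single column value as it might appear in one SSTable:
    // a write timestamp plus either a value or a tombstone (delete marker).
    type Cell struct {
        Timestamp int64
        Value     []byte
        Tombstone bool
    }

    // Resolve merges the versions of one column found across several SSTables
    // and returns the live value, if any. The rule is "highest timestamp
    // wins", with a tombstone suppressing the value entirely. This is why
    // reads may have to touch every SSTable that could contain the row until
    // compaction folds them together.
    func Resolve(versions []Cell) ([]byte, bool) {
        var newest *Cell
        for i := range versions {
            if newest == nil || versions[i].Timestamp > newest.Timestamp {
                newest = &versions[i]
            }
        }
        if newest == nil || newest.Tombstone {
            return nil, false // never written, or deleted
        }
        return newest.Value, true
    }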
In Subversion, a write replaced a node, then replaced its parent node, all the way up to the root. So anyone doing a read in the middle of a merge only sees the old data, because it saw the old root node. I assume MVCC systems do something similar, but for others I would think a lock per node allows for a lot less contention. And your project then relies very much on good indexes to avoid reading rows, and thus parent nodes, you don't need.
Most MVCC systems, but also some filesystems (ZFS) and various thread-safe functional data structures, use a similar pattern: instead of actually modifying a tree, they make new nodes that refer to old and new nodes, then finally replace the tree root pointer atomically.
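A minimal sketch of that pattern, assuming Go and a plain unbalanced binary tree for brevity (ZFS, MVCC engines, and persistent-collection libraries do fancier versions of the same idea): writers copy only the nodes along the path they touch, then publish the new root with a single atomic store, so concurrent readers keep walking a consistent old snapshot.

    package cowtree

    import "sync/atomic"

    type node struct {
        key         int
        left, right *node
    }

    // Tree readers load the root atomically and then walk an immutable snapshot.
    type Tree struct {
        root atomic.Pointer[node]
    }

    // insert copies every node on the path from the root to the insertion
    // point ("path copying"); untouched subtrees are shared with the old version.
    func insert(n *node, key int) *node {
        if n == nil {
            return &node{key: key}
        }
        cp := *n // copy this node; child pointers are still shared
        if key < n.key {
            cp.left = insert(n.left, key)
        } else if key > n.key {
            cp.right = insert(n.right, key)
        }
        return &cp
    }

    // Insert builds the new version and swaps the root pointer atomically.
    // With a single writer (or an external writer lock) this is all you need;
    // readers that loaded the old root keep a valid, immutable tree.
    func (t *Tree) Insert(key int) {
        oldRoot := t.root.Load()
        t.root.Store(insert(oldRoot, key))
    }

    // Contains runs entirely against one snapshot of the tree.
    func (t *Tree) Contains(key int) bool {
        for n := t.root.Load(); n != nil; {
            switch {
            case key < n.key:
                n = n.left
            case key > n.key:
                n = n.right
            default:
                return true
            }
        }
        return false
    }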
In the multi writer skiplist code I think there's a leftover line from the previous example:
atomic.StorePointer(&prev.next, node)
Apart from that I really like the simulations!
whoops! thank you!
When I read this post last week I started wondering if a better data structure could be an adaptive radix tree, but I never finished thinking about it or returned to the article.
Reason being, they are very fast for linear scans.
Persistent / CoW etc versions of them exist.
They're not too hard to implement.
Curious if this approach has been tried. My reading has mostly found ARTs used as indexes for in-memory DBs.
CoW adaptive radix trees are the entire basis of $WORK's in-memory database -- we use them to store everything, then use boring old btrees to handle sorting data in arbitrary ways. A nice, performant persistent CoW radix tree would be a nice thing to have in my back pocket.
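For anyone curious what the CoW part looks like, here's the smallest version of the idea I can write down: a copy-on-write byte-keyed trie with path copying on insert. This is a sketch under heavy assumptions: Go, a plain 256-way node rather than ART's adaptive Node4/16/48/256 sizing, and no path compression, so it shows the CoW mechanics but not the "adaptive" part.

    package cowtrie

    // node is a plain 256-way trie node; a real ART would adaptively switch
    // between Node4/Node16/Node48/Node256 layouts and compress common prefixes.
    type node struct {
        children [256]*node
        value    any
        hasValue bool
    }

    // Insert returns the root of a new version of the trie with key bound to
    // value. Only the nodes along the key's path are copied; every other
    // subtree is shared with the old version, so old roots stay readable.
    func Insert(root *node, key []byte, value any) *node {
        var cp node
        if root != nil {
            cp = *root // shallow copy: child pointers are shared
        }
        if len(key) == 0 {
            cp.value, cp.hasValue = value, true
            return &cp
        }
        cp.children[key[0]] = Insert(cp.children[key[0]], key[1:], value)
        return &cp
    }

    // Get walks whichever version of the trie it was handed.
    func Get(root *node, key []byte) (any, bool) {
        n := root
        for _, b := range key {
            if n == nil {
                return nil, false
            }
            n = n.children[b]
        }
        if n == nil || !n.hasValue {
            return nil, false
        }
        return n.value, true
    }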