Lock-free skip list map/set (#27)

Stjepan Glavina · web-flow · commit f8aa26db7b18 · 2018-02-24T23:52:14.000+01:00
* Lock-free skip list map/set

* Elaborate on alternative map/set interfaces

* Add get_with as another alternative
diff --git a/text/2018-01-14-skiplist.md b/text/2018-01-14-skiplist.md
@@ -0,0 +1,316 @@
+# Summary
+
+Introduce crate `crossbeam-skiplist` containing a lock-free skip list.
+
+The crate provides an ordered map and an ordered set akin to `BTreeMap` and `BTreeSet`.
+In the future, we might want to build other data structures on top of the
+skip list as well, for example a priority queue.
+
+# Motivation
+
+These are the first concurrent map and set data structures to be added to Crossbeam.
+
+Skip list is often touted as a relatively easy data structure to make concurrent, or at least
+easy compared to other maps/sets. However, supporting the remove operation in a language without
+GC and coming up with an API that isn't overly restrictive is very difficult.
+
+This crate aims to provide a concurrent map/set that aims to be as powerful and
+ergonomic as `BTreeMap`/`BTreeSet`. The API must be reasonably easy to use and provide all
+operations one would expect from any other map/set without jumping through hoops.
+
+A good example of that is Java - one can simply replace any use of [`TreeMap`] with
+[`ConcurrentSkipListMap`] and expect the code to *just work*. The concurrent map in Java is an
+almost perfect drop-in replacement for the non-concurrent one. Unfortunately, I don't think
+we can achieve quite the same thing in Rust - and the reason is that Java and Rust have much
+different memory models (e.g. Java doesn't have move semantics and pervasively allocates objects
+on the heap). With that said, I believe we can still model a lock-free map that mimics
+`BTreeMap` fairly closely.
+
+Regarding performance, a skip list is fundamentally disadvantaged compared to a B-tree.
+Every node in a skip list is separately allocated on the heap, while a B-tree
+allocates nodes in large blocks, thus greatly improving cache utilization. The problem
+of scattered skip list nodes in memory can be somewhat mitigated using custom allocators
+(by trying to allocate adjacent nodes in a skip list as close as possible in memory), but
+typically with great difficulty and underwhelming results.
+
+One can think of a B-tree as a kind of a compacting garbage collector. Consider what
+happens when a B-tree block becomes full: a new block may be allocated
+and elements are redistributed among blocks as needed. This is reminiscent of compacting
+garbage collectors. Note that moving elements in memory makes concurrency more difficult in Rust:
+a thread cannot hold a reference to an element if another thread may potentially move it
+to a different location in memory at the same time.
+
+Skip lists, however, allocate each node separately on the heap. A node contains a key, a value,
+and a tower of next-pointers. The node is never moved to a different location in memory - once
+allocated, it stays there until it is is destroyed. This makes cache utilization worse,
+but also makes borrowing elements in presence of parallel modify operations easier.
+
+Long story short, a lock-free skip list will scale much better than a mutex-protected `BTreeMap`,
+but in single-threaded scenarios it will have no chance competing with `BTreeMap` due to poorer
+cache utilization.
+
+[`ConcurrentSkipListMap`]: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentSkipListMap.html
+[`TreeMap`]: https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html
+
+### Previous work
+
+Notable implementations of concurrent skip lists in other languages:
+
+1. [java.util.concurrent](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/concurrent/ConcurrentSkipListMap.java#ConcurrentSkipListMap) (Java):
+   Has the most extensive API - and really feels like a drop-in replacement for any other map.
+   However, the implementation is not particularly efficient and has a few interesting quirks.
+   For example, each pointer in a tower is separately allocated. Another one is - instead of
+   tagging pointers to mark a node as deleted, a dummy successor node is allocated.
+
+2. [libcds](https://github.com/khizmax/libcds/blob/19af81b7c61480ed705b91b4d01ee5d717a97cd2/cds/intrusive/skip_list_rcu.h) (C++):
+   Fairly complete and general API (one can even choose between no GC, EBR, and HP).
+
+3. [RocksDB](https://github.com/facebook/rocksdb/blob/68829ed89cec64186557dc0860fc693c118ff1c6/memtable/skiplist.h) (C++):
+   The skip list does not support removal nor multiple concurrent inserts. However, an ongoing
+   insert operation does not block other threads from reading the skip list. Once the skip list
+   becomes full, it is flushed to disk-based storage and a new skip list is constructed to replace
+   the old one.
+
+4. [Folly](https://github.com/facebook/folly/blob/98d1077ce0603b0713353d638cb1436a28827af6/folly/ConcurrentSkipList.h) (C++):
+   A concurrent skip list, but not lock-free: it uses per-node locking.
+   Also, removed nodes are not freed until the skip list is destroyed.
+
+5. [libgee](https://github.com/GNOME/libgee/blob/da95e830524ffa309eb57925320666e5085b9d66/gee/concurrentset.vala) (Vala):
+   A hazard pointer-based skip list. Looks very interesting.
+
+There are also several concurrent skip lists in Rust, but none of them have been published to crates.io so far
+and look like works in progress:
+
+1. [danburkert/pawn](https://github.com/danburkert/pawn/blob/8b6806d944d830f552d496cd3ee605d1707fdc51/src/util/skip_list.rs) (Rust):
+   A rather old insert-only lock-free skip list. Looks like an abandoned project.
+
+2. [Vtec234/lists-rs](https://github.com/Vtec234/lists-rs/blob/f83e516039dc4a421172af1cdbdcec85b0e73d74/src/epoch_skiplist.rs) (Rust):
+   A lock-free skip list that supports remove and uses Crossbeam for memory reclamation.
+   Interestingly, keys are always hashed so it's technically a hash map.
+ 
+3. [boats/skiplist](https://gitlab.com/boats/skiplist/tree/master/src/skiplist) (Rust):
+   Insert-only lock-free skip list by withoutboats. Published very recently.
+
+# Detailed design
+
+The proposed implementation is currently residing in [stjepang/skiplist](https://github.com/stjepang/skiplist),
+but will be moved into a new repository `crossbeam-rs/crossbeam-skiplist`. It is a
+lock-free skip list using epoch-based memory reclamation from `crossbeam-epoch`.
+
+The implementation is based on the following work:
+
+1. [Practical lock-freedom](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf)
+   (see *4.3.3 CAS-based design*)
+
+2. [Linked Lists: Locking, Lock-Free and Beyond...](http://janvitek.org/events/TiC06/B-SLIDES/mh2.pdf)
+
+The codebase consists of three main source files:
+
+* [`base.rs`](https://github.com/stjepang/skiplist/blob/master/src/base.rs) -
+  Contains the base skip list implementation details. This file does not attempt to
+  expose something ergonomic, but instead aims to provide a skip list 'engine' that
+  is intended to be wrapped into a nicer interface.
+
+* [`map.rs`](https://github.com/stjepang/skiplist/blob/master/src/map.rs) -
+  Wraps the base implementation into a map interface similar to `BTreeMap`.
+
+* [`set.rs`](https://github.com/stjepang/skiplist/blob/master/src/set.rs) -
+  Wraps the base implementation into a set interface similar to `BTreeSet`.
+
+**Note:** These map and set wrappers are just tentative interfaces - they're
+finished, and there's a possibility we'll want to completely change them.
+For now, consider them just a proof of concept.
+
+## Tentative map API
+
+Here's a quick demo. The following code is the [first example](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html#examples)
+taken from `BTreeMap`'s documentation, except it uses `SkipMap` instead of `BTreeMap`.
+A few other minor changes were required to make it compile, but other than that it doesn't depart
+too far from the original:
+
+```rust
+// type inference lets us omit an explicit type signature (which
+// would be `SkipMap<&str, &str>` in this example).
+let movie_reviews = SkipMap::new();
+
+// review some movies.
+movie_reviews.insert("Office Space",       "Deals with real issues in the workplace.");
+movie_reviews.insert("Pulp Fiction",       "Masterpiece.");
+movie_reviews.insert("The Godfather",      "Very enjoyable.");
+movie_reviews.insert("The Blues Brothers", "Eye lyked it alot.");
+
+// check for a specific one.
+if !movie_reviews.contains_key("Les Misérables") {
+    println!("We've got {} reviews, but Les Misérables ain't one.",
+             movie_reviews.len());
+}
+
+// oops, this review has a lot of spelling mistakes, let's delete it.
+movie_reviews.remove("The Blues Brothers");
+
+// look up the values associated with some keys.
+let to_find = ["Up!", "Office Space"];
+for book in &to_find {
+    match movie_reviews.get(book) {
+       Some(entry) => println!("{}: {}", book, entry.value()),
+       None => println!("{} is unreviewed.", book)
+    }
+}
+
+// iterate over everything.
+for entry in &movie_reviews {
+    let movie = entry.key();
+    let review = entry.value();
+    println!("{}: \"{}\"", movie, review);
+}
+```
+
+Take a look at [map.rs](https://github.com/stjepang/skiplist/blob/master/src/map.rs) to
+see the full interface of `SkipMap`.
+
+An interesting difference from `BTreeMap` is that methods like `insert` and `get` return
+an `Entry<'a, K, V>`, which is essentially just a reference-counted pointer to an entry in
+the skip list. Note that it is possible to hold an entry and remove it at the same time (you
+can even call `entry.remove()`), but the actual contents of the entry will not be destroyed
+before the last reference is dropped.
+
+### Performance
+
+It has already been mentioned that `SkipMap` will have a hard time competing with `BTreeMap` in
+single-threaded scenarios.
+Let's see that through a simple benchmark that just inserts a million pseudorandom
+numbers into a map. This is a very unscientific benchmark, but it should at least give
+us a feeling for how different map implementations fare against each other.
+
+Machine: Intel Core i7-5600U (2 physical cores, 4 logical cores)
+
+First, here's `BTreeMap` in three different scenarios:
+
+* [`BTreeMap<u64, u64>` (1 thread)](https://gist.github.com/stjepang/9b1bf73c2fdb0309cefda66b91f633dd): 315 ms
+* [`Mutex<BTreeMap<u64, u64>>` (1 thread)](https://gist.github.com/stjepang/437b82134b401d3fa2c9c439a003c1ea): 321 ms
+* [`Mutex<BTreeMap<u64, u64>>` (2 threads)](https://gist.github.com/stjepang/66000dfae15c8046b91ff3612c7d881f): 752 ms
+
+Notice how there is very little overhead of locking if only one thread is used. However, as soon as
+we add more threads, contended locking brings a huge penalty on performance.
+
+But `SkipMap` doesn't suffer from the same problem. In fact, adding more threads improves performance:
+
+* [`SkipMap<u64, u64>` (1 thread)](https://gist.github.com/stjepang/1980ab811009e94f2adfe8b230c20047): 1028 ms
+* [`SkipMap<u64, u64>` (2 threads)](https://gist.github.com/stjepang/a3f8f6dddac56d43e7dbfb2928cd3bfe): 561 ms
+
+Let's also see some numbers for a mutex-protected `std::map` in C++:
+
+* [`std::map<uint64_t, uint64_t>` (1 thread)](https://gist.github.com/stjepang/6aa80020b6edac1f6ea9af518e4ad989): 881 ms
+* [`std::map<uint64_t, uint64_t>` (2 threads)](https://gist.github.com/stjepang/b172a4259c0439d2855bc68fd47b3ab7): 1127 ms
+
+And here are mutex-protected `TreeMap` and `ConcurrentSkipListMap` in Java:
+
+* [`TreeMap<long, long> (1 thread)`](https://gist.github.com/stjepang/3bc21528f5cf82ecd564778f8a861b11): 1211 ms
+* [`TreeMap<long, long> (2 threads)`](https://gist.github.com/stjepang/da69ad273ea2cf2e13b4322c0ea6bd74): 1409 ms
+* [`ConcurrentSkipListMap<long, long> (1 thread)`](https://gist.github.com/stjepang/f6f289c07759f47a72b0565fd6b992c7): 2181 ms
+* [`ConcurrentSkipListMap<long, long> (2 threads)`](https://gist.github.com/stjepang/74d1abc7230ad6e6dd0c4aec1f4cab4b): 1353 ms
+
+The bottom line is: in single-threaded scenarios `SkipMap` should be comparable in performance
+to any typical binary search tree (although not to a B-tree). As we add more threads, it seems to
+scale quite well. I don't have a machine with a high number of cores to test scalability more thoroughly,
+but these numbers seem promising so far nonetheless.
+
+### Iteration
+
+The skip list supports easy iteration. Note that when iterating over a `SkipMap` we hand out an `Entry`
+for each entry in it. Creating an entry involves incrementing its reference count, and when moving
+from one entry to another we also have to pin the current thread. This is a lot of reference count
+updating and a lot of pinning.
+
+Here are some benchmark numbers for iterating over a million randomly inserted entries:
+
+* `BTreeMap` (Rust): 18 ms
+* `SkipMap` (Rust): 113 ms
+* `std::map` (C++): 93 ms
+* `TreeMap` (Java): 41 ms
+* `ConcurrentSkipListMap` (Java): 32 ms
+
+Interesting observations:
+
+* Iteration over a `BTreeMap` is very fast - this shouldn't be surprising since adjacent
+  elements are grouped into blocks.
+
+* `SkipMap` is the slowest map here. I've tried measuring how long iteration takes withut
+  reference count updating and without pinning, and it turns out to be around 95 ms. That is
+  very similar to `std::map` in C++. Also, reference counting and pinning definitely incurs
+  a measurable cost, but it's not a *terrible* one.
+
+* Java is fast - even `TreeMap` is faster than `std::map` in C++. How is that possible?
+  Well, the answer lies in the fact that Java's GC kicks from time to time, moves
+  allocated nodes in memory (it's a compacting GC), and tries to lay out linked nodes
+  as close as possible, thus optimizing for cache efficiency.
+
+* Let's try tuning Java's GC by using option `-XX:NewSize=1024m`. This option sets the
+  size of the new generation to 1024 MB (a huge number), which means compaction should
+  never kick in. Indeed, iteration timings are much different now - with `TreeMap` it takes
+  124 ms and with `ConcurrentSkipListMap` it takes 110 ms. Now that's much closer to `SkipMap`
+  and `std::map`.
+
+### The cost of reference counting in `Entry`
+
+When iterating over a skip list we use `Entry`s, which are essentially reference-counted
+pointers to skip list nodes. That means iterating over 100 elements incurs the
+cost of 200 atomic increments and 200 atomic decrements.
+
+Methods that insert, remove, or search for an element return `Entry`, which means
+they too incur some cost spent on incrementing and decrementing a node's reference count.
+
+The current skip list implementation doesn't provide alternative methods that avoid
+reference counting (i.e. avoid using `Entry`), but in the future we should discuss how
+to add them. Broadly speaking, there are three general alternatives to entries:
+clones, guards, and closures. Here's an illustration on the `SkipMap::get` method:
+
+```rust
+// Reference counting: return an `Entry`.
+//
+// This is the method signature we currently have.
+fn get(&self, k: &K) -> Option<Entry<K, V>>;
+
+// Alternative #1: return a clone of the element.
+//
+// This means we're paying the price of cloning, but that's
+// not a problem if cloning is cheap.
+fn get_clone(&self, k: &K) -> Option<V> where V: Clone;
+
+// Alternative #2: return a guarded reference to the element
+// keeping the thread pinned.
+//
+// The main drawback here is that the user must be careful
+// not to keep the guard live for too long, or else garbage
+// collection will get stuck.
+fn get_guard(&self, k: &K) -> Option<Guard<K, V>>;
+
+// Alternative #3: take a closure that does something with
+// the found element while the thread is still pinned.
+//
+// Again, the drawback here is that the user must be careful
+// not to keep the closure running for too long, or else
+// garbage collection will get stuck.
+fn get_with<F: FnOnce(&V)>(&self, k: &K, f: F);
+```
+
+# Drawbacks
+
+Skip lists are not very exciting when it comes to performance. Hash tables, B-trees (Bw-Tree is
+a lock-free B-tree variant), and radix trees (ART - adaptive radix tree can be made concurrent)
+are usually more performant. However, these faster data structures are not as general as skip lists
+and have to make sacrifices by restricting the set of supported operations or by making the API less ergonomic.
+
+# Alternatives
+
+A few possible similar but alternative data structures might be:
+
+1. Adaptive radix tree (keys can only be byte arrays).
+2. Skip tree (moves elements in memory, thus constraining the API).
+3. Bw-Tree (moves elements in memory, thus constraining the API).
+
+# Unresolved questions
+
+* Should `Entry` be renamed to `Cursor`?
+* How do we make iteration faster by avoiding reference counting?
+* What alternatives to the `Entry` API do we need and how to incorporate them?