|
| 1 | +# Summary |
| 2 | + |
| 3 | +Introduce crate `crossbeam-skiplist` containing a lock-free skip list. |
| 4 | + |
| 5 | +The crate provides an ordered map and an ordered set akin to `BTreeMap` and `BTreeSet`. |
| 6 | +In the future, we might want to build other data structures on top of the |
| 7 | +skip list as well, for example a priority queue. |
| 8 | + |
| 9 | +# Motivation |
| 10 | + |
| 11 | +These are the first concurrent map and set data structures to be added to Crossbeam. |
| 12 | + |
| 13 | +The skip list is often touted as a relatively easy data structure to make concurrent, or at least |
| 14 | +easy compared to other maps/sets. However, supporting the remove operation in a language without |
| 15 | +GC and coming up with an API that isn't overly restrictive is very difficult. |
| 16 | + |
| 17 | +This crate aims to provide a concurrent map/set that is as powerful and |
| 18 | +ergonomic as `BTreeMap`/`BTreeSet`. The API must be reasonably easy to use and provide all |
| 19 | +operations one would expect from any other map/set without jumping through hoops. |
| 20 | + |
| 21 | +A good example of that is Java - one can simply replace any use of [`TreeMap`] with |
| 22 | +[`ConcurrentSkipListMap`] and expect the code to *just work*. The concurrent map in Java is an |
| 23 | +almost perfect drop-in replacement for the non-concurrent one. Unfortunately, I don't think |
| 24 | +we can achieve quite the same thing in Rust - and the reason is that Java and Rust have very |
| 25 | +different memory models (e.g. Java doesn't have move semantics and pervasively allocates objects |
| 26 | +on the heap). With that said, I believe we can still model a lock-free map that mimics |
| 27 | +`BTreeMap` fairly closely. |
| 28 | + |
| 29 | +Regarding performance, a skip list is fundamentally disadvantaged compared to a B-tree. |
| 30 | +Every node in a skip list is separately allocated on the heap, while a B-tree |
| 31 | +allocates nodes in large blocks, thus greatly improving cache utilization. The problem |
| 32 | +of scattered skip list nodes in memory can be somewhat mitigated using custom allocators |
| 33 | +(by trying to allocate adjacent nodes in a skip list as close as possible in memory), but |
| 34 | +typically with great difficulty and underwhelming results. |
| 35 | + |
| 36 | +One can think of a B-tree as a kind of a compacting garbage collector. Consider what |
| 37 | +happens when a B-tree block becomes full: a new block may be allocated |
| 38 | +and elements are redistributed among blocks as needed. This is reminiscent of compacting |
| 39 | +garbage collectors. Note that moving elements in memory makes concurrency more difficult in Rust: |
| 40 | +a thread cannot hold a reference to an element if another thread may potentially move it |
| 41 | +to a different location in memory at the same time. |
| 42 | + |
| 43 | +Skip lists, however, allocate each node separately on the heap. A node contains a key, a value, |
| 44 | +and a tower of next-pointers. The node is never moved to a different location in memory - once |
| 45 | +allocated, it stays there until it is destroyed. This makes cache utilization worse, |
| 46 | +but also makes borrowing elements in the presence of concurrent modifications easier. |
| 47 | + |
| 48 | +Long story short, a lock-free skip list will scale much better than a mutex-protected `BTreeMap`, |
| 49 | +but in single-threaded scenarios it will have no chance competing with `BTreeMap` due to poorer |
| 50 | +cache utilization. |
| 51 | + |
| 52 | +[`ConcurrentSkipListMap`]: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentSkipListMap.html |
| 53 | +[`TreeMap`]: https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html |
| 54 | + |
| 55 | +### Previous work |
| 56 | + |
| 57 | +Notable implementations of concurrent skip lists in other languages: |
| 58 | + |
| 59 | +1. [java.util.concurrent](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/concurrent/ConcurrentSkipListMap.java#ConcurrentSkipListMap) (Java): |
| 60 | + Has the most extensive API - and really feels like a drop-in replacement for any other map. |
| 61 | + However, the implementation is not particularly efficient and has a few interesting quirks. |
| 62 | + For example, each pointer in a tower is separately allocated. Another one is - instead of |
| 63 | + tagging pointers to mark a node as deleted, a dummy successor node is allocated. |
| 64 | + |
| 65 | +2. [libcds](https://github.com/khizmax/libcds/blob/19af81b7c61480ed705b91b4d01ee5d717a97cd2/cds/intrusive/skip_list_rcu.h) (C++): |
| 66 | + Fairly complete and general API (one can even choose between no GC, EBR, and HP). |
| 67 | + |
| 68 | +3. [RocksDB](https://github.com/facebook/rocksdb/blob/68829ed89cec64186557dc0860fc693c118ff1c6/memtable/skiplist.h) (C++): |
| 69 | + The skip list supports neither removal nor multiple concurrent inserts. However, an ongoing |
| 70 | + insert operation does not block other threads from reading the skip list. Once the skip list |
| 71 | + becomes full, it is flushed to disk-based storage and a new skip list is constructed to replace |
| 72 | + the old one. |
| 73 | + |
| 74 | +4. [Folly](https://github.com/facebook/folly/blob/98d1077ce0603b0713353d638cb1436a28827af6/folly/ConcurrentSkipList.h) (C++): |
| 75 | + A concurrent skip list, but not lock-free: it uses per-node locking. |
| 76 | + Also, removed nodes are not freed until the skip list is destroyed. |
| 77 | + |
| 78 | +5. [libgee](https://github.com/GNOME/libgee/blob/da95e830524ffa309eb57925320666e5085b9d66/gee/concurrentset.vala) (Vala): |
| 79 | + A hazard pointer-based skip list. Looks very interesting. |
| 80 | + |
| 81 | +There are also several concurrent skip lists in Rust, but none of them has been published to crates.io so far, |
| 82 | +and they all look like works in progress: |
| 83 | + |
| 84 | +1. [danburkert/pawn](https://github.com/danburkert/pawn/blob/8b6806d944d830f552d496cd3ee605d1707fdc51/src/util/skip_list.rs) (Rust): |
| 85 | + A rather old insert-only lock-free skip list. Looks like an abandoned project. |
| 86 | + |
| 87 | +2. [Vtec234/lists-rs](https://github.com/Vtec234/lists-rs/blob/f83e516039dc4a421172af1cdbdcec85b0e73d74/src/epoch_skiplist.rs) (Rust): |
| 88 | + A lock-free skip list that supports remove and uses Crossbeam for memory reclamation. |
| 89 | + Interestingly, keys are always hashed so it's technically a hash map. |
| 90 | + |
| 91 | +3. [boats/skiplist](https://gitlab.com/boats/skiplist/tree/master/src/skiplist) (Rust): |
| 92 | + Insert-only lock-free skip list by withoutboats. Published very recently. |
| 93 | + |
| 94 | +# Detailed design |
| 95 | + |
| 96 | +The proposed implementation is currently residing in [stjepang/skiplist](https://github.com/stjepang/skiplist), |
| 97 | +but will be moved into a new repository `crossbeam-rs/crossbeam-skiplist`. It is a |
| 98 | +lock-free skip list using epoch-based memory reclamation from `crossbeam-epoch`. |
| 99 | + |
| 100 | +The implementation is based on the following work: |
| 101 | + |
| 102 | +1. [Practical lock-freedom](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf) |
| 103 | + (see *4.3.3 CAS-based design*) |
| 104 | + |
| 105 | +2. [Linked Lists: Locking, Lock-Free and Beyond...](http://janvitek.org/events/TiC06/B-SLIDES/mh2.pdf) |
| 106 | + |
| 107 | +The codebase consists of three main source files: |
| 108 | + |
| 109 | +* [`base.rs`](https://github.com/stjepang/skiplist/blob/master/src/base.rs) - |
| 110 | + Contains the base skip list implementation details. This file does not attempt to |
| 111 | + expose something ergonomic, but instead aims to provide a skip list 'engine' that |
| 112 | + is intended to be wrapped into a nicer interface. |
| 113 | + |
| 114 | +* [`map.rs`](https://github.com/stjepang/skiplist/blob/master/src/map.rs) - |
| 115 | + Wraps the base implementation into a map interface similar to `BTreeMap`. |
| 116 | + |
| 117 | +* [`set.rs`](https://github.com/stjepang/skiplist/blob/master/src/set.rs) - |
| 118 | + Wraps the base implementation into a set interface similar to `BTreeSet`. |
| 119 | + |
| 120 | +**Note:** These map and set wrappers are just tentative interfaces - they're |
| 121 | +unfinished, and there's a possibility we'll want to completely change them. |
| 122 | +For now, consider them just a proof of concept. |
| 123 | + |
| 124 | +## Tentative map API |
| 125 | + |
| 126 | +Here's a quick demo. The following code is the [first example](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html#examples) |
| 127 | +taken from `BTreeMap`'s documentation, except it uses `SkipMap` instead of `BTreeMap`. |
| 128 | +A few other minor changes were required to make it compile, but other than that it doesn't depart |
| 129 | +too far from the original: |
| 130 | + |
| 131 | +```rust |
| 132 | +// type inference lets us omit an explicit type signature (which |
| 133 | +// would be `SkipMap<&str, &str>` in this example). |
| 134 | +let movie_reviews = SkipMap::new(); |
| 135 | + |
| 136 | +// review some movies. |
| 137 | +movie_reviews.insert("Office Space", "Deals with real issues in the workplace."); |
| 138 | +movie_reviews.insert("Pulp Fiction", "Masterpiece."); |
| 139 | +movie_reviews.insert("The Godfather", "Very enjoyable."); |
| 140 | +movie_reviews.insert("The Blues Brothers", "Eye lyked it alot."); |
| 141 | + |
| 142 | +// check for a specific one. |
| 143 | +if !movie_reviews.contains_key("Les Misérables") { |
| 144 | + println!("We've got {} reviews, but Les Misérables ain't one.", |
| 145 | + movie_reviews.len()); |
| 146 | +} |
| 147 | + |
| 148 | +// oops, this review has a lot of spelling mistakes, let's delete it. |
| 149 | +movie_reviews.remove("The Blues Brothers"); |
| 150 | + |
| 151 | +// look up the values associated with some keys. |
| 152 | +let to_find = ["Up!", "Office Space"]; |
| 153 | +for book in &to_find { |
| 154 | + match movie_reviews.get(book) { |
| 155 | + Some(entry) => println!("{}: {}", book, entry.value()), |
| 156 | + None => println!("{} is unreviewed.", book) |
| 157 | + } |
| 158 | +} |
| 159 | + |
| 160 | +// iterate over everything. |
| 161 | +for entry in &movie_reviews { |
| 162 | + let movie = entry.key(); |
| 163 | + let review = entry.value(); |
| 164 | + println!("{}: \"{}\"", movie, review); |
| 165 | +} |
| 166 | +``` |
| 167 | + |
| 168 | +Take a look at [map.rs](https://github.com/stjepang/skiplist/blob/master/src/map.rs) to |
| 169 | +see the full interface of `SkipMap`. |
| 170 | + |
| 171 | +An interesting difference from `BTreeMap` is that methods like `insert` and `get` return |
| 172 | +an `Entry<'a, K, V>`, which is essentially just a reference-counted pointer to an entry in |
| 173 | +the skip list. Note that it is possible to hold an entry and remove it at the same time (you |
| 174 | +can even call `entry.remove()`), but the actual contents of the entry will not be destroyed |
| 175 | +before the last reference is dropped. |
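
The removal semantics described above can be illustrated with a small stand-in built only from the standard library. This is not the proposed `SkipMap`: it is a mutex-protected `BTreeMap` whose values are `Arc`s, with the `Arc` playing the role of an `Entry`. The `Review` type, its `dropped` flag, and `remove_while_held` are hypothetical names used only to observe when the contents are destroyed.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

// Hypothetical value type; `dropped` is set when the destructor runs.
struct Review {
    text: &'static str,
    dropped: Arc<AtomicBool>,
}

impl Drop for Review {
    fn drop(&mut self) {
        self.dropped.store(true, Ordering::SeqCst);
    }
}

/// Returns (destroyed right after removal, destroyed after the last handle is gone).
fn remove_while_held() -> (bool, bool) {
    let dropped = Arc::new(AtomicBool::new(false));
    let map: Mutex<BTreeMap<&str, Arc<Review>>> = Mutex::new(BTreeMap::new());
    let review = Review { text: "Masterpiece.", dropped: Arc::clone(&dropped) };
    map.lock().unwrap().insert("Pulp Fiction", Arc::new(review));

    // Take a reference-counted handle, standing in for an `Entry`.
    let entry = map.lock().unwrap().get("Pulp Fiction").cloned().unwrap();

    // Remove the element while the handle is still alive.
    map.lock().unwrap().remove("Pulp Fiction");
    let after_remove = dropped.load(Ordering::SeqCst);

    // The contents are still readable through the handle.
    assert_eq!(entry.text, "Masterpiece.");

    // Dropping the last handle finally destroys the contents.
    drop(entry);
    let after_drop = dropped.load(Ordering::SeqCst);

    (after_remove, after_drop)
}

fn main() {
    let (after_remove, after_drop) = remove_while_held();
    println!("destroyed right after remove: {}", after_remove);
    println!("destroyed after last handle dropped: {}", after_drop);
}
```

The real `Entry` is expected to differ in the details (it points at a skip list node and relies on epoch-based reclamation), but the observable behavior sketched here is the same: removal unlinks the element, and destruction waits for the last reference.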
| 176 | + |
| 177 | +### Performance |
| 178 | + |
| 179 | +It has already been mentioned that `SkipMap` will have a hard time competing with `BTreeMap` in |
| 180 | +single-threaded scenarios. |
| 181 | +Let's see that through a simple benchmark that just inserts a million pseudorandom |
| 182 | +numbers into a map. This is a very unscientific benchmark, but it should at least give |
| 183 | +us a feeling for how different map implementations fare against each other. |
| 184 | + |
| 185 | +Machine: Intel Core i7-5600U (2 physical cores, 4 logical cores) |
| 186 | + |
| 187 | +First, here's `BTreeMap` in three different scenarios: |
| 188 | + |
| 189 | +* [`BTreeMap<u64, u64>` (1 thread)](https://gist.github.com/stjepang/9b1bf73c2fdb0309cefda66b91f633dd): 315 ms |
| 190 | +* [`Mutex<BTreeMap<u64, u64>>` (1 thread)](https://gist.github.com/stjepang/437b82134b401d3fa2c9c439a003c1ea): 321 ms |
| 191 | +* [`Mutex<BTreeMap<u64, u64>>` (2 threads)](https://gist.github.com/stjepang/66000dfae15c8046b91ff3612c7d881f): 752 ms |
| 192 | + |
| 193 | +Notice how there is very little overhead of locking if only one thread is used. However, as soon as |
| 194 | +we add more threads, contended locking brings a huge penalty on performance. |
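
For readers who don't want to open the gists, here is a rough sketch of what the contended `Mutex<BTreeMap<u64, u64>>` benchmark might look like. It is an assumption-laden reconstruction, not the code behind the numbers above: the `xorshift` generator, the seeds, and the `bench_insert` helper are all hypothetical, and timings from it should not be compared with the figures in this RFC.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical xorshift64 generator; the linked gists may use a different one.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Performs `total / threads` inserts of pseudorandom keys on each of
/// `threads` threads, all sharing one mutex-protected map.
/// Returns the elapsed time and the number of insert calls performed.
fn bench_insert(threads: usize, total: usize) -> (Duration, usize) {
    let map = Arc::new(Mutex::new(BTreeMap::<u64, u64>::new()));
    let per_thread = total / threads;
    let start = Instant::now();

    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                // Give each thread its own non-zero seed.
                let mut state = 0x9E37_79B9_7F4A_7C15u64 ^ (t as u64 + 1);
                for _ in 0..per_thread {
                    let key = xorshift(&mut state);
                    // Every insert takes the lock, so threads serialize here.
                    map.lock().unwrap().insert(key, key);
                }
                per_thread
            })
        })
        .collect();

    let inserted: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (start.elapsed(), inserted)
}

fn main() {
    for &threads in &[1, 2] {
        let (elapsed, inserted) = bench_insert(threads, 1_000_000);
        println!("{} thread(s): {} inserts in {:?}", threads, inserted, elapsed);
    }
}
```

Because the lock is taken once per insert, the critical section covers the entire tree operation, which is why adding threads adds contention rather than throughput in this setup.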
| 195 | + |
| 196 | +But `SkipMap` doesn't suffer from the same problem. In fact, adding more threads improves performance: |
| 197 | + |
| 198 | +* [`SkipMap<u64, u64>` (1 thread)](https://gist.github.com/stjepang/1980ab811009e94f2adfe8b230c20047): 1028 ms |
| 199 | +* [`SkipMap<u64, u64>` (2 threads)](https://gist.github.com/stjepang/a3f8f6dddac56d43e7dbfb2928cd3bfe): 561 ms |
| 200 | + |
| 201 | +Let's also see some numbers for a mutex-protected `std::map` in C++: |
| 202 | + |
| 203 | +* [`std::map<uint64_t, uint64_t>` (1 thread)](https://gist.github.com/stjepang/6aa80020b6edac1f6ea9af518e4ad989): 881 ms |
| 204 | +* [`std::map<uint64_t, uint64_t>` (2 threads)](https://gist.github.com/stjepang/b172a4259c0439d2855bc68fd47b3ab7): 1127 ms |
| 205 | + |
| 206 | +And here are mutex-protected `TreeMap` and `ConcurrentSkipListMap` in Java: |
| 207 | + |
| 208 | +* [`TreeMap<long, long> (1 thread)`](https://gist.github.com/stjepang/3bc21528f5cf82ecd564778f8a861b11): 1211 ms |
| 209 | +* [`TreeMap<long, long> (2 threads)`](https://gist.github.com/stjepang/da69ad273ea2cf2e13b4322c0ea6bd74): 1409 ms |
| 210 | +* [`ConcurrentSkipListMap<long, long> (1 thread)`](https://gist.github.com/stjepang/f6f289c07759f47a72b0565fd6b992c7): 2181 ms |
| 211 | +* [`ConcurrentSkipListMap<long, long> (2 threads)`](https://gist.github.com/stjepang/74d1abc7230ad6e6dd0c4aec1f4cab4b): 1353 ms |
| 212 | + |
| 213 | +The bottom line is: in single-threaded scenarios `SkipMap` should be comparable in performance |
| 214 | +to any typical binary search tree (although not to a B-tree). As we add more threads, it seems to |
| 215 | +scale quite well. I don't have a machine with a high number of cores to test scalability more thoroughly, |
| 216 | +but these numbers seem promising so far nonetheless. |
| 217 | + |
| 218 | +### Iteration |
| 219 | + |
| 220 | +The skip list supports easy iteration. Note that when iterating over a `SkipMap` we hand out an `Entry` |
| 221 | +for each entry in it. Creating an entry involves incrementing its reference count, and when moving |
| 222 | +from one entry to another we also have to pin the current thread. This is a lot of reference count |
| 223 | +updating and a lot of pinning. |
| 224 | + |
| 225 | +Here are some benchmark numbers for iterating over a million randomly inserted entries: |
| 226 | + |
| 227 | +* `BTreeMap` (Rust): 18 ms |
| 228 | +* `SkipMap` (Rust): 113 ms |
| 229 | +* `std::map` (C++): 93 ms |
| 230 | +* `TreeMap` (Java): 41 ms |
| 231 | +* `ConcurrentSkipListMap` (Java): 32 ms |
| 232 | + |
| 233 | +Interesting observations: |
| 234 | + |
| 235 | +* Iteration over a `BTreeMap` is very fast - this shouldn't be surprising since adjacent |
| 236 | + elements are grouped into blocks. |
| 237 | + |
| 238 | +* `SkipMap` is the slowest map here. I've tried measuring how long iteration takes without |
| 239 | + reference count updating and without pinning, and it turns out to be around 95 ms. That is |
| 240 | + very similar to `std::map` in C++. So reference counting and pinning definitely incur |
| 241 | + a measurable cost, but it's not a *terrible* one. |
| 242 | + |
| 243 | +* Java is fast - even `TreeMap` is faster than `std::map` in C++. How is that possible? |
| 244 | + Well, the answer lies in the fact that Java's GC kicks in from time to time, moves |
| 245 | + allocated nodes in memory (it's a compacting GC), and tries to lay out linked nodes |
| 246 | + as close as possible, thus optimizing for cache efficiency. |
| 247 | + |
| 248 | +* Let's try tuning Java's GC by using option `-XX:NewSize=1024m`. This option sets the |
| 249 | + size of the new generation to 1024 MB (a huge number), which means compaction should |
| 250 | + never kick in. Indeed, iteration timings are much different now - with `TreeMap` it takes |
| 251 | + 124 ms and with `ConcurrentSkipListMap` it takes 110 ms. Now that's much closer to `SkipMap` |
| 252 | + and `std::map`. |
| 253 | + |
| 254 | +### The cost of reference counting in `Entry` |
| 255 | + |
| 256 | +When iterating over a skip list we use `Entry`s, which are essentially reference-counted |
| 257 | +pointers to skip list nodes. That means iterating over 100 elements incurs the |
| 258 | +cost of 100 atomic increments and 100 atomic decrements. |
| 259 | + |
| 260 | +Methods that insert, remove, or search for an element return `Entry`, which means |
| 261 | +they too incur some cost spent on incrementing and decrementing a node's reference count. |
| 262 | + |
| 263 | +The current skip list implementation doesn't provide alternative methods that avoid |
| 264 | +reference counting (i.e. avoid using `Entry`), but in the future we should discuss how |
| 265 | +to add them. Broadly speaking, there are three general alternatives to entries: |
| 266 | +clones, guards, and closures. Here's an illustration on the `SkipMap::get` method: |
| 267 | + |
| 268 | +```rust |
| 269 | +// Reference counting: return an `Entry`. |
| 270 | +// |
| 271 | +// This is the method signature we currently have. |
| 272 | +fn get(&self, k: &K) -> Option<Entry<K, V>>; |
| 273 | + |
| 274 | +// Alternative #1: return a clone of the element. |
| 275 | +// |
| 276 | +// This means we're paying the price of cloning, but that's |
| 277 | +// not a problem if cloning is cheap. |
| 278 | +fn get_clone(&self, k: &K) -> Option<V> where V: Clone; |
| 279 | + |
| 280 | +// Alternative #2: return a guarded reference to the element |
| 281 | +// keeping the thread pinned. |
| 282 | +// |
| 283 | +// The main drawback here is that the user must be careful |
| 284 | +// not to keep the guard live for too long, or else garbage |
| 285 | +// collection will get stuck. |
| 286 | +fn get_guard(&self, k: &K) -> Option<Guard<K, V>>; |
| 287 | + |
| 288 | +// Alternative #3: take a closure that does something with |
| 289 | +// the found element while the thread is still pinned. |
| 290 | +// |
| 291 | +// Again, the drawback here is that the user must be careful |
| 292 | +// not to keep the closure running for too long, or else |
| 293 | +// garbage collection will get stuck. |
| 294 | +fn get_with<F: FnOnce(&V)>(&self, k: &K, f: F); |
| 295 | +``` |
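
To give a feel for how the clone and closure alternatives would look at the call site, here is a minimal sketch on a lock-based stand-in. Everything in it is hypothetical: `LockedMap` is not part of the proposal, and a read lock merely plays the role that pinning would play in the skip list. The guard alternative is left out because it has no simple equivalent with the standard library's lock guards. Note that `get_with` here returns the closure's result, which is a small extension of the signature shown above.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Hypothetical stand-in for the map; holding the read lock stands in for pinning.
struct LockedMap<K, V> {
    inner: RwLock<BTreeMap<K, V>>,
}

impl<K: Ord, V> LockedMap<K, V> {
    fn new() -> Self {
        LockedMap { inner: RwLock::new(BTreeMap::new()) }
    }

    fn insert(&self, k: K, v: V) {
        self.inner.write().unwrap().insert(k, v);
    }

    // Alternative #1: copy the value out, so nothing stays borrowed.
    fn get_clone(&self, k: &K) -> Option<V>
    where
        V: Clone,
    {
        self.inner.read().unwrap().get(k).cloned()
    }

    // Alternative #3: run the closure while the element is protected.
    // The protection lasts exactly as long as the closure runs.
    fn get_with<R, F: FnOnce(&V) -> R>(&self, k: &K, f: F) -> Option<R> {
        self.inner.read().unwrap().get(k).map(f)
    }
}

fn main() {
    let reviews = LockedMap::new();
    reviews.insert("Pulp Fiction", String::from("Masterpiece."));

    // Cloning pays for a `String` allocation but returns an owned value.
    println!("{:?}", reviews.get_clone(&"Pulp Fiction"));

    // The closure borrows the value in place and returns only what it needs.
    println!("{:?}", reviews.get_with(&"Pulp Fiction", |review| review.len()));
}
```

The trade-off mirrors the comments in the signatures: `get_clone` is only attractive when `V` is cheap to clone, while `get_with` avoids the copy but makes the caller responsible for keeping the closure short.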
| 296 | + |
| 297 | +# Drawbacks |
| 298 | + |
| 299 | +Skip lists are not very exciting when it comes to performance. Hash tables, B-trees (Bw-Tree is |
| 300 | +a lock-free B-tree variant), and radix trees (ART - adaptive radix tree can be made concurrent) |
| 301 | +are usually more performant. However, these faster data structures are not as general as skip lists |
| 302 | +and have to make sacrifices by restricting the set of supported operations or by making the API less ergonomic. |
| 303 | + |
| 304 | +# Alternatives |
| 305 | + |
| 306 | +A few similar data structures could serve as alternatives: |
| 307 | + |
| 308 | +1. Adaptive radix tree (keys can only be byte arrays). |
| 309 | +2. Skip tree (moves elements in memory, thus constraining the API). |
| 310 | +3. Bw-Tree (moves elements in memory, thus constraining the API). |
| 311 | + |
| 312 | +# Unresolved questions |
| 313 | + |
| 314 | +* Should `Entry` be renamed to `Cursor`? |
| 315 | +* How do we make iteration faster by avoiding reference counting? |
| 316 | +* What alternatives to the `Entry` API do we need, and how do we incorporate them? |