Commit f8aa26d

Author: Stjepan Glavina

Lock-free skip list map/set (#27)

* Lock-free skip list map/set
* Elaborate on alternative map/set interfaces
* Add get_with as another alternative

1 parent 2e22d0e, commit f8aa26d

1 file changed

Lines changed: 316 additions & 0 deletions

text/2018-01-14-skiplist.md

@@ -0,0 +1,316 @@
1+
# Summary
2+
3+
Introduce crate `crossbeam-skiplist` containing a lock-free skip list.
4+
5+
The crate provides an ordered map and an ordered set akin to `BTreeMap` and `BTreeSet`.
6+
In the future, we might want to build other data structures on top of the
7+
skip list as well, for example a priority queue.
8+
9+
# Motivation
10+
11+
These are the first concurrent map and set data structures to be added to Crossbeam.
12+
13+
The skip list is often touted as a relatively easy data structure to make concurrent, or at least
14+
easy compared to other maps/sets. However, supporting the remove operation in a language without
15+
GC and coming up with an API that isn't overly restrictive is very difficult.
16+
17+
This crate aims to provide a concurrent map/set that is as powerful and
18+
ergonomic as `BTreeMap`/`BTreeSet`. The API must be reasonably easy to use and provide all
19+
operations one would expect from any other map/set without jumping through hoops.
20+
21+
A good example of that is Java - one can simply replace any use of [`TreeMap`] with
22+
[`ConcurrentSkipListMap`] and expect the code to *just work*. The concurrent map in Java is an
23+
almost perfect drop-in replacement for the non-concurrent one. Unfortunately, I don't think
24+
we can achieve quite the same thing in Rust - and the reason is that Java and Rust have very
25+
different memory models (e.g. Java doesn't have move semantics and pervasively allocates objects
26+
on the heap). With that said, I believe we can still model a lock-free map that mimics
27+
`BTreeMap` fairly closely.
28+
29+
Regarding performance, a skip list is fundamentally disadvantaged compared to a B-tree.
30+
Every node in a skip list is separately allocated on the heap, while a B-tree
31+
allocates nodes in large blocks, thus greatly improving cache utilization. The problem
32+
of scattered skip list nodes in memory can be somewhat mitigated using custom allocators
33+
(by trying to allocate adjacent nodes in a skip list as close as possible in memory), but
34+
typically with great difficulty and underwhelming results.
35+
36+
One can think of a B-tree as a kind of compacting garbage collector. Consider what
37+
happens when a B-tree block becomes full: a new block may be allocated
38+
and elements are redistributed among blocks as needed. This is reminiscent of compacting
39+
garbage collectors. Note that moving elements in memory makes concurrency more difficult in Rust:
40+
a thread cannot hold a reference to an element if another thread may potentially move it
41+
to a different location in memory at the same time.
42+
43+
Skip lists, however, allocate each node separately on the heap. A node contains a key, a value,
44+
and a tower of next-pointers. The node is never moved to a different location in memory - once
45+
allocated, it stays there until it is destroyed. This makes cache utilization worse,
46+
but also makes borrowing elements in the presence of concurrent modifications easier.
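
As a rough illustration of the node described above, here is a minimal sketch. The `Node` type, its field names, and the `Vec`-based tower are assumptions made for this example; the proposed implementation may lay the tower out differently.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// A skip list node: a key, a value, and a tower of next-pointers.
/// Hypothetical layout, for illustration only.
struct Node<K, V> {
    key: K,
    value: V,
    /// `tower[i]` points to the next node at level `i` (null means end of list).
    tower: Vec<AtomicPtr<Node<K, V>>>,
}

impl<K, V> Node<K, V> {
    /// Allocates the node on the heap. The node keeps this address
    /// until it is dropped, even if the owning `Box` is moved around.
    fn new(key: K, value: V, height: usize) -> Box<Self> {
        let tower = (0..height)
            .map(|_| AtomicPtr::new(ptr::null_mut()))
            .collect();
        Box::new(Node { key, value, tower })
    }

    /// Number of levels this node participates in.
    fn height(&self) -> usize {
        self.tower.len()
    }

    /// Reads the successor at `level`.
    fn next(&self, level: usize) -> *mut Node<K, V> {
        self.tower[level].load(Ordering::Acquire)
    }
}
```

Because the node itself never changes address, a raw pointer or reference to it stays meaningful while other threads relink the towers around it.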
47+
48+
Long story short, a lock-free skip list will scale much better than a mutex-protected `BTreeMap`,
49+
but in single-threaded scenarios it will have no chance competing with `BTreeMap` due to poorer
50+
cache utilization.
51+
52+
[`ConcurrentSkipListMap`]: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentSkipListMap.html
53+
[`TreeMap`]: https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html
54+
55+
### Previous work
56+
57+
Notable implementations of concurrent skip lists in other languages:
58+
59+
1. [java.util.concurrent](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/concurrent/ConcurrentSkipListMap.java#ConcurrentSkipListMap) (Java):
60+
Has the most extensive API - and really feels like a drop-in replacement for any other map.
61+
However, the implementation is not particularly efficient and has a few interesting quirks.
62+
For example, each pointer in a tower is separately allocated. Another one is - instead of
63+
tagging pointers to mark a node as deleted, a dummy successor node is allocated.
64+
65+
2. [libcds](https://github.com/khizmax/libcds/blob/19af81b7c61480ed705b91b4d01ee5d717a97cd2/cds/intrusive/skip_list_rcu.h) (C++):
66+
Fairly complete and general API (one can even choose between no GC, EBR, and HP).
67+
68+
3. [RocksDB](https://github.com/facebook/rocksdb/blob/68829ed89cec64186557dc0860fc693c118ff1c6/memtable/skiplist.h) (C++):
69+
The skip list supports neither removal nor multiple concurrent inserts. However, an ongoing
70+
insert operation does not block other threads from reading the skip list. Once the skip list
71+
becomes full, it is flushed to disk-based storage and a new skip list is constructed to replace
72+
the old one.
73+
74+
4. [Folly](https://github.com/facebook/folly/blob/98d1077ce0603b0713353d638cb1436a28827af6/folly/ConcurrentSkipList.h) (C++):
75+
A concurrent skip list, but not lock-free: it uses per-node locking.
76+
Also, removed nodes are not freed until the skip list is destroyed.
77+
78+
5. [libgee](https://github.com/GNOME/libgee/blob/da95e830524ffa309eb57925320666e5085b9d66/gee/concurrentset.vala) (Vala):
79+
A hazard pointer-based skip list. Looks very interesting.
80+
81+
There are also several concurrent skip lists in Rust, but none of them have been published to crates.io so far
82+
and they all look like works in progress:
83+
84+
1. [danburkert/pawn](https://github.com/danburkert/pawn/blob/8b6806d944d830f552d496cd3ee605d1707fdc51/src/util/skip_list.rs) (Rust):
85+
A rather old insert-only lock-free skip list. Looks like an abandoned project.
86+
87+
2. [Vtec234/lists-rs](https://github.com/Vtec234/lists-rs/blob/f83e516039dc4a421172af1cdbdcec85b0e73d74/src/epoch_skiplist.rs) (Rust):
88+
A lock-free skip list that supports remove and uses Crossbeam for memory reclamation.
89+
Interestingly, keys are always hashed so it's technically a hash map.
90+
91+
3. [boats/skiplist](https://gitlab.com/boats/skiplist/tree/master/src/skiplist) (Rust):
92+
Insert-only lock-free skip list by withoutboats. Published very recently.
93+
94+
# Detailed design
95+
96+
The proposed implementation is currently residing in [stjepang/skiplist](https://github.com/stjepang/skiplist),
97+
but will be moved into a new repository `crossbeam-rs/crossbeam-skiplist`. It is a
98+
lock-free skip list using epoch-based memory reclamation from `crossbeam-epoch`.
99+
100+
The implementation is based on the following work:
101+
102+
1. [Practical lock-freedom](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf)
103+
(see *4.3.3 CAS-based design*)
104+
105+
2. [Linked Lists: Locking, Lock-Free and Beyond...](http://janvitek.org/events/TiC06/B-SLIDES/mh2.pdf)
106+
107+
The codebase consists of three main source files:
108+
109+
* [`base.rs`](https://github.com/stjepang/skiplist/blob/master/src/base.rs) -
110+
Contains the base skip list implementation details. This file does not attempt to
111+
expose something ergonomic, but instead aims to provide a skip list 'engine' that
112+
is intended to be wrapped into a nicer interface.
113+
114+
* [`map.rs`](https://github.com/stjepang/skiplist/blob/master/src/map.rs) -
115+
Wraps the base implementation into a map interface similar to `BTreeMap`.
116+
117+
* [`set.rs`](https://github.com/stjepang/skiplist/blob/master/src/set.rs) -
118+
Wraps the base implementation into a set interface similar to `BTreeSet`.
119+
120+
**Note:** These map and set wrappers are just tentative interfaces - they're not
121+
finished, and there's a possibility we'll want to completely change them.
122+
For now, consider them just a proof of concept.
123+
124+
## Tentative map API
125+
126+
Here's a quick demo. The following code is the [first example](https://doc.rust-lang.org/std/collections/struct.BTreeMap.html#examples)
127+
taken from `BTreeMap`'s documentation, except it uses `SkipMap` instead of `BTreeMap`.
128+
A few other minor changes were required to make it compile, but other than that it doesn't depart
129+
too far from the original:
130+
131+
```rust
// Assumes the crate is published as `crossbeam-skiplist`, as proposed above.
use crossbeam_skiplist::SkipMap;

fn main() {
    // type inference lets us omit an explicit type signature (which
    // would be `SkipMap<&str, &str>` in this example).
    let movie_reviews = SkipMap::new();

    // review some movies.
    movie_reviews.insert("Office Space", "Deals with real issues in the workplace.");
    movie_reviews.insert("Pulp Fiction", "Masterpiece.");
    movie_reviews.insert("The Godfather", "Very enjoyable.");
    movie_reviews.insert("The Blues Brothers", "Eye lyked it alot.");

    // check for a specific one.
    if !movie_reviews.contains_key("Les Misérables") {
        println!("We've got {} reviews, but Les Misérables ain't one.",
                 movie_reviews.len());
    }

    // oops, this review has a lot of spelling mistakes, let's delete it.
    movie_reviews.remove("The Blues Brothers");

    // look up the values associated with some keys.
    let to_find = ["Up!", "Office Space"];
    for movie in &to_find {
        match movie_reviews.get(movie) {
            Some(entry) => println!("{}: {}", movie, entry.value()),
            None => println!("{} is unreviewed.", movie)
        }
    }

    // iterate over everything.
    for entry in &movie_reviews {
        let movie = entry.key();
        let review = entry.value();
        println!("{}: \"{}\"", movie, review);
    }
}
```
167+
168+
Take a look at [map.rs](https://github.com/stjepang/skiplist/blob/master/src/map.rs) to
169+
see the full interface of `SkipMap`.
170+
171+
An interesting difference from `BTreeMap` is that methods like `insert` and `get` return
172+
an `Entry<'a, K, V>`, which is essentially just a reference-counted pointer to an entry in
173+
the skip list. Note that it is possible to hold an entry and remove it at the same time (you
174+
can even call `entry.remove()`), but the actual contents of the entry will not be destroyed
175+
before the last reference is dropped.
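
The "remove while still holding an entry" behaviour can be illustrated with a std-only analogy. This is not how `SkipMap` is implemented: the hypothetical `RcMap` below is a mutex-protected `BTreeMap` whose values are wrapped in `Arc`, used here only to show that a held handle keeps the contents alive after removal.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// Toy map with reference-counted values (analogy only, not lock-free).
struct RcMap<K, V> {
    inner: Mutex<BTreeMap<K, Arc<V>>>,
}

impl<K: Ord, V> RcMap<K, V> {
    fn new() -> Self {
        RcMap { inner: Mutex::new(BTreeMap::new()) }
    }

    /// Inserts a value and returns a handle to it.
    fn insert(&self, k: K, v: V) -> Arc<V> {
        let handle = Arc::new(v);
        self.inner.lock().unwrap().insert(k, Arc::clone(&handle));
        handle
    }

    /// Returns a handle; this bumps the reference count.
    fn get(&self, k: &K) -> Option<Arc<V>> {
        self.inner.lock().unwrap().get(k).cloned()
    }

    /// Unlinks the value from the map. The contents are destroyed only
    /// once every outstanding handle has been dropped.
    fn remove(&self, k: &K) -> Option<Arc<V>> {
        self.inner.lock().unwrap().remove(k)
    }
}
```

A caller that obtained a handle via `get` can keep reading through it after another caller has run `remove` on the same key; the value is freed when the last handle goes away.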
176+
177+
### Performance
178+
179+
It has already been mentioned that `SkipMap` will have a hard time competing with `BTreeMap` in
180+
single-threaded scenarios.
181+
Let's see that through a simple benchmark that just inserts a million pseudorandom
182+
numbers into a map. This is a very unscientific benchmark, but it should at least give
183+
us a feeling for how different map implementations fare against each other.
184+
185+
Machine: Intel Core i7-5600U (2 physical cores, 4 logical cores)
186+
187+
First, here's `BTreeMap` in three different scenarios:
188+
189+
* [`BTreeMap<u64, u64>` (1 thread)](https://gist.github.com/stjepang/9b1bf73c2fdb0309cefda66b91f633dd): 315 ms
190+
* [`Mutex<BTreeMap<u64, u64>>` (1 thread)](https://gist.github.com/stjepang/437b82134b401d3fa2c9c439a003c1ea): 321 ms
191+
* [`Mutex<BTreeMap<u64, u64>>` (2 threads)](https://gist.github.com/stjepang/66000dfae15c8046b91ff3612c7d881f): 752 ms
192+
193+
Notice how there is very little overhead of locking if only one thread is used. However, as soon as
194+
we add more threads, contended locking brings a huge penalty on performance.
195+
196+
But `SkipMap` doesn't suffer from the same problem. In fact, adding more threads improves performance:
197+
198+
* [`SkipMap<u64, u64>` (1 thread)](https://gist.github.com/stjepang/1980ab811009e94f2adfe8b230c20047): 1028 ms
199+
* [`SkipMap<u64, u64>` (2 threads)](https://gist.github.com/stjepang/a3f8f6dddac56d43e7dbfb2928cd3bfe): 561 ms
200+
201+
Let's also see some numbers for a mutex-protected `std::map` in C++:
202+
203+
* [`std::map<uint64_t, uint64_t>` (1 thread)](https://gist.github.com/stjepang/6aa80020b6edac1f6ea9af518e4ad989): 881 ms
204+
* [`std::map<uint64_t, uint64_t>` (2 threads)](https://gist.github.com/stjepang/b172a4259c0439d2855bc68fd47b3ab7): 1127 ms
205+
206+
And here are a mutex-protected `TreeMap` and a `ConcurrentSkipListMap` in Java:
207+
208+
* [`TreeMap<long, long> (1 thread)`](https://gist.github.com/stjepang/3bc21528f5cf82ecd564778f8a861b11): 1211 ms
209+
* [`TreeMap<long, long> (2 threads)`](https://gist.github.com/stjepang/da69ad273ea2cf2e13b4322c0ea6bd74): 1409 ms
210+
* [`ConcurrentSkipListMap<long, long> (1 thread)`](https://gist.github.com/stjepang/f6f289c07759f47a72b0565fd6b992c7): 2181 ms
211+
* [`ConcurrentSkipListMap<long, long> (2 threads)`](https://gist.github.com/stjepang/74d1abc7230ad6e6dd0c4aec1f4cab4b): 1353 ms
212+
213+
The bottom line is: in single-threaded scenarios `SkipMap` should be comparable in performance
214+
to any typical binary search tree (although not to a B-tree). As we add more threads, it seems to
215+
scale quite well. I don't have a machine with a high number of cores to test scalability more thoroughly,
216+
but these numbers seem promising so far nonetheless.
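
For readers who want to try something similar, the mutex-protected `BTreeMap` scenario could be sketched roughly as follows. This is not the code from the linked gists: the function name, the xorshift generator, and the seeding scheme are assumptions made for this example, and timings will vary by machine.

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

/// One xorshift64 step; a stand-in pseudorandom generator (an assumption,
/// not necessarily what the original benchmarks used).
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Inserts about `total` pseudorandom keys into a mutex-protected `BTreeMap`
/// using `threads` threads. Returns the final map size and the elapsed time.
fn bench_mutex_btreemap(threads: usize, total: usize) -> (usize, Duration) {
    let map = Mutex::new(BTreeMap::new());
    let start = Instant::now();
    thread::scope(|s| {
        for t in 0..threads {
            let map = &map;
            s.spawn(move || {
                // Give every thread its own nonzero seed.
                let mut state = 0x9E37_79B9_7F4A_7C15u64 ^ (t as u64 + 1);
                for _ in 0..total / threads {
                    let key = xorshift(&mut state);
                    // Every insert contends for the same lock.
                    map.lock().unwrap().insert(key, key);
                }
            });
        }
    });
    let elapsed = start.elapsed();
    let len = map.lock().unwrap().len();
    (len, elapsed)
}
```

Calling `bench_mutex_btreemap(1, 1_000_000)` and `bench_mutex_btreemap(2, 1_000_000)` gives a rough feel for the contention penalty discussed above.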
217+
218+
### Iteration
219+
220+
The skip list supports easy iteration. Note that when iterating over a `SkipMap` we hand out an `Entry`
221+
for each entry in it. Creating an entry involves incrementing its reference count, and when moving
222+
from one entry to another we also have to pin the current thread. This is a lot of reference count
223+
updating and a lot of pinning.
224+
225+
Here are some benchmark numbers for iterating over a million randomly inserted entries:
226+
227+
* `BTreeMap` (Rust): 18 ms
228+
* `SkipMap` (Rust): 113 ms
229+
* `std::map` (C++): 93 ms
230+
* `TreeMap` (Java): 41 ms
231+
* `ConcurrentSkipListMap` (Java): 32 ms
232+
233+
Interesting observations:
234+
235+
* Iteration over a `BTreeMap` is very fast - this shouldn't be surprising since adjacent
236+
elements are grouped into blocks.
237+
238+
* `SkipMap` is the slowest map here. I've tried measuring how long iteration takes without
239+
reference count updating and without pinning, and it turns out to be around 95 ms. That is
240+
very similar to `std::map` in C++. Also, reference counting and pinning definitely incur
241+
a measurable cost, but it's not a *terrible* one.
242+
243+
* Java is fast - even `TreeMap` is faster than `std::map` in C++. How is that possible?
244+
Well, the answer lies in the fact that Java's GC kicks in from time to time, moves
245+
allocated nodes in memory (it's a compacting GC), and tries to lay out linked nodes
246+
as close as possible, thus optimizing for cache efficiency.
247+
248+
* Let's try tuning Java's GC by using option `-XX:NewSize=1024m`. This option sets the
249+
size of the new generation to 1024 MB (a huge number), which means compaction should
250+
never kick in. Indeed, iteration timings are much different now - with `TreeMap` it takes
251+
124 ms and with `ConcurrentSkipListMap` it takes 110 ms. Now that's much closer to `SkipMap`
252+
and `std::map`.
253+
254+
### The cost of reference counting in `Entry`
255+
256+
When iterating over a skip list we use `Entry`s, which are essentially reference-counted
257+
pointers to skip list nodes. That means iterating over 100 elements incurs the
258+
cost of 100 atomic increments and 100 atomic decrements.
259+
260+
Methods that insert, remove, or search for an element return `Entry`, which means
261+
they too incur some cost spent on incrementing and decrementing a node's reference count.
262+
263+
The current skip list implementation doesn't provide alternative methods that avoid
264+
reference counting (i.e. avoid using `Entry`), but in the future we should discuss how
265+
to add them. Broadly speaking, there are three general alternatives to entries:
266+
clones, guards, and closures. Here's an illustration on the `SkipMap::get` method:
267+
268+
```rust
// Reference counting: return an `Entry`.
//
// This is the method signature we currently have.
fn get(&self, k: &K) -> Option<Entry<K, V>>;

// Alternative #1: return a clone of the element.
//
// This means we're paying the price of cloning, but that's
// not a problem if cloning is cheap.
fn get_clone(&self, k: &K) -> Option<V> where V: Clone;

// Alternative #2: return a guarded reference to the element
// keeping the thread pinned.
//
// The main drawback here is that the user must be careful
// not to keep the guard live for too long, or else garbage
// collection will get stuck.
fn get_guard(&self, k: &K) -> Option<Guard<K, V>>;

// Alternative #3: take a closure that does something with
// the found element while the thread is still pinned.
//
// Again, the drawback here is that the user must be careful
// not to keep the closure running for too long, or else
// garbage collection will get stuck.
fn get_with<F: FnOnce(&V)>(&self, k: &K, f: F);
```
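
To make the clone and closure alternatives more concrete, here is a small std-only sketch. The `LockedMap` type is hypothetical, and its read lock merely stands in for pinning the thread; the real skip list would not take a lock. Note that `get_with` here is a variation that returns the closure's result, unlike the signature above.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

/// Toy stand-in for a concurrent map; the read lock plays the role
/// that pinning would play in the real skip list.
struct LockedMap<K, V> {
    inner: RwLock<BTreeMap<K, V>>,
}

impl<K: Ord, V> LockedMap<K, V> {
    fn new() -> Self {
        LockedMap { inner: RwLock::new(BTreeMap::new()) }
    }

    fn insert(&self, k: K, v: V) {
        self.inner.write().unwrap().insert(k, v);
    }

    /// Alternative #1: hand back a clone, so nothing borrowed escapes.
    fn get_clone(&self, k: &K) -> Option<V>
    where
        V: Clone,
    {
        self.inner.read().unwrap().get(k).cloned()
    }

    /// Alternative #3: run a closure on the element while it is protected.
    /// The protection is held for as long as the closure runs, so the
    /// closure should be short.
    fn get_with<R>(&self, k: &K, f: impl FnOnce(&V) -> R) -> Option<R> {
        self.inner.read().unwrap().get(k).map(f)
    }
}
```

With this shape, `map.get_with(&key, |v| v.len())` reads a property of the value without cloning it and without any reference count traffic, at the cost of keeping the protection held for the duration of the closure.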
296+
297+
# Drawbacks
298+
299+
Skip lists are not very exciting when it comes to performance. Hash tables, B-trees (Bw-Tree is
300+
a lock-free B-tree variant), and radix trees (ART - adaptive radix tree can be made concurrent)
301+
are usually more performant. However, these faster data structures are not as general as skip lists
302+
and have to make sacrifices by restricting the set of supported operations or by making the API less ergonomic.
303+
304+
# Alternatives
305+
306+
A few similar data structures that could serve as alternatives:
307+
308+
1. Adaptive radix tree (keys can only be byte arrays).
309+
2. Skip tree (moves elements in memory, thus constraining the API).
310+
3. Bw-Tree (moves elements in memory, thus constraining the API).
311+
312+
# Unresolved questions
313+
314+
* Should `Entry` be renamed to `Cursor`?
315+
* How do we make iteration faster by avoiding reference counting?
316+
* What alternatives to the `Entry` API do we need, and how do we incorporate them?
