Commit a6db00e

Merge pull request #1456 from spacejam/project_bloodstone
Project Bloodstone storage engine alpha
2 parents 005c023 + 22d910e commit a6db00e

125 files changed

Lines changed: 10443 additions & 28755 deletions

.github/workflows/test.yml

Lines changed: 2 additions & 2 deletions
@@ -51,7 +51,7 @@ jobs:
       run: |
         rustup update --no-self-update
         cargo test --release --no-default-features --features=for-internal-testing-only -- --nocapture
-    - uses: actions/upload-artifact@v2
+    - uses: actions/upload-artifact@v4
       if: ${{ failure() && runner.os == 'linux' }}
       with:
         name: linux-core-dumps
@@ -134,7 +134,7 @@ jobs:
         echo ""
         echo "all backtraces:"
         gdb target/release/stress2 core-dumps/* -batch -ex 't a a bt -frame-info source-and-location'
-    - uses: actions/upload-artifact@v2
+    - uses: actions/upload-artifact@v4
       if: ${{ failure() }}
       with:
         name: linux-core-dumps

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,8 @@
+fuzz-*.log
 default.sled
-crash_*
+timing_test*
 *db
+crash_test_files
 *conf
 *snap.*
 *grind.out*

ARCHITECTURE.md

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+<table style="width:100%">
+<tr>
+<td>
+<table style="width:100%">
+<tr>
+<td> key </td>
+<td> value </td>
+</tr>
+<tr>
+<td><a href="https://github.com/sponsors/spacejam">buy a coffee for us to convert into databases</a></td>
+<td><a href="https://github.com/sponsors/spacejam"><img src="https://img.shields.io/github/sponsors/spacejam"></a></td>
+</tr>
+<tr>
+<td><a href="https://docs.rs/sled">documentation</a></td>
+<td><a href="https://docs.rs/sled"><img src="https://docs.rs/sled/badge.svg"></a></td>
+</tr>
+<tr>
+<td><a href="https://discord.gg/Z6VsXds">chat about databases with us</a></td>
+<td><a href="https://discord.gg/Z6VsXds"><img src="https://img.shields.io/discord/509773073294295082.svg?logo=discord"></a></td>
+</tr>
+</table>
+</td>
+<td>
+<p align="center">
+<img src="https://raw.githubusercontent.com/spacejam/sled/main/art/tree_face_anti-transphobia.png" width="40%" height="auto" />
+</p>
+</td>
+</tr>
+</table>
+
+# sled 1.0 architecture
+
+## in-memory
+
+* Lock-free B+ tree index, extracted into the [`concurrent-map`](https://github.com/komora-io/concurrent-map) crate.
+* The lowest key from each leaf is stored in this in-memory index.
+* To read any leaf that is not already cached in memory, at most one disk read will be required.
+* RwLock-backed leaves, using the ArcRwLock from the [`parking_lot`](https://github.com/Amanieu/parking_lot) crate. As a `Db` grows, leaf contention tends to go down in most use cases. But this may be revisited over time if many users have issues with RwLock-related contention. Avoiding full RCU for updates on the leaves results in many of the performance benefits over sled 0.34, with significantly lower memory pressure.
+* A simple but very high performance epoch-based reclamation technique is used for safely deferring frees of in-memory index data and reuse of on-disk heap slots, extracted into the [`ebr`](https://github.com/komora-io/ebr) crate.
+* A scan-resistant LRU is used for handling eviction. By default, 20% of the cache is reserved for leaves that are accessed at most once. This is configurable via `Config.entry_cache_percent`. This is handled by the extracted [`cache-advisor`](https://github.com/komora-io/cache-advisor) crate. The overall cache size is set by the `Config.cache_size` configurable.
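
The low-key index described in the in-memory notes can be illustrated with a small sketch. It is not sled's code: it swaps the lock-free `concurrent-map` tree for a plain `std::collections::BTreeMap`, and `LowKeyIndex`, `LeafId`, `insert_leaf`, and `leaf_for` are hypothetical names. It only shows why storing the lowest key of each leaf is enough to route any key to exactly one leaf.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

/// Hypothetical leaf identifier (an assumption, not sled's actual type).
type LeafId = u64;

/// Index that maps the lowest key of each leaf to that leaf.
struct LowKeyIndex {
    leaves: BTreeMap<Vec<u8>, LeafId>,
}

impl LowKeyIndex {
    fn new() -> Self {
        let mut leaves = BTreeMap::new();
        // The first leaf has the empty key as its low key,
        // so every possible key is covered by some leaf.
        leaves.insert(Vec::new(), 0);
        LowKeyIndex { leaves }
    }

    /// Registers a leaf whose smallest key is `low_key`.
    fn insert_leaf(&mut self, low_key: &[u8], id: LeafId) {
        self.leaves.insert(low_key.to_vec(), id);
    }

    /// The leaf responsible for `key` is the one with the
    /// greatest low key that is less than or equal to `key`.
    fn leaf_for(&self, key: &[u8]) -> LeafId {
        let (_, id) = self
            .leaves
            .range::<[u8], _>((Bound::Unbounded, Bound::Included(key)))
            .next_back()
            .expect("the empty low key covers every possible key");
        *id
    }
}
```

With leaves whose low keys are `""`, `"m"`, and `"t"`, a lookup for `"pear"` lands on the `"m"` leaf. If that leaf is not cached, a single read of its heap slot is all that is needed.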
+
+## write path
+
+* This is where things get interesting. There is no traditional WAL. There is no LSM. Only metadata is logged atomically after objects are written in parallel.
+* The important guarantees are:
+  * all previous writes are durable after a call to `Db::flush` (This is also called periodically in the background by a flusher thread)
+  * all write batches written using `Db::apply_batch` are either 100% visible or 0% visible after crash recovery. If it was followed by a flush that returned `Ok(())` it is guaranteed to be present.
+* Atomic ([linearizable](https://jepsen.io/consistency/models/linearizable)) durability is provided by marking dirty leaves as participants in "flush epochs" and performing atomic batch writes of the full epoch at a time, in order. Each call to `Db::flush` advances the current flush epoch by 1.
+* The atomic write consists in the following steps:
+  1. User code or the background flusher thread calls `Db::flush`.
+  1. In parallel (via [rayon](https://docs.rs/rayon)) serialize and compress each dirty leaf with zstd (configurable via `Config.zstd_compression_level`).
+  1. Based on the size of the bytes for each object, choose the smallest heap file slot that can hold the full set of bytes. This is an on-disk slab allocator.
+  1. Slab slots are not power-of-two sized, but tend to increase in size by around 20% from one to the next, resulting in far lower fragmentation than typical page-oriented heaps with either constant-size or power-of-two sized leaves.
+  1. Write the object to the allocated slot from the rayon threadpool.
+  1. After all writes, fsync the heap files that were written to.
+  1. If any writes were written to the end of the heap file, causing it to grow, fsync the directory that stores all heap files.
+  1. After the writes are stable, it is now safe to write an atomic metadata batch that records the location of each written leaf in the heap. This is a simple framed batch of `(low_key, slab_slot)` tuples that are initially written to a log, but eventually merged into a simple snapshot file for the metadata store once the log becomes larger than the snapshot file.
+  1. Fsync of the metadata log file.
+  1. Fsync of the metadata log directory.
+  1. After the atomic metadata batch write, the previously occupied slab slots are marked for future reuse with the epoch-based reclamation system. After all threads that may have witnessed the previous location have finished their work, the slab slot is added to the free `BinaryHeap` of the slot that it belongs to so that it may be reused in future atomic write batches.
+  1. Return `Ok(())` to the caller of `Db::flush`.
+* Writing objects before the metadata write is random, but modern SSDs handle this well. Even though the SSD's FTL will be working harder to defragment things periodically than if we wrote a few megabytes sequentially with each write, the data that the FTL will be copying will be mostly live due to the eager leaf write-backs.
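
The slab allocation step in the write path can be sketched in memory. This is a minimal sketch under stated assumptions, not sled's implementation: the 64-byte starting size, the exact rounding of the 20% growth, and the names `slab_sizes`, `Slab`, `Heap`, `allocate`, and `free` are all hypothetical. Reusing the lowest free slot first (via `Reverse`) is also an assumption. The text only says that freed slots go into a per-slab `BinaryHeap`.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical size classes: start at 64 bytes, grow by about 20% per class.
fn slab_sizes(classes: usize) -> Vec<u64> {
    let mut sizes = Vec::with_capacity(classes);
    let mut size: u64 = 64;
    for _ in 0..classes {
        sizes.push(size);
        // Multiply by 6/5 in integer arithmetic, rounding up.
        size = (size * 6 + 4) / 5;
    }
    sizes
}

struct Slab {
    slot_size: u64,
    /// Freed slot indexes; `Reverse` makes the lowest index pop first.
    free: BinaryHeap<Reverse<u64>>,
    /// Next never-used slot index at the end of this slab's file.
    next_fresh: u64,
}

struct Heap {
    slabs: Vec<Slab>,
}

impl Heap {
    fn new(classes: usize) -> Self {
        let slabs = slab_sizes(classes)
            .into_iter()
            .map(|slot_size| Slab { slot_size, free: BinaryHeap::new(), next_fresh: 0 })
            .collect();
        Heap { slabs }
    }

    /// Picks the smallest size class that fits `len` bytes and returns
    /// `(slab index, slot index)`, or `None` if no class is large enough.
    fn allocate(&mut self, len: u64) -> Option<(usize, u64)> {
        // Size classes are sorted ascending, so the first fit is the smallest fit.
        let slab_idx = self.slabs.iter().position(|s| s.slot_size >= len)?;
        let slab = &mut self.slabs[slab_idx];
        let slot = match slab.free.pop() {
            Some(Reverse(slot)) => slot,
            None => {
                // No free slot: grow the slab by one slot.
                let fresh = slab.next_fresh;
                slab.next_fresh += 1;
                fresh
            }
        };
        Some((slab_idx, slot))
    }

    /// Returns a slot to its slab. In the described design this happens only
    /// after epoch-based reclamation says no reader can still see the slot.
    fn free(&mut self, slab_idx: usize, slot: u64) {
        self.slabs[slab_idx].free.push(Reverse(slot));
    }
}
```

Under these assumed size classes, a 70-byte object goes into the 77-byte class and wastes 7 bytes. With power-of-two classes it would occupy a 128-byte slot.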
+
+## recovery
+
+* Recovery involves simply reading the atomic metadata store that records the low key for each written leaf as well as its location and mapping it into the in-memory index. Any gaps in the slabs are then used as free slots.
+* Any write that failed to complete its entire atomic writebatch is treated as if it never happened, because no user-visible flush ever returned successfully.
+* Rayon is also used here for parallelizing reads of this metadata. In general, this is extremely fast compared to the previous sled recovery process.
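
The recovery notes above can be sketched as a replay over complete metadata batches. This is an illustration under stated assumptions, not sled's on-disk format: `Location`, `Batch`, and `recover` are hypothetical names, the input is assumed to contain only batches whose frames were fully written, and the sketch is sequential where the real process is described as parallel.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical leaf location: (slab index, slot index).
type Location = (usize, u64);

/// One complete atomic metadata batch of (low_key, location) pairs.
type Batch = Vec<(Vec<u8>, Location)>;

/// Rebuilds the low-key index and the per-slab free slots from metadata batches.
fn recover(batches: &[Batch]) -> (BTreeMap<Vec<u8>, Location>, BTreeMap<usize, Vec<u64>>) {
    // Replay in order: a later batch overrides an earlier location for the same low key.
    let mut index = BTreeMap::new();
    for batch in batches {
        for (low_key, location) in batch {
            index.insert(low_key.clone(), *location);
        }
    }

    // Collect the slots that are still referenced, grouped by slab.
    let mut live: BTreeMap<usize, BTreeSet<u64>> = BTreeMap::new();
    for &(slab, slot) in index.values() {
        live.entry(slab).or_default().insert(slot);
    }

    // Any unreferenced slot below the highest live slot is a gap and can be reused.
    let mut free = BTreeMap::new();
    for (slab, slots) in &live {
        let highest = *slots.iter().next_back().expect("each slab entry has a live slot");
        let gaps: Vec<u64> = (0..highest).filter(|slot| !slots.contains(slot)).collect();
        free.insert(*slab, gaps);
    }

    (index, free)
}
```

A batch that never made it into the metadata log does not appear in the input at all, so the leaves it wrote to the heap are simply unreferenced and their slots show up as gaps.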
+
+## tuning
+
+* The larger the `LEAF_FANOUT` const generic on the high-level `Db` struct (default `1024`), the smaller the in-memory leaf index and the better the compression ratio of the on-disk file, but the more expensive it will be to read the entire leaf off of disk and decompress it.
+* You can choose to turn the `LEAF_FANOUT` relatively low to make the system behave more like an Index+Log architecture, but overall disk size will grow and write performance will decrease.
+* NB: changing `LEAF_FANOUT` after writing data is not supported.
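
The effect of `LEAF_FANOUT` on the size of the in-memory index can be made concrete with a little arithmetic. The `Db` type below is a hypothetical stand-in with a const generic of the same name, not sled's `Db`, and `min_leaves` assumes every leaf is completely full, so it is only a lower bound.

```rust
/// Hypothetical stand-in for a database type with a const-generic fanout.
struct Db<const LEAF_FANOUT: usize = 1024>;

impl<const LEAF_FANOUT: usize> Db<LEAF_FANOUT> {
    /// Fewest leaves (and therefore index entries) needed to hold `items` keys,
    /// assuming every leaf is filled to `LEAF_FANOUT`.
    fn min_leaves(items: usize) -> usize {
        (items + LEAF_FANOUT - 1) / LEAF_FANOUT
    }
}
```

For one million keys, `Db::<1024>::min_leaves(1_000_000)` is 977 while `Db::<64>::min_leaves(1_000_000)` is 15625. The lower fanout needs 16 times as many index entries, in exchange for smaller leaves to read and decompress.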

Cargo.toml

Lines changed: 46 additions & 43 deletions
@@ -1,70 +1,73 @@
 [package]
 name = "sled"
-version = "0.34.7"
-authors = ["Tyler Neely <t@jujit.su>"]
+version = "1.0.0-alpha.124"
+edition = "2021"
+authors = ["Tyler Neely <tylerneely@gmail.com>"]
+documentation = "https://docs.rs/sled/"
 description = "Lightweight high-performance pure-rust transactional embedded database."
 license = "MIT OR Apache-2.0"
 homepage = "https://github.com/spacejam/sled"
 repository = "https://github.com/spacejam/sled"
 keywords = ["redis", "mongo", "sqlite", "lmdb", "rocksdb"]
 categories = ["database-implementations", "concurrency", "data-structures", "algorithms", "caching"]
-documentation = "https://docs.rs/sled/"
 readme = "README.md"
-edition = "2018"
 exclude = ["benchmarks", "examples", "bindings", "scripts", "experiments"]
 
-[package.metadata.docs.rs]
-features = ["docs", "metrics"]
-
-[badges]
-maintenance = { status = "actively-developed" }
+[features]
+# initializes allocated memory to 0xa1, writes 0xde to deallocated memory before freeing it
+testing-shred-allocator = []
+# use a counting global allocator that provides the sled::alloc::{allocated, freed, resident, reset} functions
+testing-count-allocator = []
+for-internal-testing-only = []
+# turn off re-use of object IDs and heap slots, disable tree leaf merges, disable heap file truncation.
+monotonic-behavior = []
 
 [profile.release]
 debug = true
 opt-level = 3
 overflow-checks = true
+panic = "abort"
 
-[features]
-default = []
-for-internal-testing-only = ["event_log", "lock_free_delays", "light_testing"]
-light_testing = ["failpoints", "backtrace", "memshred"]
-lock_free_delays = []
-failpoints = []
-event_log = []
-metrics = ["num-format"]
-no_logs = ["log/max_level_off"]
-no_inline = []
-pretty_backtrace = ["color-backtrace"]
-docs = []
-no_zstd = []
-miri_optimizations = []
-mutex = []
-memshred = []
+[profile.test]
+debug = true
+overflow-checks = true
+panic = "abort"
 
 [dependencies]
-libc = "0.2.96"
-crc32fast = "1.2.1"
-log = "0.4.14"
-parking_lot = "0.12.1"
-color-backtrace = { version = "0.5.1", optional = true }
-num-format = { version = "0.4.0", optional = true }
-backtrace = { version = "0.3.60", optional = true }
-im = "15.1.0"
-
-[target.'cfg(any(target_os = "linux", target_os = "macos", target_os="windows"))'.dependencies]
+bincode = "1.3.3"
+cache-advisor = "1.0.16"
+concurrent-map = { version = "5.0.31", features = ["serde"] }
+crc32fast = "1.3.2"
+ebr = "0.2.13"
+inline-array = { version = "0.1.13", features = ["serde", "concurrent_map_minimum"] }
 fs2 = "0.4.3"
+log = "0.4.19"
+pagetable = "0.4.5"
+parking_lot = { version = "0.12.1", features = ["arc_lock"] }
+rayon = "1.7.0"
+serde = { version = "1.0", features = ["derive"] }
+stack-map = { version = "1.0.5", features = ["serde"] }
+zstd = "0.12.4"
+fnv = "1.0.7"
+fault-injection = "1.0.10"
+crossbeam-queue = "0.3.8"
+crossbeam-channel = "0.5.8"
+tempdir = "0.3.7"
 
 [dev-dependencies]
-rand = "0.7"
-rand_chacha = "0.3.1"
-rand_distr = "0.3"
-quickcheck = "0.9"
-log = "0.4.14"
-env_logger = "0.9.0"
-zerocopy = "0.6.0"
-byteorder = "1.4.3"
+env_logger = "0.10.0"
+num-format = "0.4.4"
+# heed = "0.11.0"
+# rocksdb = "0.21.0"
+# rusqlite = "0.29.0"
+# old_sled = { version = "0.34", package = "sled" }
+rand = "0.8.5"
+quickcheck = "1.0.3"
+rand_distr = "0.4.3"
+libc = "0.2.147"
 
 [[test]]
 name = "test_crash_recovery"
 path = "tests/test_crash_recovery.rs"
 harness = false
+
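
The comment on the new `testing-shred-allocator` feature describes a debugging technique that can be sketched with the standard library alone. This is a guess at the general shape, not the code in this commit: `ShredAllocator` and `ALLOCATOR` are hypothetical names, and only the two byte patterns (`0xa1` on allocation, `0xde` before freeing) come from the Cargo.toml comment.

```rust
use std::alloc::{GlobalAlloc, Layout, System};

/// Hypothetical wrapper around the system allocator that "shreds" memory.
struct ShredAllocator;

unsafe impl GlobalAlloc for ShredAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // Fill fresh memory with 0xa1 so reads of uninitialized data stand out.
            unsafe { std::ptr::write_bytes(ptr, 0xa1, layout.size()) };
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Overwrite with 0xde before freeing so use-after-free reads stand out.
        unsafe { std::ptr::write_bytes(ptr, 0xde, layout.size()) };
        unsafe { System.dealloc(ptr, layout) };
    }
}

// Only installed as the global allocator when the feature is enabled.
#[cfg(feature = "testing-shred-allocator")]
#[global_allocator]
static ALLOCATOR: ShredAllocator = ShredAllocator;
```

Gating the `#[global_allocator]` static behind the feature keeps the extra memory writes out of normal builds.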

LICENSE-APACHE

Lines changed: 1 addition & 0 deletions
@@ -194,6 +194,7 @@
 Copyright 2020 Tyler Neely
 Copyright 2021 Tyler Neely
 Copyright 2022 Tyler Neely
+Copyright 2023 Tyler Neely
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.

LICENSE-MIT

Lines changed: 4 additions & 0 deletions
@@ -1,8 +1,12 @@
+Copyright (c) 2015 Tyler Neely
+Copyright (c) 2016 Tyler Neely
+Copyright (c) 2017 Tyler Neely
 Copyright (c) 2018 Tyler Neely
 Copyright (c) 2019 Tyler Neely
 Copyright (c) 2020 Tyler Neely
 Copyright (c) 2021 Tyler Neely
 Copyright (c) 2022 Tyler Neely
+Copyright (c) 2023 Tyler Neely
 
 Permission is hereby granted, free of charge, to any
 person obtaining a copy of this software and associated

SAFETY.md

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ These hazards can result in the above losses:
 * bugs in the GC system
 * the old location is overwritten before the defragmented location becomes durable
 * bugs in the recovery system
-* hardware failures
+* hardare failures
 * consistency violations may be caused by
 * transaction concurrency control failure to enforce linearizability (strict serializability)
 * non-linearizable lock-free single-key operations

benchmarks/criterion/Cargo.toml

Lines changed: 0 additions & 17 deletions
This file was deleted.
