diff --git a/README.md b/README.md
index d4edc0129..3c8f20d56 100644
--- a/README.md
+++ b/README.md
@@ -104,37 +104,39 @@ The documentation provides two measures of complexity:
The complexities are described in terms of the following variables and
constants:
-- The variable *n* refers to the number of *physical* table entries. A
+- The variable $`n`$ refers to the number of *physical* table entries. A
*physical* table entry is any key–operation pair, e.g., `Insert k v`
or `Delete k`, whereas a *logical* table entry is determined by all
- physical entries with the same key. If the variable *n* is used to
+ physical entries with the same key. If the variable $`n`$ is used to
describe the complexity of an operation that involves multiple tables,
it refers to the sum of all table entries.
-- The variable *o* refers to the number of open tables and cursors in
+- The variable $`o`$ refers to the number of open tables and cursors in
the session.
-- The variable *s* refers to the number of snapshots in the session.
+- The variable $`s`$ refers to the number of snapshots in the session.
-- The variable *b* usually refers to the size of a batch of
+- The variable $`b`$ usually refers to the size of a batch of
inputs/outputs. Its precise meaning is explained for each occurrence.
-- The constant *B* refers to the size of the write buffer, which is a
- configuration parameter.
+- The constant $`B`$ refers to the size of the write buffer, which is
+ determined by the `TableConfig` parameter `confWriteBufferAlloc`.
-- The constant *T* refers to the size ratio of the table, which is a
- configuration parameter.
+- The constant $`T`$ refers to the size ratio of the table, which is
+ determined by the `TableConfig` parameter `confSizeRatio`.
-- The constant *P* refers to the the average number of key–value pairs
+- The constant $`P`$ refers to the average number of key–value pairs
that fit in a page of memory.
#### Disk I/O cost of operations
-The following table summarises the cost of the operations on LSM-trees
-measured in the number of disk I/O operations. If the cost depends on
-the merge policy or merge schedule, then the table contains one entry
-for each relevant combination. Otherwise, the merge policy and/or merge
-schedule is listed as N/A.
+The following table summarises the worst-case cost of the operations on
+LSM-trees measured in the number of disk I/O operations. If the cost
+depends on the merge policy or merge schedule, then the table contains
+one entry for each relevant combination. Otherwise, the merge policy
+and/or merge schedule is listed as N/A. The merge policy and merge
+schedule are determined by the `TableConfig` parameters
+`confMergePolicy` and `confMergeSchedule`.
@@ -143,7 +145,7 @@ schedule is listed as N/A.
Operation |
Merge policy |
Merge schedule |
-Cost in disk I/O operations |
+Worst-case disk I/O complexity |
@@ -273,84 +275,377 @@ schedule is listed as N/A.
-(\*The variable *b* refers to the number of entries retrieved by the
+(\*The variable $`b`$ refers to the number of entries retrieved by the
range lookup.)
-TODO: Document the average-case behaviour of lookups.
+#### Table Size
-#### In-memory size of tables
+The in-memory and the on-disk size of an LSM-tree scale *linearly* with
+the number of physical entries. However, the in-memory size is smaller
+by a significant factor. Let us look at a table that uses the default
+configuration and has 100 million entries with 34 byte keys and 60 byte
+values. The total size of 100 million key–value pairs is approximately
+8.75GiB. Hence, the on-disk size would be at least 8.75GiB, not counting
+the overhead for metadata.
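+
+As a quick sanity check of the 8.75GiB figure, the raw size of the
+key–value data alone can be computed as follows. This is a standalone
+sketch, not part of the library API, and `rawDataSizeGiB` is a
+hypothetical name:
+
+``` haskell
+-- 100 million pairs, each a 34 byte key plus a 60 byte value,
+-- expressed in GiB (1 GiB = 1024^3 bytes); evaluates to roughly 8.75.
+rawDataSizeGiB :: Double
+rawDataSizeGiB = 100e6 * (34 + 60) / 1024 ^ (3 :: Int)
+```
+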
-The in-memory size of an LSM-tree is described in terms of the variable
-*n*, which refers to the number of *physical* database entries. A
-*physical* database entry is any key–operation pair, e.g., `Insert k v`
-or `Delete k`, whereas a *logical* database entry is determined by all
-physical entries with the same key.
+The in-memory size would be approximately 265.39MiB:
-The worst-case in-memory size of an LSM-tree is *O*(*n*).
+- The write buffer would store at most 20,000 entries, which is
+ approximately 2.86MiB.
-- The worst-case in-memory size of the write buffer is *O*(*B*).
+- The fence-pointer indexes would store approximately 2.29 million keys,
+ which is approximately 9.30MiB.
- The maximum size of the write buffer on the write buffer allocation
- strategy, which is determined by the `confWriteBufferAlloc` field of
- `TableConfig`. Regardless of write buffer allocation strategy, the
- size of the write buffer may never exceed 4GiB.
+- The Bloom filters would use 15.78 bits per entry, which is
+ approximately 188.11MiB.
- `AllocNumEntries maxEntries`
- The maximum size of the write buffer is the maximum number of entries
- multiplied by the average size of a key–operation pair.
+For a discussion of how the sizes of these components are determined by
+the table configuration, see [Fine-tuning Table
+Configuration](#fine_tuning "#fine_tuning").
-- The worst-case in-memory size of the Bloom filters is *O*(*n*).
-
- The total in-memory size of all Bloom filters is the number of bits
- per physical entry multiplied by the number of physical entries. The
- required number of bits per physical entry is determined by the Bloom
- filter allocation strategy, which is determined by the
- `confBloomFilterAlloc` field of `TableConfig`.
-
- `AllocFixed bitsPerPhysicalEntry`
- The number of bits per physical entry is specified as
- `bitsPerPhysicalEntry`.
-
- `AllocRequestFPR requestedFPR`
- The number of bits per physical entry is determined by the requested
- false-positive rate, which is specified as `requestedFPR`.
-
- The false-positive rate scales exponentially with the number of bits
- per entry:
-
- | False-positive rate | Bits per entry |
- |---------------------|----------------|
- | 1 in 10 | ≈ 4.77 |
- | 1 in 100 | ≈ 9.85 |
- | 1 in 1, 000 | ≈ 15.79 |
- | 1 in 10, 000 | ≈ 22.58 |
- | 1 in 100, 000 | ≈ 30.22 |
-
-- The worst-case in-memory size of the indexes is *O*(*n*).
-
- The total in-memory size of all indexes depends on the index type,
- which is determined by the `confFencePointerIndex` field of
- `TableConfig`. The in-memory size of the various indexes is described
- in reference to the size of the database in [*memory
- pages*](https://en.wikipedia.org/wiki/Page_%28computer_memory%29 "https://en.wikipedia.org/wiki/Page_%28computer_memory%29").
-
- `OrdinaryIndex`
- An ordinary index stores the maximum serialised key for each memory
- page. The total in-memory size of all indexes is proportional to the
- average size of one serialised key per memory page.
-
- `CompactIndex`
- A compact index stores the 64 most significant bits of the minimum
- serialised key for each memory page, as well as 1 bit per memory page
- to resolve clashes, 1 bit per memory page to mark overflow pages, and
- a negligible amount of memory for tie breakers. The total in-memory
- size of all indexes is approximately 66 bits per memory page.
-
-The total size of an LSM-tree must not exceed 241 physical
+The total size of an LSM-tree must not exceed $`2^{41}`$ physical
entries. Violation of this condition *is* checked and will throw a
`TableTooLargeError`.
-### Implementation
+#### Fine-tuning Table Configuration
+
+`confMergePolicy`
+The *merge policy* balances the performance of lookups against the
+performance of updates. Levelling favours lookups. Tiering favours
+updates. Lazy levelling strikes a middle ground between levelling and
+tiering, and moderately favours updates. This parameter is explicitly
+referenced in the documentation of those operations it affects.
+
+`confSizeRatio`
+The *size ratio* pushes the effects of the merge policy to the extreme.
+If the size ratio is higher, levelling favours lookups more, and tiering
+and lazy levelling favour updates more. This parameter is referred to as
+$`T`$ in the disk I/O cost of operations.
+
+`confWriteBufferAlloc`
+The *write buffer capacity* balances the performance of lookups and
+updates against the in-memory size of the table. If the write buffer is
+larger, it takes up more memory, but lookups and updates are more
+efficient. This parameter is referred to as $`B`$ in the disk I/O cost
+of operations. Irrespective of this parameter, the write buffer size
+cannot exceed 4GiB.
+
+`confMergeSchedule`
+The *merge schedule* balances the performance of lookups and updates
+against the smooth performance of updates. The merge schedule does not
+affect the performance of table unions. With the one-shot merge
+schedule, lookups and updates are more efficient overall, but some
+updates may take much longer than others. With the incremental merge
+schedule, lookups and updates are less efficient overall, but each
+update does a similar amount of work. This parameter is explicitly
+referenced in the documentation of those operations it affects.
+
+`confBloomFilterAlloc`
+The Bloom filter size balances the performance of lookups against the
+in-memory size of the table. If the Bloom filters are larger, they take
+up more memory, but lookup operations are more efficient.
+
+`confFencePointerIndex`
+The *fence-pointer index type* supports two types of indexes. The
+*ordinary* indexes are designed to work with any key. The *compact*
+indexes are optimised for the case where the keys in the database are
+uniformly distributed, e.g., when the keys are hashes.
+
+`confDiskCachePolicy`
+The *disk cache policy* determines if lookup operations use the OS page
+cache. Caching may improve the performance of lookups if database access
+follows certain patterns.
+
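+For reference, a custom configuration is typically constructed by
+overriding individual fields of `defaultTableConfig`. The following is
+a sketch that assumes the `TableConfig` record fields are exported as
+listed above; the specific field values and the name `myTableConfig`
+are illustrative only, not recommendations:
+
+``` haskell
+import Database.LSMTree
+
+-- Start from the defaults and override only the parameters of interest.
+myTableConfig :: TableConfig
+myTableConfig = defaultTableConfig
+    { confWriteBufferAlloc = AllocNumEntries 50000
+    , confBloomFilterAlloc = AllocRequestFPR 1.0e-4
+    , confMergeSchedule    = Incremental
+    }
+```
+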
+##### Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size
+
+The configuration parameters `confMergePolicy`, `confSizeRatio`, and
+`confWriteBufferAlloc` affect how the table organises its data. To
+understand what effect these parameters have, one must have a basic
+understanding of how an LSM-tree stores its data. The physical entries in
+an LSM-tree are key–operation pairs, which pair a key with an operation
+such as an `Insert` with a value or a `Delete`. These key–operation
+pairs are organised into *runs*, which are sequences of key–operation
+pairs sorted by their key. Runs are organised into *levels*, which are
+unordered sequences of runs. Levels are organised hierarchically. Level
+0 is kept in memory, and is referred to as the *write buffer*. All
+subsequent levels are stored on disk, with each run stored in its own
+file. The following shows an example LSM-tree layout, with each run as a
+boxed sequence of keys and each level as a row.
+
+``` math
+
+\begin{array}{l:l}
+\text{Level}
+&
+\text{Data}
+\\
+0
+&
+\fbox{\(\texttt{4}\,\_\)}
+\\
+1
+&
+\fbox{\(\texttt{1}\,\texttt{3}\)}
+\quad
+\fbox{\(\texttt{2}\,\texttt{7}\)}
+\\
+2
+&
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+\end{array}
+```
+
+The data in an LSM-tree is *partially sorted*: only the key–operation
+pairs within each run are sorted and deduplicated. As a rule of thumb,
+keeping more of the data sorted means lookup operations are faster but
+update operations are slower.
+
+The configuration parameters `confMergePolicy`, `confSizeRatio`, and
+`confWriteBufferAlloc` directly affect a table's data layout. The
+parameter `confWriteBufferAlloc` determines the capacity of the write
+buffer.
+
+`AllocNumEntries maxEntries`
+The write buffer can contain at most `maxEntries` entries. The constant
+$`B`$ refers to the value of `maxEntries`. Irrespective of this
+parameter, the write buffer size cannot exceed 4GiB.
+
+The parameter `confSizeRatio` determines the ratio between the
+capacities of successive levels. The constant $`T`$ refers to the value
+of `confSizeRatio`. For instance, if $`B = 2`$ and $`T = 2`$, then
+
+``` math
+
+\begin{array}{l:l}
+\text{Level} & \text{Capacity}
+\\
+0 & B \cdot T^0 = 2
+\\
+1 & B \cdot T^1 = 4
+\\
+2 & B \cdot T^2 = 8
+\\
+\ell & B \cdot T^\ell
+\end{array}
+```
+
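+The capacity formula can be written out as a small helper. This is a
+sketch, not part of the library API, and `levelCapacity` is a
+hypothetical name:
+
+``` haskell
+-- Capacity of level l, given a write buffer capacity b and a size
+-- ratio t: level l holds up to b * t^l entries.
+levelCapacity :: Int -> Int -> Int -> Int
+levelCapacity b t l = b * t ^ l
+
+-- With b = 2 and t = 2, as in the example above:
+-- map (levelCapacity 2 2) [0, 1, 2] == [2, 4, 8]
+```
+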
+The merge policy `confMergePolicy` determines the number of runs per
+level. In a *tiering* LSM-tree, each level contains $`T`$ runs. In a
+*levelling* LSM-tree, each level contains one single run. The *lazy
+levelling* policy uses levelling only for the last level and uses
+tiering for all preceding levels. The previous example used lazy
+levelling. The following examples illustrate the different merge
+policies using the same data, assuming $`B = 2`$ and $`T = 2`$.
+
+``` math
+
+\begin{array}{l:l:l:l}
+\text{Level}
+&
+\text{Tiering}
+&
+\text{Levelling}
+&
+\text{Lazy Levelling}
+\\
+0
+&
+\fbox{\(\texttt{4}\,\_\)}
+&
+\fbox{\(\texttt{4}\,\_\)}
+&
+\fbox{\(\texttt{4}\,\_\)}
+\\
+1
+&
+\fbox{\(\texttt{1}\,\texttt{3}\)}
+\quad
+\fbox{\(\texttt{2}\,\texttt{7}\)}
+&
+\fbox{\(\texttt{1}\,\texttt{2}\,\texttt{3}\,\texttt{7}\)}
+&
+\fbox{\(\texttt{1}\,\texttt{3}\)}
+\quad
+\fbox{\(\texttt{2}\,\texttt{7}\)}
+\\
+2
+&
+\fbox{\(\texttt{4}\,\texttt{5}\,\texttt{7}\,\texttt{8}\)}
+\quad
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{9}\)}
+&
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+&
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+\end{array}
+```
+
+Tiering favours the performance of updates. Levelling favours the
+performance of lookups. Lazy levelling strikes a middle ground between
+tiering and levelling. It favours the performance of lookup operations
+for the oldest data and enables more deduplication, without the impact
+that full levelling has on update operations.
+
+##### Fine-tuning: Merge Schedule
+
+The configuration parameter `confMergeSchedule` affects the worst-case
+performance of lookup and update operations and the structure of runs.
+Regardless of the merge schedule, the amortised disk I/O complexity of
+lookups and updates is *logarithmic* in the size of the table. When the
+write buffer fills up, its contents are flushed to disk as a run and
+added to level 1. When some level fills up, its contents are flushed
+down to the next level. Eventually, as data is flushed down, runs must
+be merged. This package supports two schedules for merging:
+
+- Using the `OneShot` merge schedule, runs must always be kept fully
+ sorted and deduplicated. However, flushing a run down to the next
+ level may cause the next level to fill up, in which case it too must
+  be flushed and merged further down. In the worst case, this can cascade
+ down the entire table. Consequently, the worst-case disk I/O
+ complexity of updates is *linear* in the size of the table. This is
+  unsuitable for real-time systems and other use cases where
+ unresponsiveness is unacceptable.
+
+- Using the `Incremental` merge schedule, runs can be *partially
+  merged*, which allows the merging work to be spread out evenly across
+ all update operations. This aligns the worst-case and average-case
+ disk I/O complexity of updates—both are *logarithmic* in the size of
+ the table. The cost is a small constant overhead for both lookup and
+ update operations.
+
+The merge schedule does not affect the performance of table unions.
+Instead, there are separate operations for one-shot and incremental
+unions. The amortised disk I/O complexity of one-shot unions is
+*linear* in the size of the tables. For incremental unions, it is up to
+the user to spread the merging work out evenly over time.
+
+##### Fine-tuning: Bloom Filter Size
+
+The configuration parameter `confBloomFilterAlloc` affects the size of
+the Bloom filters, which balances the performance of lookups against the
+in-memory size of the table.
+
+Tables maintain a [Bloom
+filter](https://en.wikipedia.org/wiki/Bloom_filter "https://en.wikipedia.org/wiki/Bloom_filter")
+in memory for each run on disk. These Bloom filters are probabilistic
+data structures that are used to track which keys are present in their
+corresponding run. Querying a Bloom filter returns either "maybe",
+meaning the key is possibly in the run, or "no", meaning the key is
+definitely not in the run. When a query returns "maybe" while the key is
+*not* in the run, this is referred to as a *false positive*. While the
+database executes a lookup operation, any Bloom filter query that
+returns a false positive causes the database to unnecessarily read a run
+from disk. The probability of these spurious reads follows a [binomial
+distribution](https://en.wikipedia.org/wiki/Binomial_distribution "https://en.wikipedia.org/wiki/Binomial_distribution")
+$`\text{Binomial}(r,\text{FPR})`$ where $`r`$ refers to the number of
+runs and $`\text{FPR}`$ refers to the false-positive rate of the Bloom
+filters. Hence, the expected number of spurious reads for each lookup
+operation is $`r\cdot\text{FPR}`$. The number of runs $`r`$ is
+proportional to the number of physical entries in the table. Its exact
+value depends on the merge policy of the table:
+
+`LazyLevelling`
+$`r = T (\log_T\frac{n}{B} - 1) + 1`$.
+
+The false-positive rate scales exponentially with the size of the Bloom
+filters in bits per entry.
+
+| False-positive rate (FPR) | Bits per entry (BPE) |
+|---------------------------|----------------------|
+| $`1\text{ in }10`$ | $`\approx 4.77 `$ |
+| $`1\text{ in }100`$ | $`\approx 9.85 `$ |
+| $`1\text{ in }1{,}000`$ | $`\approx 15.78 `$ |
+| $`1\text{ in }10{,}000`$ | $`\approx 22.57 `$ |
+| $`1\text{ in }100{,}000`$ | $`\approx 30.22 `$ |
+
+The configuration parameter `confBloomFilterAlloc` can be specified in
+two ways:
+
+`AllocFixed bitsPerEntry`
+Allocate the requested number of bits per entry in the table.
+
+The value must be strictly positive, but fractional values are permitted.
+The recommended range is $`[2, 24]`$.
+
+`AllocRequestFPR falsePositiveRate`
+Allocate the required number of bits per entry to get the requested
+false-positive rate.
+
+The value must be in the range $`(0, 1)`$. The recommended range is
+$`[1\mathrm{e}{ -5 },1\mathrm{e}{ -2 }]`$.
+
+The total in-memory size of all Bloom filters scales *linearly* with the
+number of physical entries in the table and is determined by the number
+of physical entries multiplied by the number of bits per physical entry,
+i.e., $`n\cdot\text{BPE}`$. Let us consider a table with 100 million
+physical entries which uses the default table configuration for every
+parameter other than the Bloom filter size.
+
+| False-positive rate (FPR) | Bloom filter size | Expected spurious reads per lookup |
+|----|----|----|
+| $`1\text{ in }10`$ | $` 56.86\text{MiB}`$ | $` 2.56\text{ spurious reads every lookup }`$ |
+| $`1\text{ in }100`$ | $`117.42\text{MiB}`$ | $` 1 \text{ spurious read every } 3.91\text{ lookups }`$ |
+| $`1\text{ in }1{,}000`$ | $`188.11\text{MiB}`$ | $` 1 \text{ spurious read every } 39.10\text{ lookups }`$ |
+| $`1\text{ in }10{,}000`$ | $`269.06\text{MiB}`$ | $` 1 \text{ spurious read every } 391.01\text{ lookups }`$ |
+| $`1\text{ in }100{,}000`$ | $`360.25\text{MiB}`$ | $` 1 \text{ spurious read every } 3910.19\text{ lookups }`$ |
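+
+The sizes in this table follow directly from $`n\cdot\text{BPE}`$. The
+following sketch reproduces the 188.11MiB figure for 100 million
+entries at 15.78 bits per entry; it is not part of the library API, and
+`bloomFilterSizeMiB` is a hypothetical name:
+
+``` haskell
+-- Total Bloom filter size in MiB for n physical entries at the given
+-- number of bits per entry (8 bits per byte, 1 MiB = 1024^2 bytes).
+bloomFilterSizeMiB :: Double -> Double -> Double
+bloomFilterSizeMiB n bitsPerEntry = n * bitsPerEntry / 8 / 1024 ^ (2 :: Int)
+
+-- bloomFilterSizeMiB 100e6 15.78 ≈ 188.11
+```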
+
+##### Fine-tuning: Fence-Pointer Index Type
+
+The configuration parameter `confFencePointerIndex` affects the type and
+size of the fence-pointer indexes. Tables maintain a fence-pointer index
+in memory for each run on disk. These fence-pointer indexes store the
+keys at the boundaries of each page of memory to ensure that each lookup
+has to read at most one page of memory from each run. Tables support two
+types of fence-pointer indexes:
+
+`OrdinaryIndex`
+Ordinary indexes are designed for any use case.
+
+Ordinary indexes store one serialised key per page of memory. The total
+in-memory size of all indexes is $`K \cdot \frac{n}{P}`$ bits, where
+$`K`$ refers to the average size of a serialised key in bits.
+
+`CompactIndex`
+Compact indexes are designed for the use case where the keys in the
+table are uniformly distributed, such as when using hashes.
+
+Compact indexes store the 64 most significant bits of the minimum
+serialised key of each page of memory. This requires that serialised
+keys are *at least* 64 bits in size. Compact indexes store 1 additional
+bit per page of memory to resolve collisions, 1 additional bit per page
+of memory to mark entries that are larger than one page, and a
+negligible amount of memory for tie breakers. The total in-memory size
+of all indexes is $`66 \cdot \frac{n}{P}`$ bits.
+
+##### Fine-tuning: Disk Cache Policy
+
+The configuration parameter `confDiskCachePolicy` determines how the
+database uses the OS page cache. This may improve performance if the
+database's *access pattern* has good *temporal locality* or good
+*spatial locality*. The database's access pattern refers to the pattern
+by which entries are accessed by lookup operations. An access pattern
+has good temporal locality if it is likely to access entries that were
+recently accessed or updated. An access pattern has good spatial
+locality if it is likely to access entries that have nearby keys.
+
+- Use the `DiskCacheAll` policy if the database's access pattern has
+ either good spatial locality or both good spatial and temporal
+ locality.
+
+- Use the `DiskCacheLevelOneTo l` policy if the database's access
+ pattern has good temporal locality for updates only. The variable `l`
+ determines the number of levels that are cached. For a description of
+ levels, see [Merge Policy, Size Ratio, and Write Buffer
+ Size](#fine_tuning_data_layout "#fine_tuning_data_layout"). With this
+ setting, the database can be expected to cache up to $`\frac{k}{P}`$
+ pages of memory, where $`k`$ refers to the number of entries that fit
+  in levels $`[1,l]`$ and is defined as $`\sum_{i=1}^{l}BT^{i}`$; see
+  the sketch after this list.
+
+- Use the `DiskCacheNone` policy if the database's access pattern does
+  not have good spatial or temporal locality, for instance, if the
+  access pattern is uniformly random.
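+
+The sketch below spells out the sum $`\sum_{i=1}^{l}BT^{i}`$ from the
+`DiskCacheLevelOneTo` bullet above. It is not part of the library API,
+and `cachedEntries` is a hypothetical name:
+
+``` haskell
+-- Number of entries that fit in levels [1, l], given a write buffer
+-- capacity b and a size ratio t: the sum of the capacities b * t^i.
+cachedEntries :: Int -> Int -> Int -> Int
+cachedEntries b t l = sum [ b * t ^ i | i <- [1 .. l] ]
+
+-- With the default configuration (b = 20000, t = 4) and l = 2:
+-- cachedEntries 20000 4 2 == 400000
+```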
+
+### References
The implementation of LSM-trees in this package draws inspiration from:
diff --git a/bench/macro/lsm-tree-bench-wp8.hs b/bench/macro/lsm-tree-bench-wp8.hs
index bdcae3a01..cb6485349 100644
--- a/bench/macro/lsm-tree-bench-wp8.hs
+++ b/bench/macro/lsm-tree-bench-wp8.hs
@@ -227,7 +227,7 @@ cmdP = O.subparser $ mconcat
setupOptsP :: O.Parser SetupOpts
setupOptsP = pure SetupOpts
- <*> O.option O.auto (O.long "bloom-filter-alloc" <> O.value LSM.defaultBloomFilterAlloc <> O.showDefault <> O.help "Bloom filter allocation method [AllocFixed n | AllocRequestFPR d]")
+ <*> O.option O.auto (O.long "bloom-filter-alloc" <> O.value (LSM.confBloomFilterAlloc LSM.defaultTableConfig) <> O.showDefault <> O.help "Bloom filter allocation method [AllocFixed n | AllocRequestFPR d]")
runOptsP :: O.Parser RunOpts
runOptsP = pure RunOpts
diff --git a/lsm-tree.cabal b/lsm-tree.cabal
index 5454c2f99..b113ee8db 100644
--- a/lsm-tree.cabal
+++ b/lsm-tree.cabal
@@ -71,18 +71,21 @@ description:
* The variable \(s\) refers to the number of snapshots in the session.
* The variable \(b\) usually refers to the size of a batch of inputs\/outputs.
Its precise meaning is explained for each occurrence.
- * The constant \(B\) refers to the size of the write buffer, which is a configuration parameter.
- * The constant \(T\) refers to the size ratio of the table, which is a configuration parameter.
- * The constant \(P\) refers to the the average number of key–value pairs that fit in a page of memory.
+ * The constant \(B\) refers to the size of the write buffer,
+ which is determined by the @TableConfig@ parameter @confWriteBufferAlloc@.
+ * The constant \(T\) refers to the size ratio of the table,
+ which is determined by the @TableConfig@ parameter @confSizeRatio@.
+ * The constant \(P\) refers to the average number of key–value pairs that fit in a page of memory.
=== Disk I\/O cost of operations #performance_time#
- The following table summarises the cost of the operations on LSM-trees measured in the number of disk I\/O operations.
+ The following table summarises the worst-case cost of the operations on LSM-trees measured in the number of disk I\/O operations.
If the cost depends on the merge policy or merge schedule, then the table contains one entry for each relevant combination.
Otherwise, the merge policy and\/or merge schedule is listed as N\/A.
+ The merge policy and merge schedule are determined by the @TableConfig@ parameters @confMergePolicy@ and @confMergeSchedule@.
+----------+------------------------+-----------------+-----------------+------------------------------------------------+
- | Resource | Operation | Merge policy | Merge schedule | Cost in disk I\/O operations |
+ | Resource | Operation | Merge policy | Merge schedule | Worst-case disk I\/O complexity |
+==========+========================+=================+=================+================================================+
| Session | Create\/Open | N\/A | N\/A | \(O(1)\) |
+----------+------------------------+-----------------+-----------------+------------------------------------------------+
@@ -121,65 +124,312 @@ description:
(*The variable \(b\) refers to the number of entries retrieved by the range lookup.)
- TODO: Document the average-case behaviour of lookups.
+ === Table Size #performance_size#
- === In-memory size of tables #performance_size#
+ The in-memory and the on-disk size of an LSM-tree scale /linearly/ with the number of physical entries.
+ However, the in-memory size is smaller by a significant factor.
+ Let us look at a table that uses the default configuration and has 100 million entries with 34 byte keys and 60 byte values.
+ The total size of 100 million key–value pairs is approximately 8.75GiB.
+ Hence, the on-disk size would be at least 8.75GiB, not counting the overhead for metadata.
- The in-memory size of an LSM-tree is described in terms of the variable \(n\), which refers to the number of /physical/ database entries.
- A /physical/ database entry is any key–operation pair, e.g., @Insert k v@ or @Delete k@, whereas a /logical/ database entry is determined by all physical entries with the same key.
+ The in-memory size would be approximately 265.39MiB:
- The worst-case in-memory size of an LSM-tree is \(O(n)\).
+ * The write buffer would store at most 20,000 entries, which is approximately 2.86MiB.
+ * The fence-pointer indexes would store approximately 2.29 million keys, which is approximately 9.30MiB.
+ * The Bloom filters would use 15.78 bits per entry, which is approximately 188.11MiB.
- * The worst-case in-memory size of the write buffer is \(O(B)\).
-
- The maximum size of the write buffer on the write buffer allocation strategy, which is determined by the @confWriteBufferAlloc@ field of @TableConfig@.
- Regardless of write buffer allocation strategy, the size of the write buffer may never exceed 4GiB.
-
- [@AllocNumEntries maxEntries@]:
- The maximum size of the write buffer is the maximum number of entries multiplied by the average size of a key–operation pair.
-
- * The worst-case in-memory size of the Bloom filters is \(O(n)\).
-
- The total in-memory size of all Bloom filters is the number of bits per physical entry multiplied by the number of physical entries.
- The required number of bits per physical entry is determined by the Bloom filter allocation strategy, which is determined by the @confBloomFilterAlloc@ field of @TableConfig@.
-
- [@AllocFixed bitsPerPhysicalEntry@]:
- The number of bits per physical entry is specified as @bitsPerPhysicalEntry@.
- [@AllocRequestFPR requestedFPR@]:
- The number of bits per physical entry is determined by the requested false-positive rate, which is specified as @requestedFPR@.
-
- The false-positive rate scales exponentially with the number of bits per entry:
-
- +---------------------------+---------------------+
- | False-positive rate | Bits per entry |
- +===========================+=====================+
- | \(1\text{ in }10\) | \(\approx 4.77 \) |
- +---------------------------+---------------------+
- | \(1\text{ in }100\) | \(\approx 9.85 \) |
- +---------------------------+---------------------+
- | \(1\text{ in }1{,}000\) | \(\approx 15.79 \) |
- +---------------------------+---------------------+
- | \(1\text{ in }10{,}000\) | \(\approx 22.58 \) |
- +---------------------------+---------------------+
- | \(1\text{ in }100{,}000\) | \(\approx 30.22 \) |
- +---------------------------+---------------------+
-
- * The worst-case in-memory size of the indexes is \(O(n)\).
-
- The total in-memory size of all indexes depends on the index type, which is determined by the @confFencePointerIndex@ field of @TableConfig@.
- The in-memory size of the various indexes is described in reference to the size of the database in [/memory pages/](https://en.wikipedia.org/wiki/Page_%28computer_memory%29).
-
- [@OrdinaryIndex@]:
- An ordinary index stores the maximum serialised key for each memory page.
- The total in-memory size of all indexes is proportional to the average size of one serialised key per memory page.
- [@CompactIndex@]:
- A compact index stores the 64 most significant bits of the minimum serialised key for each memory page, as well as 1 bit per memory page to resolve clashes, 1 bit per memory page to mark overflow pages, and a negligible amount of memory for tie breakers.
- The total in-memory size of all indexes is approximately 66 bits per memory page.
+ For a discussion of how the sizes of these components are determined by the table configuration, see [Fine-tuning Table Configuration](#fine_tuning).
The total size of an LSM-tree must not exceed \(2^{41}\) physical entries.
Violation of this condition /is/ checked and will throw a 'TableTooLargeError'.
- == Implementation
+ === Fine-tuning Table Configuration #fine_tuning#
+
+ [@confMergePolicy@]
+ The /merge policy/ balances the performance of lookups against the performance of updates.
+ Levelling favours lookups.
+ Tiering favours updates.
+ Lazy levelling strikes a middle ground between levelling and tiering, and moderately favours updates.
+ This parameter is explicitly referenced in the documentation of those operations it affects.
+
+ [@confSizeRatio@]
+ The /size ratio/ pushes the effects of the merge policy to the extreme.
+ If the size ratio is higher, levelling favours lookups more, and tiering and lazy levelling favour updates more.
+ This parameter is referred to as \(T\) in the disk I\/O cost of operations.
+
+ [@confWriteBufferAlloc@]
+ The /write buffer capacity/ balances the performance of lookups and updates against the in-memory size of the table.
+ If the write buffer is larger, it takes up more memory, but lookups and updates are more efficient.
+ This parameter is referred to as \(B\) in the disk I\/O cost of operations.
+ Irrespective of this parameter, the write buffer size cannot exceed 4GiB.
+
+ [@confMergeSchedule@]
+ The /merge schedule/ balances the performance of lookups and updates against the smooth performance of updates.
+ The merge schedule does not affect the performance of table unions.
+ With the one-shot merge schedule, lookups and updates are more efficient overall, but some updates may take much longer than others.
+ With the incremental merge schedule, lookups and updates are less efficient overall, but each update does a similar amount of work.
+ This parameter is explicitly referenced in the documentation of those operations it affects.
+
+ [@confBloomFilterAlloc@]
+ The Bloom filter size balances the performance of lookups against the in-memory size of the table.
+ If the Bloom filters are larger, they take up more memory, but lookup operations are more efficient.
+
+ [@confFencePointerIndex@]
+ The /fence-pointer index type/ supports two types of indexes.
+ The /ordinary/ indexes are designed to work with any key.
+ The /compact/ indexes are optimised for the case where the keys in the database are uniformly distributed, e.g., when the keys are hashes.
+
+ [@confDiskCachePolicy@]
+ The /disk cache policy/ determines if lookup operations use the OS page cache.
+ Caching may improve the performance of lookups if database access follows certain patterns.
+
+ ==== Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size #fine_tuning_data_layout#
+
+ The configuration parameters @confMergePolicy@, @confSizeRatio@, and @confWriteBufferAlloc@ affect how the table organises its data.
+ To understand what effect these parameters have, one must have a basic understanding of how an LSM-tree stores its data.
+ The physical entries in an LSM-tree are key–operation pairs, which pair a key with an operation such as an @Insert@ with a value or a @Delete@.
+ These key–operation pairs are organised into /runs/, which are sequences of key–operation pairs sorted by their key.
+ Runs are organised into /levels/, which are unordered sequences of runs.
+ Levels are organised hierarchically.
+ Level 0 is kept in memory, and is referred to as the /write buffer/.
+ All subsequent levels are stored on disk, with each run stored in its own file.
+ The following shows an example LSM-tree layout, with each run as a boxed sequence of keys and each level as a row.
+
+ \[
+ \begin{array}{l:l}
+ \text{Level}
+ &
+ \text{Data}
+ \\
+ 0
+ &
+ \fbox{\(\texttt{4}\,\_\)}
+ \\
+ 1
+ &
+ \fbox{\(\texttt{1}\,\texttt{3}\)}
+ \quad
+ \fbox{\(\texttt{2}\,\texttt{7}\)}
+ \\
+ 2
+ &
+ \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+ \end{array}
+ \]
+
+ The data in an LSM-tree is /partially sorted/: only the key–operation pairs within each run are sorted and deduplicated.
+ As a rule of thumb, keeping more of the data sorted means lookup operations are faster but update operations are slower.
+
+ The configuration parameters @confMergePolicy@, @confSizeRatio@, and @confWriteBufferAlloc@ directly affect a table's data layout.
+ The parameter @confWriteBufferAlloc@ determines the capacity of the write buffer.
+
+ [@AllocNumEntries maxEntries@]:
+ The write buffer can contain at most @maxEntries@ entries.
+ The constant \(B\) refers to the value of @maxEntries@.
+ Irrespective of this parameter, the write buffer size cannot exceed 4GiB.
+
+ The parameter @confSizeRatio@ determines the ratio between the capacities of successive levels.
+ The constant \(T\) refers to the value of @confSizeRatio@.
+ For instance, if \(B = 2\) and \(T = 2\), then
+
+ \[
+ \begin{array}{l:l}
+ \text{Level} & \text{Capacity}
+ \\
+ 0 & B \cdot T^0 = 2
+ \\
+ 1 & B \cdot T^1 = 4
+ \\
+ 2 & B \cdot T^2 = 8
+ \\
+ \ell & B \cdot T^\ell
+ \end{array}
+ \]
+
+ The merge policy @confMergePolicy@ determines the number of runs per level.
+ In a /tiering/ LSM-tree, each level contains \(T\) runs.
+ In a /levelling/ LSM-tree, each level contains one single run.
+ The /lazy levelling/ policy uses levelling only for the last level and uses tiering for all preceding levels.
+ The previous example used lazy levelling.
+ The following examples illustrate the different merge policies using the same data, assuming \(B = 2\) and \(T = 2\).
+
+ \[
+ \begin{array}{l:l:l:l}
+ \text{Level}
+ &
+ \text{Tiering}
+ &
+ \text{Levelling}
+ &
+ \text{Lazy Levelling}
+ \\
+ 0
+ &
+ \fbox{\(\texttt{4}\,\_\)}
+ &
+ \fbox{\(\texttt{4}\,\_\)}
+ &
+ \fbox{\(\texttt{4}\,\_\)}
+ \\
+ 1
+ &
+ \fbox{\(\texttt{1}\,\texttt{3}\)}
+ \quad
+ \fbox{\(\texttt{2}\,\texttt{7}\)}
+ &
+ \fbox{\(\texttt{1}\,\texttt{2}\,\texttt{3}\,\texttt{7}\)}
+ &
+ \fbox{\(\texttt{1}\,\texttt{3}\)}
+ \quad
+ \fbox{\(\texttt{2}\,\texttt{7}\)}
+ \\
+ 2
+ &
+ \fbox{\(\texttt{4}\,\texttt{5}\,\texttt{7}\,\texttt{8}\)}
+ \quad
+ \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{9}\)}
+ &
+ \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+ &
+ \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+ \end{array}
+ \]
+
+ Tiering favours the performance of updates.
+ Levelling favours the performance of lookups.
+ Lazy levelling strikes a middle ground between tiering and levelling.
+ It favours the performance of lookup operations for the oldest data and enables more deduplication,
+ without the impact that full levelling has on update operations.
+
+ ==== Fine-tuning: Merge Schedule #fine_tuning_merge_schedule#
+
+ The configuration parameter @confMergeSchedule@ affects the worst-case performance of lookup and update operations and the structure of runs.
+ Regardless of the merge schedule, the amortised disk I\/O complexity of lookups and updates is /logarithmic/ in the size of the table.
+ When the write buffer fills up, its contents are flushed to disk as a run and added to level 1.
+ When some level fills up, its contents are flushed down to the next level.
+ Eventually, as data is flushed down, runs must be merged.
+ This package supports two schedules for merging:
+
+ * Using the @OneShot@ merge schedule, runs must always be kept fully sorted and deduplicated.
+ However, flushing a run down to the next level may cause the next level to fill up,
+ in which case it too must be flushed and merged further down.
+ In the worst case, this can cascade down the entire table.
+ Consequently, the worst-case disk I\/O complexity of updates is /linear/ in the size of the table.
+ This is unsuitable for real-time systems and other use cases where unresponsiveness is unacceptable.
+ * Using the @Incremental@ merge schedule, runs can be /partially merged/, which allows the merging work to be spread out evenly across all update operations.
+ This aligns the worst-case and average-case disk I\/O complexity of updates—both are /logarithmic/ in the size of the table.
+ The cost is a small constant overhead for both lookup and update operations.
+
+ The merge schedule does not affect the performance of table unions.
+ Instead, there are separate operations for one-shot and incremental unions.
+ The amortised disk I\/O complexity of one-shot unions is /linear/ in the size of the tables.
+ For incremental unions, it is up to the user to spread the merging work out evenly over time.
+
+ ==== Fine-tuning: Bloom Filter Size #fine_tuning_bloom_filter_size#
+
+ The configuration parameter @confBloomFilterAlloc@ affects the size of the Bloom filters,
+ which balances the performance of lookups against the in-memory size of the table.
+
+ Tables maintain a [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in memory for each run on disk.
+ These Bloom filters are probabilistic data structures that are used to track which keys are present in their corresponding run.
+ Querying a Bloom filter returns either \"maybe\", meaning the key is possibly in the run, or \"no\", meaning the key is definitely not in the run.
+ When a query returns \"maybe\" while the key is /not/ in the run, this is referred to as a /false positive/.
+ While the database executes a lookup operation, any Bloom filter query that returns a false positive causes the database to unnecessarily read a run from disk.
+ The probability of these spurious reads follows a [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution) \(\text{Binomial}(r,\text{FPR})\)
+ where \(r\) refers to the number of runs and \(\text{FPR}\) refers to the false-positive rate of the Bloom filters.
+ Hence, the expected number of spurious reads for each lookup operation is \(r\cdot\text{FPR}\).
+ The number of runs \(r\) is proportional to the number of physical entries in the table. Its exact value depends on the merge policy of the table:
+
+ [@LazyLevelling@]
+ \(r = T (\log_T\frac{n}{B} - 1) + 1\).
+
+ The false-positive rate scales exponentially with the size of the Bloom filters in bits per entry.
+
+ +---------------------------+----------------------+
+ | False-positive rate (FPR) | Bits per entry (BPE) |
+ +===========================+======================+
+ | \(1\text{ in }10\) | \(\approx 4.77 \) |
+ +---------------------------+----------------------+
+ | \(1\text{ in }100\) | \(\approx 9.85 \) |
+ +---------------------------+----------------------+
+ | \(1\text{ in }1{,}000\) | \(\approx 15.78 \) |
+ +---------------------------+----------------------+
+ | \(1\text{ in }10{,}000\) | \(\approx 22.57 \) |
+ +---------------------------+----------------------+
+ | \(1\text{ in }100{,}000\) | \(\approx 30.22 \) |
+ +---------------------------+----------------------+
+
+ The configuration parameter @confBloomFilterAlloc@ can be specified in two ways:
+
+ [@AllocFixed bitsPerEntry@]
+ Allocate the requested number of bits per entry in the table.
+
+ The value must be strictly positive, but fractional values are permitted.
+ The recommended range is \([2, 24]\).
+
+ [@AllocRequestFPR falsePositiveRate@]
+ Allocate the required number of bits per entry to get the requested false-positive rate.
+
+ The value must be in the range \((0, 1)\).
+ The recommended range is \([1\mathrm{e}{ -5 },1\mathrm{e}{ -2 }]\).
+
+ The total in-memory size of all Bloom filters scales /linearly/ with the number of physical entries in the table and is determined by the number of physical entries multiplied by the number of bits per physical entry, i.e., \(n\cdot\text{BPE}\).
+ Let us consider a table with 100 million physical entries which uses the default table configuration for every parameter other than the Bloom filter size.
+
+ +---------------------------+----------------------+------------------------------------------------------------------+
+ | False-positive rate (FPR) | Bloom filter size | Expected spurious reads per lookup |
+ +===========================+======================+==================================================================+
+ | \(1\text{ in }10\) | \( 56.86\text{MiB}\) | \( 2.56\text{ spurious reads every lookup }\) |
+ +---------------------------+----------------------+------------------------------------------------------------------+
+ | \(1\text{ in }100\) | \(117.42\text{MiB}\) | \( 1 \text{ spurious read every } 3.91\text{ lookups }\) |
+ +---------------------------+----------------------+------------------------------------------------------------------+
+ | \(1\text{ in }1{,}000\) | \(188.11\text{MiB}\) | \( 1 \text{ spurious read every } 39.10\text{ lookups }\) |
+ +---------------------------+----------------------+------------------------------------------------------------------+
+ | \(1\text{ in }10{,}000\) | \(269.06\text{MiB}\) | \( 1 \text{ spurious read every } 391.01\text{ lookups }\) |
+ +---------------------------+----------------------+------------------------------------------------------------------+
+ | \(1\text{ in }100{,}000\) | \(360.25\text{MiB}\) | \( 1 \text{ spurious read every } 3910.19\text{ lookups }\) |
+ +---------------------------+----------------------+------------------------------------------------------------------+
+
+ ==== Fine-tuning: Fence-Pointer Index Type #fine_tuning_fence_pointer_index_type#
+
+ The configuration parameter @confFencePointerIndex@ affects the type and size of the fence-pointer indexes.
+ Tables maintain a fence-pointer index in memory for each run on disk.
+ These fence-pointer indexes store the keys at the boundaries of each page of memory to ensure that each lookup has to read at most one page of memory from each run.
+ Tables support two types of fence-pointer indexes:
+
+ [@OrdinaryIndex@]
+ Ordinary indexes are designed for any use case.
+
+ Ordinary indexes store one serialised key per page of memory.
+ The total in-memory size of all indexes is \(K \cdot \frac{n}{P}\) bits,
+ where \(K\) refers to the average size of a serialised key in bits.
+
+ [@CompactIndex@]
+ Compact indexes are designed for the use case where the keys in the table are uniformly distributed, such as when using hashes.
+
+ Compact indexes store the 64 most significant bits of the minimum serialised key of each page of memory.
+ This requires that serialised keys are /at least/ 64 bits in size.
+ Compact indexes store 1 additional bit per page of memory to resolve collisions, 1 additional bit per page of memory to mark entries that are larger than one page, and a negligible amount of memory for tie breakers.
+ The total in-memory size of all indexes is \(66 \cdot \frac{n}{P}\) bits.
+
+ ==== Fine-tuning: Disk Cache Policy #fine_tuning_disk_cache_policy#
+
+ The configuration parameter @confDiskCachePolicy@ determines how the database uses the OS page cache.
+ This may improve performance if the database's /access pattern/ has good /temporal locality/ or good /spatial locality/.
+ The database's access pattern refers to the pattern by which entries are accessed by lookup operations.
+ An access pattern has good temporal locality if it is likely to access entries that were recently accessed or updated.
+ An access pattern has good spatial locality if it is likely to access entries that have nearby keys.
+
+ * Use the @DiskCacheAll@ policy if the database's access pattern has either good spatial locality or both good spatial and temporal locality.
+ * Use the @DiskCacheLevelOneTo l@ policy if the database's access pattern has good temporal locality for updates only.
+ The variable @l@ determines the number of levels that are cached.
+ For a description of levels, see [Merge Policy, Size Ratio, and Write Buffer Size](#fine_tuning_data_layout).
+ With this setting, the database can be expected to cache up to \(\frac{k}{P}\) pages of memory,
+ where \(k\) refers to the number of entries that fit in levels \([1,l]\) and is defined as \(\sum_{i=1}^{l}BT^{i}\).
+ * Use the @DiskCacheNone@ policy if the database's access pattern does not have good spatial or temporal locality.
+ For instance, if the access pattern is uniformly random.
+
+ == References
The implementation of LSM-trees in this package draws inspiration from:
diff --git a/scripts/generate-readme.hs b/scripts/generate-readme.hs
index 743203064..4fdec09fb 100755
--- a/scripts/generate-readme.hs
+++ b/scripts/generate-readme.hs
@@ -7,7 +7,8 @@ build-depends:
, pandoc ^>=3.6.4
, text >=2.1
-}
-{-# LANGUAGE LambdaCase #-}
+{-# LANGUAGE LambdaCase #-}
+{-# LANGUAGE OverloadedStrings #-}
module Main (main) where
@@ -22,7 +23,7 @@ import qualified Distribution.Types.PackageDescription as PackageDescription
import Distribution.Utils.ShortText (fromShortText)
import System.IO (hPutStrLn, stderr)
import Text.Pandoc (runIOorExplode)
-import Text.Pandoc.Extensions (githubMarkdownExtensions)
+import Text.Pandoc.Extensions (getDefaultExtensions)
import Text.Pandoc.Options (ReaderOptions (..), WriterOptions (..),
def)
import Text.Pandoc.Readers (readHaddock)
@@ -45,6 +46,6 @@ main = do
runIOorExplode $ do
doc1 <- readHaddock def description
let doc2 = headerShift 1 doc1
- writeMarkdown def{writerExtensions = githubMarkdownExtensions} doc2
+ writeMarkdown def{writerExtensions = getDefaultExtensions "gfm"} doc2
let readme = T.unlines [readmeHeaderContent, body]
TIO.writeFile "README.md" readme
diff --git a/src/Database/LSMTree.hs b/src/Database/LSMTree.hs
index 6ffbb7b02..f9fdea764 100644
--- a/src/Database/LSMTree.hs
+++ b/src/Database/LSMTree.hs
@@ -113,13 +113,12 @@ module Database.LSMTree (
),
defaultTableConfig,
MergePolicy (LazyLevelling),
+ MergeSchedule (..),
SizeRatio (Four),
WriteBufferAlloc (AllocNumEntries),
BloomFilterAlloc (AllocFixed, AllocRequestFPR),
- defaultBloomFilterAlloc,
FencePointerIndexType (OrdinaryIndex, CompactIndex),
DiskCachePolicy (..),
- MergeSchedule (..),
-- ** Table Configuration Overrides #table_configuration_overrides#
OverrideDiskCachePolicy (..),
@@ -156,12 +155,6 @@ module Database.LSMTree (
resolveValidOutput,
resolveAssociativity,
- -- * Tracer
- Tracer,
- LSMTreeTrace (..),
- TableTrace (..),
- CursorTrace (..),
-
-- * Errors #errors#
SessionDirDoesNotExistError (..),
SessionDirLockedError (..),
@@ -178,6 +171,24 @@ module Database.LSMTree (
BlobRefInvalidError (..),
CursorClosedError (..),
InvalidSnapshotNameError (..),
+
+ -- * Traces #traces#
+ Tracer,
+ LSMTreeTrace (..),
+ TableTrace (..),
+ CursorTrace (..),
+ MergeTrace (..),
+ CursorId (..),
+ TableId (..),
+ AtLevel (..),
+ LevelNo (..),
+ NumEntries (..),
+ RunNumber (..),
+ MergePolicyForLevel (..),
+ LevelMergeType (..),
+ RunParams (..),
+ RunDataCaching (..),
+ IndexType (..),
) where
import Control.Concurrent.Class.MonadMVar.Strict (MonadMVar)
@@ -203,17 +214,24 @@ import qualified Database.LSMTree.Internal.BlobRef as Internal
import Database.LSMTree.Internal.Config
(BloomFilterAlloc (AllocFixed, AllocRequestFPR),
DiskCachePolicy (..), FencePointerIndexType (..),
- MergePolicy (..), MergeSchedule (..), SizeRatio (..),
- TableConfig (..), WriteBufferAlloc (..),
- defaultBloomFilterAlloc, defaultTableConfig,
- serialiseKeyMinimalSize)
+ LevelNo (..), MergePolicy (..), MergeSchedule (..),
+ SizeRatio (..), TableConfig (..), WriteBufferAlloc (..),
+ defaultTableConfig, serialiseKeyMinimalSize)
import Database.LSMTree.Internal.Config.Override
(OverrideDiskCachePolicy (..))
+import Database.LSMTree.Internal.Entry (NumEntries (..))
import qualified Database.LSMTree.Internal.Entry as Entry
+import Database.LSMTree.Internal.Merge (LevelMergeType (..))
+import Database.LSMTree.Internal.MergeSchedule (AtLevel (..),
+ MergePolicyForLevel (..), MergeTrace (..))
import Database.LSMTree.Internal.Paths (SnapshotName,
isValidSnapshotName, toSnapshotName)
import Database.LSMTree.Internal.Range (Range (..))
import Database.LSMTree.Internal.RawBytes (RawBytes (..))
+import Database.LSMTree.Internal.RunBuilder (IndexType (..),
+ RunDataCaching (..), RunParams (..))
+import Database.LSMTree.Internal.RunNumber (CursorId (..),
+ RunNumber (..), TableId (..))
import qualified Database.LSMTree.Internal.Serialise as Internal
import Database.LSMTree.Internal.Serialise.Class (SerialiseKey (..),
SerialiseKeyOrderPreserving, SerialiseValue (..),
diff --git a/src/Database/LSMTree/Internal/Config.hs b/src/Database/LSMTree/Internal/Config.hs
index 48f864875..4a86cce96 100644
--- a/src/Database/LSMTree/Internal/Config.hs
+++ b/src/Database/LSMTree/Internal/Config.hs
@@ -16,7 +16,6 @@ module Database.LSMTree.Internal.Config (
, WriteBufferAlloc (..)
-- * Bloom filter allocation
, BloomFilterAlloc (..)
- , defaultBloomFilterAlloc
, bloomFilterAllocForLevel
-- * Fence pointer index
, FencePointerIndexType (..)
@@ -27,7 +26,6 @@ module Database.LSMTree.Internal.Config (
, diskCachePolicyForLevel
-- * Merge schedule
, MergeSchedule (..)
- , defaultMergeSchedule
) where
import Control.DeepSeq (NFData (..))
@@ -48,26 +46,57 @@ newtype LevelNo = LevelNo Int
Table configuration
-------------------------------------------------------------------------------}
--- | Table configuration parameters, including LSM tree tuning parameters.
---
--- Some config options are fixed (for now):
---
--- * Merge policy: Tiering
---
--- * Size ratio: 4
+{- |
+A collection of configuration parameters for tables, which can be used to tune the performance of the table.
+To construct a 'TableConfig', modify the 'defaultTableConfig', which defines reasonable defaults for all parameters.
+
+For a detailed discussion of fine-tuning the table configuration, see [Fine-tuning Table Configuration](../#fine_tuning).
+
+[@confMergePolicy :: t'MergePolicy'@]
+ The /merge policy/ balances the performance of lookups against the performance of updates.
+ Levelling favours lookups.
+ Tiering favours updates.
+ Lazy levelling strikes a middle ground between levelling and tiering, and moderately favours updates.
+ This parameter is explicitly referenced in the documentation of those operations it affects.
+
+[@confSizeRatio :: t'SizeRatio'@]
+ The /size ratio/ pushes the effects of the merge policy to the extreme.
+ If the size ratio is higher, levelling favours lookups more, and tiering and lazy levelling favour updates more.
+ This parameter is referred to as \(T\) in the disk I\/O cost of operations.
+
+[@confWriteBufferAlloc :: t'WriteBufferAlloc'@]
+ The /write buffer capacity/ balances the performance of lookups and updates against the in-memory size of the database.
+ If the write buffer is larger, it takes up more memory, but lookups and updates are more efficient.
+ This parameter is referred to as \(B\) in the disk I\/O cost of operations.
+ Irrespective of this parameter, the write buffer size cannot exceed 4GiB.
+
+[@confMergeSchedule :: t'MergeSchedule'@]
+ The /merge schedule/ balances the performance of lookups and updates against the consistency of updates.
+ The merge schedule does not affect the performance of table unions.
+ With the one-shot merge schedule, lookups and updates are more efficient overall, but some updates may take much longer than others.
+ With the incremental merge schedule, lookups and updates are less efficient overall, but each update does a similar amount of work.
+ This parameter is explicitly referenced in the documentation of those operations it affects.
+
+[@confBloomFilterAlloc :: t'BloomFilterAlloc'@]
+ The Bloom filter size balances the performance of lookups against the in-memory size of the database.
+ If the Bloom filters are larger, they take up more memory, but lookup operations are more efficient.
+
+[@confFencePointerIndex :: t'FencePointerIndexType'@]
+ The /fence-pointer index type/ supports two types of indexes.
+ The /ordinary/ indexes are designed to work with any key.
+ The /compact/ indexes are optimised for the case where the keys in the database are uniformly distributed, e.g., when the keys are hashes.
+
+[@confDiskCachePolicy :: t'DiskCachePolicy'@]
+  The /disk cache policy/ determines if lookup operations use the OS page cache.
+ Caching may improve the performance of lookups if database access follows certain patterns.
+-}
data TableConfig = TableConfig {
confMergePolicy :: !MergePolicy
, confMergeSchedule :: !MergeSchedule
- -- Size ratio between the capacities of adjacent levels.
, confSizeRatio :: !SizeRatio
- -- | Total number of bytes that the write buffer can use.
- --
- -- The maximum is 4GiB, which should be more than enough for realistic
- -- applications.
, confWriteBufferAlloc :: !WriteBufferAlloc
, confBloomFilterAlloc :: !BloomFilterAlloc
, confFencePointerIndex :: !FencePointerIndexType
- -- | The policy for caching key\/value data from disk in memory.
, confDiskCachePolicy :: !DiskCachePolicy
}
deriving stock (Show, Eq)
@@ -76,19 +105,31 @@ instance NFData TableConfig where
rnf (TableConfig a b c d e f g) =
rnf a `seq` rnf b `seq` rnf c `seq` rnf d `seq` rnf e `seq` rnf f `seq` rnf g
--- | A reasonable default 'TableConfig'.
+-- | The 'defaultTableConfig' defines reasonable defaults for all 'TableConfig' parameters.
--
--- This uses a write buffer with up to 20,000 elements and a generous amount of
--- memory for Bloom filters (FPR of 1%).
+-- >>> confMergePolicy defaultTableConfig
+-- LazyLevelling
+-- >>> confMergeSchedule defaultTableConfig
+-- Incremental
+-- >>> confSizeRatio defaultTableConfig
+-- Four
+-- >>> confWriteBufferAlloc defaultTableConfig
+-- AllocNumEntries 20000
+-- >>> confBloomFilterAlloc defaultTableConfig
+-- AllocRequestFPR 1.0e-3
+-- >>> confFencePointerIndex defaultTableConfig
+-- OrdinaryIndex
+-- >>> confDiskCachePolicy defaultTableConfig
+-- DiskCacheAll
--
defaultTableConfig :: TableConfig
defaultTableConfig =
TableConfig
{ confMergePolicy = LazyLevelling
- , confMergeSchedule = defaultMergeSchedule
+ , confMergeSchedule = Incremental
, confSizeRatio = Four
, confWriteBufferAlloc = AllocNumEntries 20_000
- , confBloomFilterAlloc = defaultBloomFilterAlloc
+ , confBloomFilterAlloc = AllocRequestFPR 1.0e-3
, confFencePointerIndex = OrdinaryIndex
, confDiskCachePolicy = DiskCacheAll
}
@@ -107,12 +148,19 @@ runParamsForLevel conf@TableConfig {..} levelNo =
Merge policy
-------------------------------------------------------------------------------}
+{- |
+The /merge policy/ balances the performance of lookups against the performance of updates.
+Levelling favours lookups.
+Tiering favours updates.
+Lazy levelling strikes a middle ground between levelling and tiering, and moderately favours updates.
+This parameter is explicitly referenced in the documentation of those operations it affects.
+
+__NOTE:__ This package only supports lazy levelling.
+
+For a detailed discussion of the merge policy, see [Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size](../#fine_tuning_data_layout).
+-}
data MergePolicy =
- -- | Use tiering on intermediate levels, and levelling on the last level.
- -- This makes it easier for delete operations to disappear on the last
- -- level.
LazyLevelling
- -- TODO: add other merge policies, like tiering and levelling.
deriving stock (Eq, Show)
instance NFData MergePolicy where
@@ -122,6 +170,15 @@ instance NFData MergePolicy where
Size ratio
-------------------------------------------------------------------------------}
+{- |
+The /size ratio/ pushes the effects of the merge policy to the extreme.
+If the size ratio is higher, levelling favours lookups more, and tiering and lazy levelling favour updates more.
+This parameter is referred to as \(T\) in the disk I\/O cost of operations.
+
+__NOTE:__ This package only supports a size ratio of four.
+
+For a detailed discussion of the size ratio, see [Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size](../#fine_tuning_data_layout).
+-}
data SizeRatio = Four
deriving stock (Eq, Show)
@@ -135,53 +192,83 @@ sizeRatioInt = \case Four -> 4
Write buffer allocation
-------------------------------------------------------------------------------}
--- | Allocation method for the write buffer.
+-- TODO: "If the sizes of values vary greatly, this can lead to unevenly sized runs on disk and unpredictable performance."
+
+{- |
+The /write buffer capacity/ balances the performance of lookups and updates against the in-memory size of the table.
+If the write buffer is larger, it takes up more memory, but lookups and updates are more efficient.
+Irrespective of this parameter, the write buffer size cannot exceed 4GiB.
+
+For a detailed discussion of the write buffer capacity, see [Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size](../#fine_tuning_data_layout).
+-}
data WriteBufferAlloc =
- -- | Total number of key\/value pairs that can be present in the write
- -- buffer before flushing the write buffer to disk.
- --
- -- NOTE: if the sizes of values vary greatly, this can lead to wonky runs on
- -- disk, and therefore unpredictable performance.
+ {- |
+ Allocate space for the in-memory write buffer to fit the requested number of entries.
+ This parameter is referred to as \(B\) in the disk I\/O cost of operations.
+ -}
AllocNumEntries !Int
deriving stock (Show, Eq)
instance NFData WriteBufferAlloc where
rnf (AllocNumEntries n) = rnf n
+{-------------------------------------------------------------------------------
+ Merge schedule
+-------------------------------------------------------------------------------}
+
+{- |
+The /merge schedule/ balances the performance of lookups and updates against the consistency of updates.
+The merge schedule does not affect the performance of table unions.
+With the one-shot merge schedule, lookups and updates are more efficient overall, but some updates may take much longer than others.
+With the incremental merge schedule, lookups and updates are less efficient overall, but each update does a similar amount of work.
+This parameter is explicitly referenced in the documentation of those operations it affects.
+
+For a detailed discussion of the effect of the merge schedule, see [Fine-tuning: Merge Schedule](../#fine_tuning_merge_schedule).
+-}
+data MergeSchedule =
+ {- |
+ The 'OneShot' merge schedule causes the merging algorithm to complete merges immediately.
+ This is more efficient than the 'Incremental' merge schedule, but has an inconsistent workload.
+ Using the 'OneShot' merge schedule, the worst-case disk I\/O complexity of the update operations is /linear/ in the size of the table.
+ For real-time systems and other use cases where unresponsiveness is unacceptable, use the 'Incremental' merge schedule.
+ -}
+ OneShot
+ {- |
+ The 'Incremental' merge schedule spreads out the merging work over time.
+ This is less efficient than the 'OneShot' merge schedule, but has a consistent workload.
+ Using the 'Incremental' merge schedule, the worst-case disk I\/O complexity of the update operations is /logarithmic/ in the size of the table.
+ -}
+ | Incremental
+ deriving stock (Eq, Show)
+
+instance NFData MergeSchedule where
+ rnf OneShot = ()
+ rnf Incremental = ()
+
{-------------------------------------------------------------------------------
Bloom filter allocation
-------------------------------------------------------------------------------}
--- | Allocation method for bloom filters.
---
--- NOTE: a __physical__ database entry is a key\/operation pair that exists in a
--- file, i.e., a run. Multiple physical entries that have the same key
--- constitute a __logical__ database entry.
---
--- There is a trade-off between bloom filter memory size, and the false
--- positive rate. A higher false positive rate (FPR) leads to more unnecessary
--- I\/O. As a guide, here are some points on the trade-off:
---
--- * FPR of 1e-2 requires approximately 9.9 bits per element
--- * FPR of 1e-3 requires approximately 15.8 bits per element
--- * FPR of 1e-4 requires approximately 22.6 bits per element
---
--- The policy can be specified either by fixing a FPR or by fixing the number
--- of bits per entry.
---
+{- |
+The Bloom filter size balances the performance of lookups against the in-memory size of the table.
+If the Bloom filters are larger, they take up more memory, but lookup operations are more efficient.
+
+For a detailed discussion of the Bloom filter size, see [Fine-tuning: Bloom Filter Size](../#fine_tuning_bloom_filter_size).
+-}
data BloomFilterAlloc =
- -- | Allocate a fixed number of bits per physical entry in each bloom
- -- filter. Non-integer values are legal. Once the number of entries is know,
- -- the number of bits is rounded.
- --
- -- The value must strictly positive, 0 < x. Sane values are 2 .. 24.
- --
+ {- |
+ Allocate the requested number of bits per entry in the table.
+
+ The value must be strictly positive, but fractional values are permitted.
+ The recommended range is \([2, 24]\).
+ -}
AllocFixed !Double
- | -- | Allocate as many bits as required per physical entry to get the requested
- -- false-positive rate. Do this for each bloom filter.
- --
- -- The value must be in the range 0 < x < 1. Sane values are 1e-2 .. 1e-5.
- --
+ | {- |
+ Allocate the required number of bits per entry to get the requested false-positive rate.
+
+ The value must be in the range \((0, 1)\).
+ The recommended range is \([1\mathrm{e}{-5}, 1\mathrm{e}{-2}]\).
+ -}
AllocRequestFPR !Double
deriving stock (Show, Eq)
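+
+-- As a rough guide to the trade-off: an FPR of 1e-2 costs approximately
+-- 9.9 bits per entry, an FPR of 1e-3 approximately 15.8 bits per entry, and
+-- an FPR of 1e-4 approximately 22.6 bits per entry. For example, the two
+-- allocations below are roughly equivalent (a sketch, not an exact identity;
+-- @byRate@ and @byBits@ are hypothetical names):
+--
+-- @
+-- byRate, byBits :: BloomFilterAlloc
+-- byRate = AllocRequestFPR 1.0e-3  -- request a 0.1% false-positive rate
+-- byBits = AllocFixed 15.8         -- approximately the same filter size
+-- @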
@@ -189,9 +276,6 @@ instance NFData BloomFilterAlloc where
rnf (AllocFixed n) = rnf n
rnf (AllocRequestFPR fpr) = rnf fpr
-defaultBloomFilterAlloc :: BloomFilterAlloc
-defaultBloomFilterAlloc = AllocRequestFPR 1e-3
-
bloomFilterAllocForLevel :: TableConfig -> RunLevelNo -> RunBloomFilterAlloc
bloomFilterAllocForLevel conf _levelNo =
case confBloomFilterAlloc conf of
@@ -202,27 +286,31 @@ bloomFilterAllocForLevel conf _levelNo =
Fence pointer index
-------------------------------------------------------------------------------}
--- | Configure the type of fence pointer index.
+{- |
+The /fence-pointer index type/ determines which of the two index types is used.
+The /ordinary/ indexes are designed to work with any key.
+The /compact/ indexes are optimised for the case where the keys in the database are uniformly distributed, e.g., when the keys are hashes.
+
+For a detailed discussion the fence-pointer index types, see [Fine-tuning: Fence-Pointer Index Type](../#fine_tuning_fence_pointer_index_type).
+-}
data FencePointerIndexType =
- -- | Use a compact fence pointer index.
- --
- -- Compact indexes are designed to work with keys that are large (for
- -- example, 32 bytes long) cryptographic hashes.
- --
- -- When using a compact index, it is vital that the
- -- 'Database.LSMTree.Internal.Serialise.Class.serialiseKey' function
- -- satisfies the following law:
- --
- -- [Minimal size] @'Database.LSMTree.Internal.RawBytes.size'
- -- ('Database.LSMTree.Internal.Serialise.Class.serialiseKey' x) >= 8@
- --
- -- Use 'serialiseKeyMinimalSize' to test this law.
+ {- |
+ Ordinary indexes are designed to work with any key.
+
+ When using an ordinary index, the 'Database.LSMTree.Internal.Serialise.Class.serialiseKey' function must produce output smaller than 64 KiB.
+ -}
+ OrdinaryIndex
+ | {- |
+ Compact indexes are designed for the case where the keys in the database are uniformly distributed, e.g., when the keys are hashes.
+
+ When using a compact index, the 'Database.LSMTree.Internal.Serialise.Class.serialiseKey' function must satisfy the following additional law:
+
+ [Minimal size]
+ @'Database.LSMTree.Internal.RawBytes.size' ('Database.LSMTree.Internal.Serialise.Class.serialiseKey' x) >= 8@
+
+ Use 'serialiseKeyMinimalSize' to test this law.
+ -}
CompactIndex
- -- | Use an ordinary fence pointer index
- --
- -- Ordinary indexes do not have any constraints on keys other than that
- -- their serialised forms may not be 64 KiB or more in size.
- | OrdinaryIndex
deriving stock (Eq, Show)
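+
+-- A sketch of checking the __Minimal size__ law before choosing 'CompactIndex'
+-- (assuming "Data.Word" is in scope; the key values are illustrative):
+--
+-- @
+-- serialiseKeyMinimalSize (42 :: Word64)  -- True: serialises to 8 bytes
+-- serialiseKeyMinimalSize (42 :: Word8)   -- False: serialises to 1 byte
+-- @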
instance NFData FencePointerIndexType where
@@ -241,48 +329,41 @@ serialiseKeyMinimalSize x = RB.size (serialiseKey x) >= 8
Disk cache policy
-------------------------------------------------------------------------------}
--- | The policy for caching data from disk in memory (using the OS page cache).
---
--- Caching data in memory can improve performance if the access pattern has
--- good access locality or if the overall data size fits within memory. On the
--- other hand, caching is detrimental to performance and wastes memory if the
--- access pattern has poor spatial or temporal locality.
---
--- This implementation is designed to have good performance using a cacheless
--- policy, where main memory is used only to cache Bloom filters and indexes,
--- but none of the key\/value data itself. Nevertheless, some use cases will be
--- faster if some or all of the key\/value data is also cached in memory. This
--- implementation does not do any custom caching of key\/value data, relying
--- simply on the OS page cache. Thus caching is done in units of 4kb disk pages
--- (as opposed to individual key\/value pairs for example).
---
-data DiskCachePolicy =
+{- |
+The /disk cache policy/ determines if lookup operations use the OS page cache.
+Caching may improve the performance of lookups if database access follows certain patterns.
- -- | Use the OS page cache to cache any\/all key\/value data in the
- -- table.
- --
- -- Use this policy if the expected access pattern for the table
- -- has a good spatial or temporal locality.
- DiskCacheAll
-
- -- | Use the OS page cache to cache data in all LSMT levels from 0 to
- -- a given level number. For example, use 1 to cache the first level.
- -- (The write buffer is considered to be level 0.)
- --
- -- Use this policy if the expected access pattern for the table
- -- has good temporal locality for recently inserted keys.
- | DiskCacheLevelOneTo !Int
-
- --TODO: Add a policy based on size in bytes rather than internal details
- -- like levels. An easy use policy would be to say: "cache the first 10
- -- Mb" and have everything worked out from that.
-
- -- | Do not cache any key\/value data in any level (except the write
- -- buffer).
- --
- -- Use this policy if expected access pattern for the table has poor
- -- spatial or temporal locality, such as uniform random access.
- | DiskCacheNone
+For a detailed discussion of the disk cache policy, see [Fine-tuning: Disk Cache Policy](../#fine_tuning_disk_cache_policy).
+-}
+data DiskCachePolicy =
+ {- |
+ Cache all data in the table.
+
+ Use this policy if the database's access pattern has good spatial locality, whether or not it also has good temporal locality.
+ -}
+ DiskCacheAll
+
+ | {- |
+ Cache the data in the freshest @l@ levels.
+
+ Use this policy if the database's access pattern only has good temporal locality.
+
+ The variable @l@ determines the number of levels that are cached.
+ For a description of levels, see [Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size](../#fine_tuning_data_layout).
+ With this setting, the database can be expected to cache up to \(\frac{k}{P}\) pages of memory,
+ where \(k\) refers to the number of entries that fit in levels \([1,l]\) and is defined as \(\sum_{i=1}^{l}BT^{i}\).
+ -}
+ -- TODO: Add a policy for caching based on size in bytes, rather than exposing internal details such as levels.
+ -- For instance, a policy that states "cache the freshest 10MiB"
+ DiskCacheLevelOneTo !Int
+
+ | {- |
+ Do not cache any table data.
+
+ Use this policy if the database's access pattern has neither good spatial nor good temporal locality,
+ for instance, if the access pattern is uniformly random.
+ -}
+ DiskCacheNone
deriving stock (Show, Eq)
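+
+-- A worked example for 'DiskCacheLevelOneTo' under the default configuration,
+-- i.e. with write buffer capacity \(B = 20{,}000\) and size ratio \(T = 4\):
+-- @DiskCacheLevelOneTo 2@ caches levels 1 and 2, which hold up to
+-- \(B \cdot T + B \cdot T^{2} = 80{,}000 + 320{,}000 = 400{,}000\) entries,
+-- i.e. up to \(400{,}000 / P\) pages.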
instance NFData DiskCachePolicy where
@@ -303,40 +384,3 @@ diskCachePolicyForLevel policy levelNo =
RegularLevel l | l <= LevelNo n -> CacheRunData
| otherwise -> NoCacheRunData
UnionLevel -> NoCacheRunData
-
-{-------------------------------------------------------------------------------
- Merge schedule
--------------------------------------------------------------------------------}
-
--- | A configuration option that determines how merges are stepped to
--- completion. This does not affect the amount of work that is done by merges,
--- only how the work is spread out over time.
-data MergeSchedule =
- -- | Complete merges immediately when started.
- --
- -- The 'OneShot' option will make the merging algorithm perform /big/ batches
- -- of work in one go, so intermittent slow-downs can be expected. For use
- -- cases where unresponsiveness is unacceptable, e.g. in real-time systems,
- -- use 'Incremental' instead.
- OneShot
- -- | Schedule merges for incremental construction, and step the merge when
- -- updates are performed on a table.
- --
- -- The 'Incremental' option spreads out merging work over time. More
- -- specifically, updates to a table can cause a /small/ batch of merge work
- -- to be performed. The scheduling of these batches is designed such that
- -- merges are fully completed in time for when new merges are started on the
- -- same level.
- | Incremental
- deriving stock (Eq, Show)
-
-instance NFData MergeSchedule where
- rnf OneShot = ()
- rnf Incremental = ()
-
--- | The default 'MergeSchedule'.
---
--- >>> defaultMergeSchedule
--- Incremental
-defaultMergeSchedule :: MergeSchedule
-defaultMergeSchedule = Incremental
diff --git a/src/Database/LSMTree/Internal/Config/Override.hs b/src/Database/LSMTree/Internal/Config/Override.hs
index 6eac28965..a2e7d5877 100644
--- a/src/Database/LSMTree/Internal/Config/Override.hs
+++ b/src/Database/LSMTree/Internal/Config/Override.hs
@@ -48,10 +48,13 @@ import Database.LSMTree.Internal.Snapshot
Override disk cache policy
-------------------------------------------------------------------------------}
--- | Override the 'DiskCachePolicy'
+{- |
+The 'OverrideDiskCachePolicy' can be used to override the 'DiskCachePolicy'
+when opening a table from a snapshot.
+-}
data OverrideDiskCachePolicy =
- OverrideDiskCachePolicy DiskCachePolicy
- | NoOverrideDiskCachePolicy
+ NoOverrideDiskCachePolicy
+ | OverrideDiskCachePolicy DiskCachePolicy
deriving stock (Show, Eq)
-- | Override the disk cache policy that is stored in snapshot metadata.
diff --git a/src/Database/LSMTree/Internal/MergeSchedule.hs b/src/Database/LSMTree/Internal/MergeSchedule.hs
index b515753e2..0690b7188 100644
--- a/src/Database/LSMTree/Internal/MergeSchedule.hs
+++ b/src/Database/LSMTree/Internal/MergeSchedule.hs
@@ -494,7 +494,7 @@ updatesWithInterleavedFlushes tr conf resolve hfs hbio root uc es reg tc = do
(wb', es') <- addWriteBufferEntries hfs resolve wbblobs maxn wb es
-- Supply credits before flushing, so that we complete merges in time. The
-- number of supplied credits is based on the size increase of the write
- -- buffer, not the the number of processed entries @length es' - length es@.
+ -- buffer, not the number of processed entries @length es' - length es@.
let numAdded = unNumEntries (WB.numEntries wb') - unNumEntries (WB.numEntries wb)
supplyCredits conf (NominalCredits numAdded) (tableLevels tc)
let tc' = tc { tableWriteBuffer = wb' }
diff --git a/src/Database/LSMTree/Internal/Range.hs b/src/Database/LSMTree/Internal/Range.hs
index 44aed84db..27421f48a 100644
--- a/src/Database/LSMTree/Internal/Range.hs
+++ b/src/Database/LSMTree/Internal/Range.hs
@@ -13,9 +13,13 @@ import Control.DeepSeq (NFData (..))
-- | A range of keys.
data Range k =
- -- | Inclusive lower bound, exclusive upper bound
+ {- |
+ @'FromToExcluding' i j@ is the range from @i@ (inclusive) to @j@ (exclusive).
+ -}
FromToExcluding k k
- -- | Inclusive lower bound, inclusive upper bound
+ {- |
+ @'FromToIncluding' i j@ is the range from @i@ (inclusive) to @j@ (inclusive).
+ -}
| FromToIncluding k k
deriving stock (Show, Eq, Functor)
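+
+-- For example, over an integer key type, @FromToExcluding 3 5@ covers the
+-- keys 3 and 4, whereas @FromToIncluding 3 5@ covers the keys 3, 4, and 5.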
diff --git a/src/Database/LSMTree/Internal/RawBytes.hs b/src/Database/LSMTree/Internal/RawBytes.hs
index abc827bb1..bc4aa412f 100644
--- a/src/Database/LSMTree/Internal/RawBytes.hs
+++ b/src/Database/LSMTree/Internal/RawBytes.hs
@@ -69,6 +69,7 @@ import Prelude hiding (drop, take)
import GHC.Exts
import GHC.Stack
import GHC.Word
+import Text.Printf (printf)
{- Note: [Export structure]
~~~~~~~~~~~~~~~~~~~~~~~
@@ -80,15 +81,30 @@ import GHC.Word
Raw bytes
-------------------------------------------------------------------------------}
--- | Raw bytes with no alignment constraint (i.e. byte aligned), and no
--- guarantee of pinned or unpinned memory (i.e. could be either).
+{- |
+Raw bytes.
+
+This type imposes no alignment constraint and provides no guarantee of whether the memory is pinned or unpinned.
+-}
newtype RawBytes = RawBytes (VP.Vector Word8)
- deriving newtype (Show, NFData)
+ deriving newtype (NFData)
+
+-- TODO: Should we have a more well-behaved instance for 'Show'?
+-- For instance, an instance that prints the bytes as a hexadecimal string?
+deriving newtype instance Show RawBytes
+
+_showBytesAsHex :: RawBytes -> ShowS
+_showBytesAsHex (RawBytes bytes) = VP.foldr ((.) . showByte) id bytes
+ where
+ showByte :: Word8 -> ShowS
+ showByte = showString . printf "%02x"
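+
+-- For example (a sketch of the intended rendering, not a doctest):
+--
+-- @
+-- _showBytesAsHex (fromList [0x00, 0x0f, 0xff]) "" == "000fff"
+-- @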
instance Eq RawBytes where
bs1 == bs2 = compareBytes bs1 bs2 == EQ
--- | Lexicographical 'Ord' instance.
+{- |
+This instance uses lexicographic ordering.
+-}
instance Ord RawBytes where
compare = compareBytes
@@ -113,6 +129,11 @@ instance Hashable RawBytes where
hash :: Word64 -> RawBytes -> Word64
hash salt (RawBytes (VP.Vector off len ba)) = hashByteArray ba off len salt
+{- |
+@'fromList'@: \(O(n)\).
+
+@'toList'@: \(O(n)\).
+-}
instance IsList RawBytes where
type Item RawBytes = Word8
@@ -122,9 +143,13 @@ instance IsList RawBytes where
toList :: RawBytes -> [Item RawBytes]
toList = unpack
--- | Mostly to make test cases shorter to write.
+{- |
+@'fromString'@: \(O(n)\).
+
+__Warning:__ 'fromString' truncates multi-byte characters to octets, e.g., \"枯朶に烏のとまりけり秋の暮\" becomes \"�6k�nh~�Q��n�\".
+-}
instance IsString RawBytes where
- fromString = pack . map (fromIntegral . fromEnum)
+ fromString = fromByteString . fromString
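+
+-- For ASCII-only literals the truncation above is harmless; for example, with
+-- @OverloadedStrings@ the literal @"lsm" :: RawBytes@ denotes the bytes
+-- @[0x6c, 0x73, 0x6d]@.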
{-------------------------------------------------------------------------------
Accessors
@@ -171,9 +196,19 @@ toWord64 x# = byteSwap64 (W64# x#)
Construction
-------------------------------------------------------------------------------}
+{- |
+@('<>')@: \(O(n)\).
+
+@'Data.Semigroup.sconcat'@: \(O(n)\).
+-}
instance Semigroup RawBytes where
(<>) = coerce (VP.++)
+{- |
+@'mempty'@: \(O(1)\).
+
+@'mconcat'@: \(O(n)\).
+-}
instance Monoid RawBytes where
mempty = coerce VP.empty
mconcat = coerce VP.concat
diff --git a/src/Database/LSMTree/Internal/Serialise/Class.hs b/src/Database/LSMTree/Internal/Serialise/Class.hs
index f83904bff..314cabffc 100644
--- a/src/Database/LSMTree/Internal/Serialise/Class.hs
+++ b/src/Database/LSMTree/Internal/Serialise/Class.hs
@@ -42,13 +42,15 @@ import Numeric (showInt)
SerialiseKey
-------------------------------------------------------------------------------}
--- | Serialisation of keys.
---
--- Instances should satisfy the following laws:
---
--- [Identity] @'deserialiseKey' ('serialiseKey' x) == x@
--- [Identity up to slicing] @'deserialiseKey' ('packSlice' prefix ('serialiseKey' x) suffix) == x@
---
+{- | Serialisation of keys.
+
+Instances should satisfy the following laws:
+
+[Identity]
+ @'deserialiseKey' ('serialiseKey' x) == x@
+[Identity up to slicing]
+ @'deserialiseKey' ('packSlice' prefix ('serialiseKey' x) suffix) == x@
+-}
class SerialiseKey k where
serialiseKey :: k -> RawBytes
-- TODO: 'deserialiseKey' is only strictly necessary for range queries.
@@ -67,26 +69,27 @@ serialiseKeyIdentityUpToSlicing ::
serialiseKeyIdentityUpToSlicing prefix x suffix =
deserialiseKey (packSlice prefix (serialiseKey x) suffix) == x
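+
+-- A sketch of an instance for a user-defined key type (@UserId@ is a
+-- hypothetical example; it delegates to the 'Word64' instance, so the
+-- __Identity__ laws follow from that instance):
+--
+-- @
+-- newtype UserId = UserId Word64
+--
+-- instance SerialiseKey UserId where
+--   serialiseKey (UserId w) = serialiseKey w
+--   deserialiseKey bs       = UserId (deserialiseKey bs)
+-- @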
--- | Order-preserving serialisation of keys
---
--- Internally, the library sorts key\/value pairs using the ordering of
--- /serialised/ keys. Range lookups and cursor reads return key\/value according
--- to this ordering. As such, if serialisation does not preserve the ordering of
--- /unserialised/ keys, then range lookups and cursor reads will return
--- /unserialised/ keys out of order.
---
--- Instances that prevent keys from being returned out of order should satisfy
--- the following law:
---
--- [Ordering-preserving] @x \`'compare'\` y == 'serialiseKey' x \`'compare'\` 'serialiseKey' y@
---
--- Serialised keys (raw bytes) are lexicographically ordered, which means that
--- keys should be serialised into big-endian formats to satisfy the
--- __Ordering-preserving__ law,
---
+{- |
+Order-preserving serialisation of keys.
+
+Table data is sorted by /serialised/ keys.
+Range lookups and cursors return entries in this order.
+If serialisation does not preserve the ordering of /unserialised/ keys,
+then range lookups and cursors return entries out of order.
+
+If the 'SerialiseKey' instance for a type preserves the ordering,
+then it can safely be given an instance of 'SerialiseKeyOrderPreserving'.
+These should satisfy the following law:
+
+[Order-preserving]
+ @x \`'compare'\` y == 'serialiseKey' x \`'compare'\` 'serialiseKey' y@
+
+Serialised keys are lexicographically ordered.
+To satisfy the __Order-preserving__ law, keys should be serialised into a big-endian format.
+-}
class SerialiseKey k => SerialiseKeyOrderPreserving k where
--- | Test the __Ordering-preserving__ law for the 'SerialiseKeyOrderPreserving' class
+-- | Test the __Order-preserving__ law for the 'SerialiseKeyOrderPreserving' class
serialiseKeyPreservesOrdering :: (Ord k, SerialiseKey k) => k -> k -> Bool
serialiseKeyPreservesOrdering x y = x `compare` y == serialiseKey x `compare` serialiseKey y
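+
+-- For example, the big-endian 'Word64' instance preserves ordering, whereas
+-- the 'Int' instance does not once negative keys are involved (a sketch, not
+-- a doctest):
+--
+-- @
+-- serialiseKeyPreservesOrdering (1 :: Word64) (2 :: Word64)  -- True
+-- serialiseKeyPreservesOrdering (-1 :: Int) (0 :: Int)       -- False
+-- @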
@@ -94,12 +97,16 @@ serialiseKeyPreservesOrdering x y = x `compare` y == serialiseKey x `compare` se
SerialiseValue
-------------------------------------------------------------------------------}
--- | Serialisation of values and blobs.
---
--- Instances should satisfy the following laws:
---
--- [Identity] @'deserialiseValue' ('serialiseValue' x) == x@
--- [Identity up to slicing] @'deserialiseValue' ('packSlice' prefix ('serialiseValue' x) suffix) == x@
+{- | Serialisation of values and blobs.
+
+Instances should satisfy the following laws:
+
+[Identity]
+ @'deserialiseValue' ('serialiseValue' x) == x@
+
+[Identity up to slicing]
+ @'deserialiseValue' ('packSlice' prefix ('serialiseValue' x) suffix) == x@
+-}
class SerialiseValue v where
serialiseValue :: v -> RawBytes
deserialiseValue :: RawBytes -> v
@@ -147,60 +154,110 @@ requireBytesExactly tyName expected actual x
Int
-------------------------------------------------------------------------------}
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(1)\).
+-}
instance SerialiseKey Int8 where
serialiseKey x = RB.RawBytes $ byteVectorFromPrim x
deserialiseKey (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Int8" 1 len $ indexInt8Array ba off
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(1)\).
+-}
instance SerialiseValue Int8 where
serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x
deserialiseValue (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Int8" 1 len $ indexInt8Array ba off
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(1)\).
+-}
instance SerialiseKey Int16 where
serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapInt16 x
deserialiseKey (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Int16" 2 len $ byteSwapInt16 (indexWord8ArrayAsInt16 ba off)
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(1)\).
+-}
instance SerialiseValue Int16 where
serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x
deserialiseValue (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Int16" 2 len $ indexWord8ArrayAsInt16 ba off
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(1)\).
+-}
instance SerialiseKey Int32 where
serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapInt32 x
deserialiseKey (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Int32" 4 len $ byteSwapInt32 (indexWord8ArrayAsInt32 ba off)
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(1)\).
+-}
instance SerialiseValue Int32 where
serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x
deserialiseValue (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Int32" 4 len $ indexWord8ArrayAsInt32 ba off
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(1)\).
+-}
instance SerialiseKey Int64 where
serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapInt64 x
deserialiseKey (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Int64" 8 len $ byteSwapInt64 (indexWord8ArrayAsInt64 ba off)
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(1)\).
+-}
instance SerialiseValue Int64 where
serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x
deserialiseValue (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Int64" 8 len $ indexWord8ArrayAsInt64 ba off
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(1)\).
+-}
instance SerialiseKey Int where
serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapInt x
deserialiseKey (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Int" 8 len $ byteSwapInt (indexWord8ArrayAsInt ba off)
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(1)\).
+-}
instance SerialiseValue Int where
serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x
@@ -211,6 +268,11 @@ instance SerialiseValue Int where
Word
-------------------------------------------------------------------------------}
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(1)\).
+-}
instance SerialiseKey Word8 where
serialiseKey x = RB.RawBytes $ byteVectorFromPrim x
@@ -219,13 +281,22 @@ instance SerialiseKey Word8 where
instance SerialiseKeyOrderPreserving Word8
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(1)\).
+-}
instance SerialiseValue Word8 where
serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x
deserialiseValue (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Word8" 1 len $ indexWord8Array ba off
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(1)\).
+-}
instance SerialiseKey Word16 where
serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapWord16 x
@@ -234,12 +305,22 @@ instance SerialiseKey Word16 where
instance SerialiseKeyOrderPreserving Word16
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(1)\).
+-}
instance SerialiseValue Word16 where
serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x
deserialiseValue (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Word16" 2 len $ indexWord8ArrayAsWord16 ba off
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(1)\).
+-}
instance SerialiseKey Word32 where
serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapWord32 x
@@ -248,12 +329,22 @@ instance SerialiseKey Word32 where
instance SerialiseKeyOrderPreserving Word32
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(1)\).
+-}
instance SerialiseValue Word32 where
serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x
deserialiseValue (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Word32" 4 len $ indexWord8ArrayAsWord32 ba off
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(1)\).
+-}
instance SerialiseKey Word64 where
serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapWord64 x
@@ -262,12 +353,22 @@ instance SerialiseKey Word64 where
instance SerialiseKeyOrderPreserving Word64
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(1)\).
+-}
instance SerialiseValue Word64 where
serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x
deserialiseValue (RawBytes (VP.Vector off len ba)) =
requireBytesExactly "Word64" 8 len $ indexWord8ArrayAsWord64 ba off
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(1)\).
+-}
instance SerialiseKey Word where
serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapWord x
@@ -276,6 +377,11 @@ instance SerialiseKey Word where
instance SerialiseKeyOrderPreserving Word
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(1)\).
+-}
instance SerialiseValue Word where
serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x
@@ -286,21 +392,29 @@ instance SerialiseValue Word where
String
-------------------------------------------------------------------------------}
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of characters in
--- the string. The string is encoded using UTF8.
---
--- TODO: optimise, it's \( O(n) + O(n) \) where it could be \( O(n) \).
+{- |
+@'serialiseKey'@: \(O(n)\).
+
+@'deserialiseKey'@: \(O(n)\).
+
+The 'String' is (de)serialised as UTF-8.
+-}
instance SerialiseKey String where
+ -- TODO: Optimise. The performance is \(O(n) + O(n)\) but it could be \(O(n)\).
serialiseKey = serialiseKey . UTF8.fromString
deserialiseKey = UTF8.toString . deserialiseKey
instance SerialiseKeyOrderPreserving String
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of characters in
--- the string.
---
--- TODO: optimise, it's \( O(n) + O(n) \) where it could be \( O(n) \).
+{- |
+@'serialiseValue'@: \(O(n)\).
+
+@'deserialiseValue'@: \(O(n)\).
+
+The 'String' is (de)serialised as UTF-8.
+-}
instance SerialiseValue String where
+ -- TODO: Optimise. The performance is \(O(n) + O(n)\) but it could be \(O(n)\).
serialiseValue = serialiseValue . UTF8.fromString
deserialiseValue = UTF8.toString . deserialiseValue
@@ -308,42 +422,64 @@ instance SerialiseValue String where
ByteString
-------------------------------------------------------------------------------}
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of bytes
---
--- TODO: optimise, it's \( O(n) + O(n) \) where it could be \( O(n) \).
+{- |
+@'serialiseKey'@: \(O(n)\).
+
+@'deserialiseKey'@: \(O(n)\).
+-}
instance SerialiseKey LBS.ByteString where
+ -- TODO: Optimise. The performance is \(O(n) + O(n)\) but it could be \(O(n)\).
serialiseKey = serialiseKey . LBS.toStrict
deserialiseKey = B.toLazyByteString . RB.builder
instance SerialiseKeyOrderPreserving LBS.ByteString
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of bytes
---
--- TODO: optimise, it's \( O(n) + O(n) \) where it could be \( O(n) \).
+{- |
+@'serialiseValue'@: \(O(n)\).
+
+@'deserialiseValue'@: \(O(n)\).
+-}
instance SerialiseValue LBS.ByteString where
+ -- TODO: Optimise. The performance is \(O(n) + O(n)\) but it could be \(O(n)\).
serialiseValue = serialiseValue . LBS.toStrict
deserialiseValue = B.toLazyByteString . RB.builder
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of bytes
+{- |
+@'serialiseKey'@: \(O(n)\).
+
+@'deserialiseKey'@: \(O(n)\).
+-}
instance SerialiseKey BS.ByteString where
serialiseKey = serialiseKey . SBS.toShort
deserialiseKey = SBS.fromShort . deserialiseKey
instance SerialiseKeyOrderPreserving BS.ByteString
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of bytes
+{- |
+@'serialiseValue'@: \(O(n)\).
+
+@'deserialiseValue'@: \(O(n)\).
+-}
instance SerialiseValue BS.ByteString where
serialiseValue = serialiseValue . SBS.toShort
deserialiseValue = SBS.fromShort . deserialiseValue
--- | \( O(1) \) serialisation, \( O(n) \) deserialisation
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(n)\).
+-}
instance SerialiseKey SBS.ShortByteString where
serialiseKey = RB.fromShortByteString
deserialiseKey = byteArrayToSBS . RB.force
instance SerialiseKeyOrderPreserving SBS.ShortByteString
--- | \( O(1) \) serialisation, \( O(n) \) deserialisation
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(n)\).
+-}
instance SerialiseValue SBS.ShortByteString where
serialiseValue = RB.fromShortByteString
deserialiseValue = byteArrayToSBS . RB.force
@@ -352,12 +488,20 @@ instance SerialiseValue SBS.ShortByteString where
ByteArray
-------------------------------------------------------------------------------}
--- | \( O(1) \) serialisation, \( O(n) \) deserialisation
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(n)\).
+-}
instance SerialiseKey P.ByteArray where
serialiseKey ba = RB.fromByteArray 0 (P.sizeofByteArray ba) ba
deserialiseKey = RB.force
--- | \( O(1) \) serialisation, \( O(n) \) deserialisation
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(n)\).
+-}
instance SerialiseValue P.ByteArray where
serialiseValue ba = RB.fromByteArray 0 (P.sizeofByteArray ba) ba
deserialiseValue = RB.force
@@ -366,22 +510,24 @@ instance SerialiseValue P.ByteArray where
Void
-------------------------------------------------------------------------------}
--- | The 'deserialiseValue' of this instance throws. (as does e.g. 'Word64'
--- instance on invalid input.)
---
--- This instance is useful for tables without blobs.
+{- |
+This instance is intended for tables without blobs.
+
+The implementation of 'deserialiseValue' throws an exception.
+-}
instance SerialiseValue Void where
serialiseValue = absurd
- deserialiseValue = error "deserialiseValue: Void can not be deserialised"
+ deserialiseValue = error "deserialiseValue: cannot deserialise into Void"
{-------------------------------------------------------------------------------
Sum
-------------------------------------------------------------------------------}
--- | An instance for 'Sum' which is transparent to the serialisation of @a@.
---
--- Note: If you want to serialize @Sum a@ differently than @a@, then you should
--- create another @newtype@ over 'Sum' and define your alternative serialization.
+{- |
+An instance for 'Sum' which is transparent to the serialisation of the value type.
+
+__NOTE:__ If you want to serialise @'Sum' a@ differently from @a@, you must use another newtype wrapper.
+-}
instance SerialiseValue a => SerialiseValue (Sum a) where
serialiseValue (Sum v) = serialiseValue v
diff --git a/src/Database/LSMTree/Internal/Snapshot.hs b/src/Database/LSMTree/Internal/Snapshot.hs
index 2b9665d44..d2de362ff 100644
--- a/src/Database/LSMTree/Internal/Snapshot.hs
+++ b/src/Database/LSMTree/Internal/Snapshot.hs
@@ -139,7 +139,7 @@ instance NFData r => NFData (SnapLevel r) where
-- a bit subtle.
--
-- The nominal debt does not need to be stored because it can be derived based
--- on the table's write buffer size (which is stored in the snapshot's
+-- on the table's write buffer capacity (which is stored in the snapshot's
-- TableConfig), and on the level number that the merge is at (which also known
-- from the snapshot structure).
--
diff --git a/src/Database/LSMTree/Internal/Unsafe.hs b/src/Database/LSMTree/Internal/Unsafe.hs
index 97ceedb4a..3679a41c2 100644
--- a/src/Database/LSMTree/Internal/Unsafe.hs
+++ b/src/Database/LSMTree/Internal/Unsafe.hs
@@ -1719,7 +1719,7 @@ supplyUnionCredits resolve t credits = do
Union mt _ -> do
let conf = tableConfig t
let AllocNumEntries x = confWriteBufferAlloc conf
- -- We simply use the write buffer size as merge credit threshold, as
+ -- We simply use the write buffer capacity as merge credit threshold, as
-- the regular level merges also do.
-- TODO: pick a more suitable threshold or make configurable?
let thresh = MR.CreditThreshold (MR.UnspentCredits (MergeCredits x))
diff --git a/test/Test/Util/FS.hs b/test/Test/Util/FS.hs
index d941c235c..384fd9b00 100644
--- a/test/Test/Util/FS.hs
+++ b/test/Test/Util/FS.hs
@@ -264,7 +264,7 @@ assertNumOpenHandles fs m =
--
-- Equality is checked as follows:
-- * Infinite streams are equal: any infinity is as good as another infinity
--- * Finite streams are are checked for pointwise equality on their elements.
+-- * Finite streams are checked for pointwise equality on their elements.
-- * Other streams are trivially unequal: they do not have matching finiteness
--
-- This approximate equality satisfies the __Reflexivity__, __Symmetry__,