diff --git a/README.md b/README.md index d4edc0129..3c8f20d56 100644 --- a/README.md +++ b/README.md @@ -104,37 +104,39 @@ The documentation provides two measures of complexity: The complexities are described in terms of the following variables and constants: -- The variable *n* refers to the number of *physical* table entries. A +- The variable $`n`$ refers to the number of *physical* table entries. A *physical* table entry is any key–operation pair, e.g., `Insert k v` or `Delete k`, whereas a *logical* table entry is determined by all - physical entries with the same key. If the variable *n* is used to + physical entries with the same key. If the variable $`n`$ is used to describe the complexity of an operation that involves multiple tables, it refers to the sum of all table entries. -- The variable *o* refers to the number of open tables and cursors in +- The variable $`o`$ refers to the number of open tables and cursors in the session. -- The variable *s* refers to the number of snapshots in the session. +- The variable $`s`$ refers to the number of snapshots in the session. -- The variable *b* usually refers to the size of a batch of +- The variable $`b`$ usually refers to the size of a batch of inputs/outputs. Its precise meaning is explained for each occurrence. -- The constant *B* refers to the size of the write buffer, which is a - configuration parameter. +- The constant $`B`$ refers to the size of the write buffer, which is + determined by the `TableConfig` parameter `confWriteBufferAlloc`. -- The constant *T* refers to the size ratio of the table, which is a - configuration parameter. +- The constant $`T`$ refers to the size ratio of the table, which is + determined by the `TableConfig` parameter `confSizeRatio`. -- The constant *P* refers to the the average number of key–value pairs +- The constant $`P`$ refers to the average number of key–value pairs that fit in a page of memory. #### Disk I/O cost of operations -The following table summarises the cost of the operations on LSM-trees -measured in the number of disk I/O operations. If the cost depends on -the merge policy or merge schedule, then the table contains one entry -for each relevant combination. Otherwise, the merge policy and/or merge -schedule is listed as N/A. +The following table summarises the worst-case cost of the operations on +LSM-trees measured in the number of disk I/O operations. If the cost +depends on the merge policy or merge schedule, then the table contains +one entry for each relevant combination. Otherwise, the merge policy +and/or merge schedule is listed as N/A. The merge policy and merge +schedule are determined by the `TableConfig` parameters +`confMergePolicy` and `confMergeSchedule`. @@ -143,7 +145,7 @@ schedule is listed as N/A. - + @@ -273,84 +275,377 @@ schedule is listed as N/A.
Operation Merge policy Merge scheduleCost in disk I/O operationsWorst-case disk I/O complexity
-(\*The variable *b* refers to the number of entries retrieved by the +(\*The variable $`b`$ refers to the number of entries retrieved by the range lookup.) -TODO: Document the average-case behaviour of lookups. +#### Table Size -#### In-memory size of tables +The in-memory and the on-disk size of an LSM-tree scale *linearly* with +the number of physical entries. However, the in-memory size is smaller +by a significant factor. Let us look at a table that uses the default +configuration and has 100 million entries with 34 byte keys and 60 byte +values. The total size of 100 million key–value pairs is approximately +8.75GiB. Hence, the on-disk size would be at least 8.75GiB, not counting +the overhead for metadata. -The in-memory size of an LSM-tree is described in terms of the variable -*n*, which refers to the number of *physical* database entries. A -*physical* database entry is any key–operation pair, e.g., `Insert k v` -or `Delete k`, whereas a *logical* database entry is determined by all -physical entries with the same key. +The in-memory size would be approximately 265.39MiB: -The worst-case in-memory size of an LSM-tree is *O*(*n*). +- The write buffer would store at most 20,000 entries, which is + approximately 2.86MiB. -- The worst-case in-memory size of the write buffer is *O*(*B*). +- The fence-pointer indexes would store approximately 2.29 million keys, + which is approximately 9.30MiB. - The maximum size of the write buffer on the write buffer allocation - strategy, which is determined by the `confWriteBufferAlloc` field of - `TableConfig`. Regardless of write buffer allocation strategy, the - size of the write buffer may never exceed 4GiB. +- The Bloom filters would use 15.78 bits per entry, which is + approximately 188.11MiB. - `AllocNumEntries maxEntries` - The maximum size of the write buffer is the maximum number of entries - multiplied by the average size of a key–operation pair. +For a discussion of how the sizes of these components are determined by +the table configuration, see [Fine-tuning Table +Configuration](#fine_tuning "#fine_tuning"). -- The worst-case in-memory size of the Bloom filters is *O*(*n*). - - The total in-memory size of all Bloom filters is the number of bits - per physical entry multiplied by the number of physical entries. The - required number of bits per physical entry is determined by the Bloom - filter allocation strategy, which is determined by the - `confBloomFilterAlloc` field of `TableConfig`. - - `AllocFixed bitsPerPhysicalEntry` - The number of bits per physical entry is specified as - `bitsPerPhysicalEntry`. - - `AllocRequestFPR requestedFPR` - The number of bits per physical entry is determined by the requested - false-positive rate, which is specified as `requestedFPR`. - - The false-positive rate scales exponentially with the number of bits - per entry: - - | False-positive rate | Bits per entry | - |---------------------|----------------| - | 1 in 10 |  ≈ 4.77 | - | 1 in 100 |  ≈ 9.85 | - | 1 in 1, 000 |  ≈ 15.79 | - | 1 in 10, 000 |  ≈ 22.58 | - | 1 in 100, 000 |  ≈ 30.22 | - -- The worst-case in-memory size of the indexes is *O*(*n*). - - The total in-memory size of all indexes depends on the index type, - which is determined by the `confFencePointerIndex` field of - `TableConfig`. The in-memory size of the various indexes is described - in reference to the size of the database in [*memory - pages*](https://en.wikipedia.org/wiki/Page_%28computer_memory%29 "https://en.wikipedia.org/wiki/Page_%28computer_memory%29"). 
- - `OrdinaryIndex`
- An ordinary index stores the maximum serialised key for each memory
- page. The total in-memory size of all indexes is proportional to the
- average size of one serialised key per memory page.
- - `CompactIndex`
- A compact index stores the 64 most significant bits of the minimum
- serialised key for each memory page, as well as 1 bit per memory page
- to resolve clashes, 1 bit per memory page to mark overflow pages, and
- a negligible amount of memory for tie breakers. The total in-memory
- size of all indexes is approximately 66 bits per memory page.
-
-The total size of an LSM-tree must not exceed 241 physical
+The total size of an LSM-tree must not exceed $`2^{41}`$ physical
entries. Violation of this condition *is* checked and will throw a
`TableTooLargeError`.

-### Implementation
+#### Fine-tuning Table Configuration
+
+`confMergePolicy`
+The *merge policy* balances the performance of lookups against the
+performance of updates. Levelling favours lookups. Tiering favours
+updates. Lazy levelling strikes a middle ground between levelling and
+tiering, and moderately favours updates. This parameter is explicitly
+referenced in the documentation of those operations it affects.
+
+`confSizeRatio`
+The *size ratio* pushes the effects of the merge policy to the extreme.
+If the size ratio is higher, levelling favours lookups more, and tiering
+and lazy levelling favour updates more. This parameter is referred to as
+$`T`$ in the disk I/O cost of operations.
+
+`confWriteBufferAlloc`
+The *write buffer capacity* balances the performance of lookups and
+updates against the in-memory size of the table. If the write buffer is
+larger, it takes up more memory, but lookups and updates are more
+efficient. This parameter is referred to as $`B`$ in the disk I/O cost
+of operations. Irrespective of this parameter, the write buffer size
+cannot exceed 4GiB.
+
+`confMergeSchedule`
+The *merge schedule* balances the performance of lookups and updates
+against the smooth performance of updates. The merge schedule does not
+affect the performance of table unions. With the one-shot merge
+schedule, lookups and updates are more efficient overall, but some
+updates may take much longer than others. With the incremental merge
+schedule, lookups and updates are less efficient overall, but each
+update does a similar amount of work. This parameter is explicitly
+referenced in the documentation of those operations it affects.
+
+`confBloomFilterAlloc`
+The Bloom filter size balances the performance of lookups against the
+in-memory size of the table. If the Bloom filters are larger, they take
+up more memory, but lookup operations are more efficient.
+
+`confFencePointerIndex`
+The *fence-pointer index type* supports two types of indexes. The
+*ordinary* indexes are designed to work with any key. The *compact*
+indexes are optimised for the case where the keys in the database are
+uniformly distributed, e.g., when the keys are hashes.
+
+`confDiskCachePolicy`
+The *disk cache policy* determines if lookup operations use the OS page
+cache. Caching may improve the performance of lookups if database access
+follows certain patterns.
+
+##### Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size
+
+The configuration parameters `confMergePolicy`, `confSizeRatio`, and
+`confWriteBufferAlloc` affect how the table organises its data. To
+understand what effect these parameters have, one must have a basic
+understanding of how an LSM-tree stores its data.
+The physical entries in an LSM-tree are key–operation pairs, which pair
+a key with an operation such as an `Insert` with a value or a `Delete`.
+These key–operation pairs are organised into *runs*, which are sequences
+of key–operation pairs sorted by their key. Runs are organised into
+*levels*, which are unordered sequences of runs. Levels are organised
+hierarchically. Level 0 is kept in memory, and is referred to as the
+*write buffer*. All subsequent levels are stored on disk, with each run
+stored in its own file. The following shows an example LSM-tree layout,
+with each run as a boxed sequence of keys and each level as a row.
+
+``` math
+
+\begin{array}{l:l}
+\text{Level}
+&
+\text{Data}
+\\
+0
+&
+\fbox{\(\texttt{4}\,\_\)}
+\\
+1
+&
+\fbox{\(\texttt{1}\,\texttt{3}\)}
+\quad
+\fbox{\(\texttt{2}\,\texttt{7}\)}
+\\
+2
+&
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+\end{array}
+```
+
+The data in an LSM-tree is *partially sorted*: only the key–operation
+pairs within each run are sorted and deduplicated. As a rule of thumb,
+keeping more of the data sorted means lookup operations are faster but
+update operations are slower.
+
+The configuration parameters `confMergePolicy`, `confSizeRatio`, and
+`confWriteBufferAlloc` directly affect a table's data layout. The
+parameter `confWriteBufferAlloc` determines the capacity of the write
+buffer.
+
+`AllocNumEntries maxEntries`
+The write buffer can contain at most `maxEntries` entries. The constant
+$`B`$ refers to the value of `maxEntries`. Irrespective of this
+parameter, the write buffer size cannot exceed 4GiB.
+
+The parameter `confSizeRatio` determines the ratio between the
+capacities of successive levels. The constant $`T`$ refers to the value
+of `confSizeRatio`. For instance, if $`B = 2`$ and $`T = 2`$, then
+
+``` math
+
+\begin{array}{l:l}
+\text{Level} & \text{Capacity}
+\\
+0 & B \cdot T^0 = 2
+\\
+1 & B \cdot T^1 = 4
+\\
+2 & B \cdot T^2 = 8
+\\
+\ell & B \cdot T^\ell
+\end{array}
+```
+
+The merge policy `confMergePolicy` determines the number of runs per
+level. In a *tiering* LSM-tree, each level contains $`T`$ runs. In a
+*levelling* LSM-tree, each level contains a single run. The *lazy
+levelling* policy uses levelling only for the last level and uses
+tiering for all preceding levels. The previous example used lazy
+levelling. The following examples illustrate the different merge
+policies using the same data, assuming $`B = 2`$ and $`T = 2`$.
+
+``` math
+
+\begin{array}{l:l:l:l}
+\text{Level}
+&
+\text{Tiering}
+&
+\text{Levelling}
+&
+\text{Lazy Levelling}
+\\
+0
+&
+\fbox{\(\texttt{4}\,\_\)}
+&
+\fbox{\(\texttt{4}\,\_\)}
+&
+\fbox{\(\texttt{4}\,\_\)}
+\\
+1
+&
+\fbox{\(\texttt{1}\,\texttt{3}\)}
+\quad
+\fbox{\(\texttt{2}\,\texttt{7}\)}
+&
+\fbox{\(\texttt{1}\,\texttt{2}\,\texttt{3}\,\texttt{7}\)}
+&
+\fbox{\(\texttt{1}\,\texttt{3}\)}
+\quad
+\fbox{\(\texttt{2}\,\texttt{7}\)}
+\\
+2
+&
+\fbox{\(\texttt{4}\,\texttt{5}\,\texttt{7}\,\texttt{8}\)}
+\quad
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{9}\)}
+&
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+&
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+\end{array}
+```
+
+Tiering favours the performance of updates. Levelling favours the
+performance of lookups. Lazy levelling strikes a middle ground between
+tiering and levelling. It favours the performance of lookup operations
+for the oldest data and enables more deduplication, without the impact
+that full levelling has on update operations.
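+
+As a minimal sketch of how these data-layout parameters are set in
+practice, the fragment below overrides them on `defaultTableConfig`
+using record-update syntax. It assumes that the configuration type and
+its fields are in scope from `Database.LSMTree`; the chosen values are
+only illustrative, not recommendations.
+
+``` haskell
+import Database.LSMTree (MergePolicy (LazyLevelling), SizeRatio (Four),
+                         TableConfig (..), WriteBufferAlloc (AllocNumEntries),
+                         defaultTableConfig)
+
+-- A layout tuned towards updates: keep the default lazy levelling policy
+-- and size ratio, but give the write buffer five times the default
+-- capacity of 20,000 entries before it is flushed to disk.
+updateHeavyConfig :: TableConfig
+updateHeavyConfig = defaultTableConfig
+  { confMergePolicy      = LazyLevelling
+  , confSizeRatio        = Four
+  , confWriteBufferAlloc = AllocNumEntries 100000
+  }
+```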
+
+##### Fine-tuning: Merge Schedule
+
+The configuration parameter `confMergeSchedule` affects the worst-case
+performance of lookup and update operations and the structure of runs.
+Regardless of the merge schedule, the amortised disk I/O complexity of
+lookups and updates is *logarithmic* in the size of the table. When the
+write buffer fills up, its contents are flushed to disk as a run and
+added to level 1. When some level fills up, its contents are flushed
+down to the next level. Eventually, as data is flushed down, runs must
+be merged. This package supports two schedules for merging:
+
+- Using the `OneShot` merge schedule, runs must always be kept fully
+  sorted and deduplicated. However, flushing a run down to the next
+  level may cause the next level to fill up, in which case it too must
+  be flushed and merged further down. In the worst case, this can
+  cascade down the entire table. Consequently, the worst-case disk I/O
+  complexity of updates is *linear* in the size of the table. This is
+  unsuitable for real-time systems and other use cases where
+  unresponsiveness is unacceptable.
+
+- Using the `Incremental` merge schedule, runs can be *partially
+  merged*, which allows the merging work to be spread out evenly across
+  all update operations. This aligns the worst-case and average-case
+  disk I/O complexity of updates—both are *logarithmic* in the size of
+  the table. The cost is a small constant overhead for both lookup and
+  update operations.
+
+The merge schedule does not affect the performance of table unions.
+Instead, there are separate operations for one-shot and incremental
+unions. The amortised disk I/O complexity of one-shot unions is *linear*
+in the size of the tables. For incremental unions, it is up to the user
+to spread the merging work out evenly over time.
+
+##### Fine-tuning: Bloom Filter Size
+
+The configuration parameter `confBloomFilterAlloc` affects the size of
+the Bloom filters, which balances the performance of lookups against the
+in-memory size of the table.
+
+Tables maintain a [Bloom
+filter](https://en.wikipedia.org/wiki/Bloom_filter "https://en.wikipedia.org/wiki/Bloom_filter")
+in memory for each run on disk. These Bloom filters are probabilistic
+data structures that are used to track which keys are present in their
+corresponding run. Querying a Bloom filter returns either "maybe",
+meaning the key is possibly in the run, or "no", meaning the key is
+definitely not in the run. When a query returns "maybe" while the key is
+*not* in the run, this is referred to as a *false positive*. While the
+database executes a lookup operation, any Bloom filter query that
+returns a false positive causes the database to unnecessarily read a run
+from disk. The probability of these spurious reads follows a [binomial
+distribution](https://en.wikipedia.org/wiki/Binomial_distribution "https://en.wikipedia.org/wiki/Binomial_distribution")
+$`\text{Binomial}(r,\text{FPR})`$ where $`r`$ refers to the number of
+runs and $`\text{FPR}`$ refers to the false-positive rate of the Bloom
+filters. Hence, the expected number of spurious reads for each lookup
+operation is $`r\cdot\text{FPR}`$. The number of runs $`r`$ grows with
+the number of physical entries in the table. Its exact value depends on
+the merge policy of the table:
+
+`LazyLevelling`
+$`r = T (\log_T\frac{n}{B} - 1) + 1`$.
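+
+The following throwaway Haskell sketch simply evaluates this run-count
+formula together with the expected number of spurious reads per lookup;
+it is purely illustrative and not part of the library API.
+
+``` haskell
+-- Expected spurious reads per lookup for a lazy-levelling table,
+-- following the formula r = T(log_T(n/B) - 1) + 1 quoted above.
+expectedSpuriousReads :: Double -> Double -> Double -> Double -> Double
+expectedSpuriousReads n b t fpr = runs * fpr
+  where
+    runs = t * (logBase t (n / b) - 1) + 1
+
+-- With 100 million physical entries, the default write-buffer capacity
+-- of 20,000 entries, size ratio 4, and a false-positive rate of 1e-3,
+-- the formula gives roughly 21.6 runs, i.e. about 0.02 spurious reads
+-- per lookup: expectedSpuriousReads 1e8 20000 4 1e-3 ≈ 2.2e-2.
+```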
+
+The false-positive rate scales exponentially with the size of the Bloom
+filters in bits per entry.
+
+| False-positive rate (FPR) | Bits per entry (BPE) |
+|---------------------------|----------------------|
+| $`1\text{ in }10`$ | $`\approx 4.77 `$ |
+| $`1\text{ in }100`$ | $`\approx 9.85 `$ |
+| $`1\text{ in }1{,}000`$ | $`\approx 15.78 `$ |
+| $`1\text{ in }10{,}000`$ | $`\approx 22.57 `$ |
+| $`1\text{ in }100{,}000`$ | $`\approx 30.22 `$ |
+
+The configuration parameter `confBloomFilterAlloc` can be specified in
+two ways:
+
+`AllocFixed bitsPerEntry`
+Allocate the requested number of bits per entry in the table.
+
+The value must be strictly positive, but fractional values are
+permitted. The recommended range is $`[2, 24]`$.
+
+`AllocRequestFPR falsePositiveRate`
+Allocate the required number of bits per entry to get the requested
+false-positive rate.
+
+The value must be in the range $`(0, 1)`$. The recommended range is
+$`[1\mathrm{e}{ -5 },1\mathrm{e}{ -2 }]`$.
+
+The total in-memory size of all Bloom filters scales *linearly* with the
+number of physical entries in the table and is determined by the number
+of physical entries multiplied by the number of bits per physical entry,
+i.e., $`n\cdot\text{BPE}`$. Let us consider a table with 100 million
+physical entries that uses the default table configuration for every
+parameter other than the Bloom filter size.
+
+| False-positive rate (FPR) | Bloom filter size | Expected spurious reads per lookup |
+|----|----|----|
+| $`1\text{ in }10`$ | $` 56.86\text{MiB}`$ | $` 2.56\text{ spurious reads every lookup }`$ |
+| $`1\text{ in }100`$ | $`117.42\text{MiB}`$ | $` 1 \text{ spurious read every } 3.91\text{ lookups }`$ |
+| $`1\text{ in }1{,}000`$ | $`188.11\text{MiB}`$ | $` 1 \text{ spurious read every } 39.10\text{ lookups }`$ |
+| $`1\text{ in }10{,}000`$ | $`269.06\text{MiB}`$ | $` 1 \text{ spurious read every } 391.01\text{ lookups }`$ |
+| $`1\text{ in }100{,}000`$ | $`360.25\text{MiB}`$ | $` 1 \text{ spurious read every } 3910.19\text{ lookups }`$ |
+
+##### Fine-tuning: Fence-Pointer Index Type
+
+The configuration parameter `confFencePointerIndex` affects the type and
+size of the fence-pointer indexes. Tables maintain a fence-pointer index
+in memory for each run on disk. These fence-pointer indexes store the
+keys at the boundaries of each page of memory to ensure that each lookup
+has to read at most one page of memory from each run. Tables support two
+types of fence-pointer indexes:
+
+`OrdinaryIndex`
+Ordinary indexes are designed for any use case.
+
+Ordinary indexes store one serialised key per page of memory. The total
+in-memory size of all indexes is $`K \cdot \frac{n}{P}`$ bits, where
+$`K`$ refers to the average size of a serialised key in bits.
+
+`CompactIndex`
+Compact indexes are designed for the use case where the keys in the
+table are uniformly distributed, such as when using hashes.
+
+Compact indexes store the 64 most significant bits of the minimum
+serialised key of each page of memory. This requires that serialised
+keys are *at least* 64 bits in size. Compact indexes store 1 additional
+bit per page of memory to resolve collisions, 1 additional bit per page
+of memory to mark entries that are larger than one page, and a
+negligible amount of memory for tie breakers. The total in-memory size
+of all indexes is $`66 \cdot \frac{n}{P}`$ bits.
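+
+As an illustrative sketch (not a recommendation), switching to compact
+indexes is a one-field change on `defaultTableConfig`, again assuming
+the configuration type and its fields are in scope from
+`Database.LSMTree`:
+
+``` haskell
+import Database.LSMTree (FencePointerIndexType (CompactIndex),
+                         TableConfig (..), defaultTableConfig)
+
+-- Opt into compact fence-pointer indexes. This is only appropriate when
+-- serialised keys are uniformly distributed and at least 64 bits long,
+-- for example 32-byte cryptographic hashes.
+hashKeyedConfig :: TableConfig
+hashKeyedConfig = defaultTableConfig { confFencePointerIndex = CompactIndex }
+```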
+
+##### Fine-tuning: Disk Cache Policy
+
+The configuration parameter `confDiskCachePolicy` determines how the
+database uses the OS page cache. This may improve performance if the
+database's *access pattern* has good *temporal locality* or good
+*spatial locality*. The database's access pattern refers to the pattern
+by which entries are accessed by lookup operations. An access pattern
+has good temporal locality if it is likely to access entries that were
+recently accessed or updated. An access pattern has good spatial
+locality if it is likely to access entries that have nearby keys.
+
+- Use the `DiskCacheAll` policy if the database's access pattern has
+  either good spatial locality or both good spatial and temporal
+  locality.
+
+- Use the `DiskCacheLevelOneTo l` policy if the database's access
+  pattern has good temporal locality for updates only. The variable `l`
+  determines the number of levels that are cached. For a description of
+  levels, see [Merge Policy, Size Ratio, and Write Buffer
+  Size](#fine_tuning_data_layout "#fine_tuning_data_layout"). With this
+  setting, the database can be expected to cache up to $`\frac{k}{P}`$
+  pages of memory, where $`k`$ refers to the number of entries that fit
+  in levels $`[1,l]`$ and is defined as $`\sum_{i=1}^{l}BT^{i}`$.
+
+- Use the `DiskCacheNone` policy if the database's access pattern does
+  not have good spatial or temporal locality. For instance, if the
+  access pattern is uniformly random.
+
+### References

 The implementation of LSM-trees in this package draws inspiration from:
diff --git a/bench/macro/lsm-tree-bench-wp8.hs b/bench/macro/lsm-tree-bench-wp8.hs
index bdcae3a01..cb6485349 100644
--- a/bench/macro/lsm-tree-bench-wp8.hs
+++ b/bench/macro/lsm-tree-bench-wp8.hs
@@ -227,7 +227,7 @@ cmdP = O.subparser $ mconcat
 setupOptsP :: O.Parser SetupOpts
 setupOptsP = pure SetupOpts
-  <*> O.option O.auto (O.long "bloom-filter-alloc" <> O.value LSM.defaultBloomFilterAlloc <> O.showDefault <> O.help "Bloom filter allocation method [AllocFixed n | AllocRequestFPR d]")
+  <*> O.option O.auto (O.long "bloom-filter-alloc" <> O.value (LSM.confBloomFilterAlloc LSM.defaultTableConfig) <> O.showDefault <> O.help "Bloom filter allocation method [AllocFixed n | AllocRequestFPR d]")

 runOptsP :: O.Parser RunOpts
 runOptsP = pure RunOpts
diff --git a/lsm-tree.cabal b/lsm-tree.cabal
index 5454c2f99..b113ee8db 100644
--- a/lsm-tree.cabal
+++ b/lsm-tree.cabal
@@ -71,18 +71,21 @@ description:
   * The variable \(s\) refers to the number of snapshots in the session.
   * The variable \(b\) usually refers to the size of a batch of inputs\/outputs. Its precise meaning is explained for each occurrence.
-  * The constant \(B\) refers to the size of the write buffer, which is a configuration parameter.
-  * The constant \(T\) refers to the size ratio of the table, which is a configuration parameter.
-  * The constant \(P\) refers to the the average number of key–value pairs that fit in a page of memory.
+  * The constant \(B\) refers to the size of the write buffer,
+    which is determined by the @TableConfig@ parameter @confWriteBufferAlloc@.
+  * The constant \(T\) refers to the size ratio of the table,
+    which is determined by the @TableConfig@ parameter @confSizeRatio@.
+  * The constant \(P\) refers to the average number of key–value pairs that fit in a page of memory.

  === Disk I\/O cost of operations #performance_time#

-  The following table summarises the cost of the operations on LSM-trees measured in the number of disk I\/O operations.
+  The following table summarises the worst-case cost of the operations on LSM-trees measured in the number of disk I\/O operations.
If the cost depends on the merge policy or merge schedule, then the table contains one entry for each relevant combination. Otherwise, the merge policy and\/or merge schedule is listed as N\/A. + The merge policy and merge schedule are determined by the @TableConfig@ parameters @confMergePolicy@ and @confMergeSchedule@. +----------+------------------------+-----------------+-----------------+------------------------------------------------+ - | Resource | Operation | Merge policy | Merge schedule | Cost in disk I\/O operations | + | Resource | Operation | Merge policy | Merge schedule | Worst-case disk I\/O complexity | +==========+========================+=================+=================+================================================+ | Session | Create\/Open | N\/A | N\/A | \(O(1)\) | +----------+------------------------+-----------------+-----------------+------------------------------------------------+ @@ -121,65 +124,312 @@ description: (*The variable \(b\) refers to the number of entries retrieved by the range lookup.) - TODO: Document the average-case behaviour of lookups. + === Table Size #performance_size# - === In-memory size of tables #performance_size# + The in-memory and the on-disk size of an LSM-tree scale /linearly/ with the number of physical entries. + However, the in-memory size is smaller by a significant factor. + Let us look at a table that uses the default configuration and has 100 million entries with 34 byte keys and 60 byte values. + The total size of 100 million key–value pairs is approximately 8.75GiB. + Hence, the on-disk size would be at least 8.75GiB, not counting the overhead for metadata. - The in-memory size of an LSM-tree is described in terms of the variable \(n\), which refers to the number of /physical/ database entries. - A /physical/ database entry is any key–operation pair, e.g., @Insert k v@ or @Delete k@, whereas a /logical/ database entry is determined by all physical entries with the same key. + The in-memory size would be approximately 265.39MiB: - The worst-case in-memory size of an LSM-tree is \(O(n)\). + * The write buffer would store at most 20,000 entries, which is approximately 2.86MiB. + * The fence-pointer indexes would store approximately 2.29 million keys, which is approximately 9.30MiB. + * The Bloom filters would use 15.78 bits per entry, which is approximately 188.11MiB. - * The worst-case in-memory size of the write buffer is \(O(B)\). - - The maximum size of the write buffer on the write buffer allocation strategy, which is determined by the @confWriteBufferAlloc@ field of @TableConfig@. - Regardless of write buffer allocation strategy, the size of the write buffer may never exceed 4GiB. - - [@AllocNumEntries maxEntries@]: - The maximum size of the write buffer is the maximum number of entries multiplied by the average size of a key–operation pair. - - * The worst-case in-memory size of the Bloom filters is \(O(n)\). - - The total in-memory size of all Bloom filters is the number of bits per physical entry multiplied by the number of physical entries. - The required number of bits per physical entry is determined by the Bloom filter allocation strategy, which is determined by the @confBloomFilterAlloc@ field of @TableConfig@. - - [@AllocFixed bitsPerPhysicalEntry@]: - The number of bits per physical entry is specified as @bitsPerPhysicalEntry@. - [@AllocRequestFPR requestedFPR@]: - The number of bits per physical entry is determined by the requested false-positive rate, which is specified as @requestedFPR@. 
- - The false-positive rate scales exponentially with the number of bits per entry: - - +---------------------------+---------------------+ - | False-positive rate | Bits per entry | - +===========================+=====================+ - | \(1\text{ in }10\) | \(\approx 4.77 \) | - +---------------------------+---------------------+ - | \(1\text{ in }100\) | \(\approx 9.85 \) | - +---------------------------+---------------------+ - | \(1\text{ in }1{,}000\) | \(\approx 15.79 \) | - +---------------------------+---------------------+ - | \(1\text{ in }10{,}000\) | \(\approx 22.58 \) | - +---------------------------+---------------------+ - | \(1\text{ in }100{,}000\) | \(\approx 30.22 \) | - +---------------------------+---------------------+ - - * The worst-case in-memory size of the indexes is \(O(n)\). - - The total in-memory size of all indexes depends on the index type, which is determined by the @confFencePointerIndex@ field of @TableConfig@. - The in-memory size of the various indexes is described in reference to the size of the database in [/memory pages/](https://en.wikipedia.org/wiki/Page_%28computer_memory%29). - - [@OrdinaryIndex@]: - An ordinary index stores the maximum serialised key for each memory page. - The total in-memory size of all indexes is proportional to the average size of one serialised key per memory page. - [@CompactIndex@]: - A compact index stores the 64 most significant bits of the minimum serialised key for each memory page, as well as 1 bit per memory page to resolve clashes, 1 bit per memory page to mark overflow pages, and a negligible amount of memory for tie breakers. - The total in-memory size of all indexes is approximately 66 bits per memory page. + For a discussion of how the sizes of these components are determined by the table configuration, see [Fine-tuning Table Configuration](#fine_tuning). The total size of an LSM-tree must not exceed \(2^{41}\) physical entries. Violation of this condition /is/ checked and will throw a 'TableTooLargeError'. - == Implementation + === Fine-tuning Table Configuration #fine_tuning# + + [@confMergePolicy@] + The /merge policy/ balances the performance of lookups against the performance of updates. + Levelling favours lookups. + Tiering favours updates. + Lazy levelling strikes a middle ground between levelling and tiering, and moderately favours updates. + This parameter is explicitly referenced in the documentation of those operations it affects. + + [@confSizeRatio@] + The /size ratio/ pushes the effects of the merge policy to the extreme. + If the size ratio is higher, levelling favours lookups more, and tiering and lazy levelling favour updates more. + This parameter is referred to as \(T\) in the disk I\/O cost of operations. + + [@confWriteBufferAlloc@] + The /write buffer capacity/ balances the performance of lookups and updates against the in-memory size of the table. + If the write buffer is larger, it takes up more memory, but lookups and updates are more efficient. + This parameter is referred to as \(B\) in the disk I\/O cost of operations. + Irrespective of this parameter, the write buffer size cannot exceed 4GiB. + + [@confMergeSchedule@] + The /merge schedule/ balances the performance of lookups and updates against the smooth performance of updates. + The merge schedule does not affect the performance of table unions. + With the one-shot merge schedule, lookups and updates are more efficient overall, but some updates may take much longer than others. 
+ With the incremental merge schedule, lookups and updates are less efficient overall, but each update does a similar amount of work. + This parameter is explicitly referenced in the documentation of those operations it affects. + + [@confBloomFilterAlloc@] + The Bloom filter size balances the performance of lookups against the in-memory size of the table. + If the Bloom filters are larger, they take up more memory, but lookup operations are more efficient. + + [@confFencePointerIndex@] + The /fence-pointer index type/ supports two types of indexes. + The /ordinary/ indexes are designed to work with any key. + The /compact/ indexes are optimised for the case where the keys in the database are uniformly distributed, e.g., when the keys are hashes. + + [@confDiskCachePolicy@] + The /disk cache policy/ determines if lookup operations use the OS page cache. + Caching may improve the performance of lookups if database access follows certain patterns. + + ==== Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size #fine_tuning_data_layout# + + The configuration parameters @confMergePolicy@, @confSizeRatio@, and @confWriteBufferAlloc@ affect how the table organises its data. + To understand what effect these parameters have, one must have a basic understand of how an LSM-tree stores its data. + The physical entries in an LSM-tree are key–operation pairs, which pair a key with an operation such as an @Insert@ with a value or a @Delete@. + These key–operation pairs are organised into /runs/, which are sequences of key–operation pairs sorted by their key. + Runs are organised into /levels/, which are unordered sequences or runs. + Levels are organised hierarchically. + Level 0 is kept in memory, and is referred to as the /write buffer/. + All subsequent levels are stored on disk, with each run stored in its own file. + The following shows an example LSM-tree layout, with each run as a boxed sequence of keys and each level as a row. + + \[ + \begin{array}{l:l} + \text{Level} + & + \text{Data} + \\ + 0 + & + \fbox{\(\texttt{4}\,\_\)} + \\ + 1 + & + \fbox{\(\texttt{1}\,\texttt{3}\)} + \quad + \fbox{\(\texttt{2}\,\texttt{7}\)} + \\ + 2 + & + \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)} + \end{array} + \] + + The data in an LSM-tree is /partially sorted/: only the key–operation pairs within each run are sorted and deduplicated. + As a rule of thumb, keeping more of the data sorted means lookup operations are faster but update operations are slower. + + The configuration parameters @confMergePolicy@, @confSizeRatio@, and @confWriteBufferAlloc@ directly affect a table's data layout. + The parameter @confWriteBufferAlloc@ determines the capacity of the write buffer. + + [@AllocNumEntries maxEntries@]: + The write buffer can contain at most @maxEntries@ entries. + The constant \(B\) refers to the value of @maxEntries@. + Irrespective of this parameter, the write buffer size cannot exceed 4GiB. + + The parameter @confSizeRatio@ determines the ratio between the capacities of successive levels. + The constant \(T\) refers to the value of @confSizeRatio@. + For instance, if \(B = 2\) and \(T = 2\), then + + \[ + \begin{array}{l:l} + \text{Level} & \text{Capacity} + \\ + 0 & B \cdot T^0 = 2 + \\ + 1 & B \cdot T^1 = 4 + \\ + 2 & B \cdot T^2 = 8 + \\ + \ell & B \cdot T^\ell + \end{array} + \] + + The merge policy @confMergePolicy@ determines the number of runs per level. + In a /tiering/ LSM-tree, each level contains \(T\) runs. 
+  In a /levelling/ LSM-tree, each level contains a single run.
+  The /lazy levelling/ policy uses levelling only for the last level and uses tiering for all preceding levels.
+  The previous example used lazy levelling.
+  The following examples illustrate the different merge policies using the same data, assuming \(B = 2\) and \(T = 2\).
+
+  \[
+  \begin{array}{l:l:l:l}
+  \text{Level}
+  &
+  \text{Tiering}
+  &
+  \text{Levelling}
+  &
+  \text{Lazy Levelling}
+  \\
+  0
+  &
+  \fbox{\(\texttt{4}\,\_\)}
+  &
+  \fbox{\(\texttt{4}\,\_\)}
+  &
+  \fbox{\(\texttt{4}\,\_\)}
+  \\
+  1
+  &
+  \fbox{\(\texttt{1}\,\texttt{3}\)}
+  \quad
+  \fbox{\(\texttt{2}\,\texttt{7}\)}
+  &
+  \fbox{\(\texttt{1}\,\texttt{2}\,\texttt{3}\,\texttt{7}\)}
+  &
+  \fbox{\(\texttt{1}\,\texttt{3}\)}
+  \quad
+  \fbox{\(\texttt{2}\,\texttt{7}\)}
+  \\
+  2
+  &
+  \fbox{\(\texttt{4}\,\texttt{5}\,\texttt{7}\,\texttt{8}\)}
+  \quad
+  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{9}\)}
+  &
+  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+  &
+  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+  \end{array}
+  \]
+
+  Tiering favours the performance of updates.
+  Levelling favours the performance of lookups.
+  Lazy levelling strikes a middle ground between tiering and levelling.
+  It favours the performance of lookup operations for the oldest data and enables more deduplication,
+  without the impact that full levelling has on update operations.
+
+  ==== Fine-tuning: Merge Schedule #fine_tuning_merge_schedule#
+
+  The configuration parameter @confMergeSchedule@ affects the worst-case performance of lookup and update operations and the structure of runs.
+  Regardless of the merge schedule, the amortised disk I\/O complexity of lookups and updates is /logarithmic/ in the size of the table.
+  When the write buffer fills up, its contents are flushed to disk as a run and added to level 1.
+  When some level fills up, its contents are flushed down to the next level.
+  Eventually, as data is flushed down, runs must be merged.
+  This package supports two schedules for merging:
+
+  * Using the @OneShot@ merge schedule, runs must always be kept fully sorted and deduplicated.
+    However, flushing a run down to the next level may cause the next level to fill up,
+    in which case it too must be flushed and merged further down.
+    In the worst case, this can cascade down the entire table.
+    Consequently, the worst-case disk I\/O complexity of updates is /linear/ in the size of the table.
+    This is unsuitable for real-time systems and other use cases where unresponsiveness is unacceptable.
+  * Using the @Incremental@ merge schedule, runs can be /partially merged/, which allows the merging work to be spread out evenly across all update operations.
+    This aligns the worst-case and average-case disk I\/O complexity of updates—both are /logarithmic/ in the size of the table.
+    The cost is a small constant overhead for both lookup and update operations.
+
+  The merge schedule does not affect the performance of table unions.
+  Instead, there are separate operations for one-shot and incremental unions.
+  The amortised disk I\/O complexity of one-shot unions is /linear/ in the size of the tables.
+  For incremental unions, it is up to the user to spread the merging work out evenly over time.
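+
+  As a purely illustrative sketch, the merge schedule is selected through the table configuration,
+  assuming @TableConfig@ and its fields are in scope from "Database.LSMTree":
+
+  @
+  oneShotConfig :: TableConfig
+  oneShotConfig = defaultTableConfig { confMergeSchedule = OneShot }
+  @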
+
+  ==== Fine-tuning: Bloom Filter Size #fine_tuning_bloom_filter_size#
+
+  The configuration parameter @confBloomFilterAlloc@ affects the size of the Bloom filters,
+  which balances the performance of lookups against the in-memory size of the table.
+
+  Tables maintain a [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) in memory for each run on disk.
+  These Bloom filters are probabilistic data structures that are used to track which keys are present in their corresponding run.
+  Querying a Bloom filter returns either \"maybe\", meaning the key is possibly in the run, or \"no\", meaning the key is definitely not in the run.
+  When a query returns \"maybe\" while the key is /not/ in the run, this is referred to as a /false positive/.
+  While the database executes a lookup operation, any Bloom filter query that returns a false positive causes the database to unnecessarily read a run from disk.
+  The probability of these spurious reads follows a [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution) \(\text{Binomial}(r,\text{FPR})\)
+  where \(r\) refers to the number of runs and \(\text{FPR}\) refers to the false-positive rate of the Bloom filters.
+  Hence, the expected number of spurious reads for each lookup operation is \(r\cdot\text{FPR}\).
+  The number of runs \(r\) grows with the number of physical entries in the table. Its exact value depends on the merge policy of the table:
+
+  [@LazyLevelling@]
+  \(r = T (\log_T\frac{n}{B} - 1) + 1\).
+
+  The false-positive rate scales exponentially with the size of the Bloom filters in bits per entry.
+
+  +---------------------------+----------------------+
+  | False-positive rate (FPR) | Bits per entry (BPE) |
+  +===========================+======================+
+  | \(1\text{ in }10\)        | \(\approx 4.77 \)    |
+  +---------------------------+----------------------+
+  | \(1\text{ in }100\)       | \(\approx 9.85 \)    |
+  +---------------------------+----------------------+
+  | \(1\text{ in }1{,}000\)   | \(\approx 15.78 \)   |
+  +---------------------------+----------------------+
+  | \(1\text{ in }10{,}000\)  | \(\approx 22.57 \)   |
+  +---------------------------+----------------------+
+  | \(1\text{ in }100{,}000\) | \(\approx 30.22 \)   |
+  +---------------------------+----------------------+
+
+  The configuration parameter @confBloomFilterAlloc@ can be specified in two ways:
+
+  [@AllocFixed bitsPerEntry@]
+  Allocate the requested number of bits per entry in the table.
+
+  The value must be strictly positive, but fractional values are permitted.
+  The recommended range is \([2, 24]\).
+
+  [@AllocRequestFPR falsePositiveRate@]
+  Allocate the required number of bits per entry to get the requested false-positive rate.
+
+  The value must be in the range \((0, 1)\).
+  The recommended range is \([1\mathrm{e}{ -5 },1\mathrm{e}{ -2 }]\).
+
+  The total in-memory size of all Bloom filters scales /linearly/ with the number of physical entries in the table and is determined by the number of physical entries multiplied by the number of bits per physical entry, i.e., \(n\cdot\text{BPE}\).
+  Let us consider a table with 100 million physical entries that uses the default table configuration for every parameter other than the Bloom filter size.
+ + +---------------------------+----------------------+------------------------------------------------------------------+ + | False-positive rate (FPR) | Bloom filter size | Expected spurious reads per lookup | + +===========================+======================+==================================================================+ + | \(1\text{ in }10\) | \( 56.86\text{MiB}\) | \( 2.56\text{ spurious reads every lookup }\) | + +---------------------------+----------------------+------------------------------------------------------------------+ + | \(1\text{ in }100\) | \(117.42\text{MiB}\) | \( 1 \text{ spurious read every } 3.91\text{ lookups }\) | + +---------------------------+----------------------+------------------------------------------------------------------+ + | \(1\text{ in }1{,}000\) | \(188.11\text{MiB}\) | \( 1 \text{ spurious read every } 39.10\text{ lookups }\) | + +---------------------------+----------------------+------------------------------------------------------------------+ + | \(1\text{ in }10{,}000\) | \(269.06\text{MiB}\) | \( 1 \text{ spurious read every } 391.01\text{ lookups }\) | + +---------------------------+----------------------+------------------------------------------------------------------+ + | \(1\text{ in }100{,}000\) | \(360.25\text{MiB}\) | \( 1 \text{ spurious read every } 3910.19\text{ lookups }\) | + +---------------------------+----------------------+------------------------------------------------------------------+ + + ==== Fine-tuning: Fence-Pointer Index Type #fine_tuning_fence_pointer_index_type# + + The configuration parameter @confFencePointerIndex@ affects the type and size of the fence-pointer indexes. + Tables maintain a fence-pointer index in memory for each run on disk. + These fence-pointer indexes store the keys at the boundaries of each page of memory to ensure that each lookup has to read at most one page of memory from each run. + Tables support two types of fence-pointer indexes: + + [@OrdinaryIndex@] + Ordinary indexes are designed for any use case. + + Ordinary indexes store one serialised key per page of memory. + The total in-memory size of all indexes is \(K \cdot \frac{n}{P}\) bits, + where \(K\) refers to the average size of a serialised key in bits. + + [@CompactIndex@] + Compact indexes are designed for the use case where the keys in the table are uniformly distributed, such as when using hashes. + + Compact indexes store the 64 most significant bits of the minimum serialised key of each page of memory. + This requires that serialised keys are /at least/ 64 bits in size. + Compact indexes store 1 additional bit per page of memory to resolve collisions, 1 additional bit per page of memory to mark entries that are larger than one page, and a negligible amount of memory for tie breakers. + The total in-memory size of all indexes is \(66 \cdot \frac{n}{P}\) bits. + + ==== Fine-tuning: Disk Cache Policy #fine_tuning_disk_cache_policy# + + The configuration parameter @confDiskCachePolicy@ determines how the database uses the OS page cache. + This may improve performance if the database's /access pattern/ has good /temporal locality/ or good /spatial locality/. + The database's access pattern refers to the pattern by which entries are accessed by lookup operations. + An access pattern has good temporal locality if it is likely to access entries that were recently accessed or updated. + An access pattern has good spatial locality if it is likely to access entries that have nearby keys. 
+ + * Use the @DiskCacheAll@ policy if the database's access pattern has either good spatial locality or both good spatial and temporal locality. + * Use the @DiskCacheLevelOneTo l@ policy if the database's access pattern has good temporal locality for updates only. + The variable @l@ determines the number of levels that are cached. + For a description of levels, see [Merge Policy, Size Ratio, and Write Buffer Size](#fine_tuning_data_layout). + With this setting, the database can be expected to cache up to \(\frac{k}{P}\) pages of memory, + where \(k\) refers to the number of entries that fit in levels \([1,l]\) and is defined as \(\sum_{i=1}^{l}BT^{i}\). + * Use the @DiskCacheNone@ policy if the database's access pattern has does not have good spatial or temporal locality. + For instance, if the access pattern is uniformly random. + + == References The implementation of LSM-trees in this package draws inspiration from: diff --git a/scripts/generate-readme.hs b/scripts/generate-readme.hs index 743203064..4fdec09fb 100755 --- a/scripts/generate-readme.hs +++ b/scripts/generate-readme.hs @@ -7,7 +7,8 @@ build-depends: , pandoc ^>=3.6.4 , text >=2.1 -} -{-# LANGUAGE LambdaCase #-} +{-# LANGUAGE LambdaCase #-} +{-# LANGUAGE OverloadedStrings #-} module Main (main) where @@ -22,7 +23,7 @@ import qualified Distribution.Types.PackageDescription as PackageDescription import Distribution.Utils.ShortText (fromShortText) import System.IO (hPutStrLn, stderr) import Text.Pandoc (runIOorExplode) -import Text.Pandoc.Extensions (githubMarkdownExtensions) +import Text.Pandoc.Extensions (getDefaultExtensions) import Text.Pandoc.Options (ReaderOptions (..), WriterOptions (..), def) import Text.Pandoc.Readers (readHaddock) @@ -45,6 +46,6 @@ main = do runIOorExplode $ do doc1 <- readHaddock def description let doc2 = headerShift 1 doc1 - writeMarkdown def{writerExtensions = githubMarkdownExtensions} doc2 + writeMarkdown def{writerExtensions = getDefaultExtensions "gfm"} doc2 let readme = T.unlines [readmeHeaderContent, body] TIO.writeFile "README.md" readme diff --git a/src/Database/LSMTree.hs b/src/Database/LSMTree.hs index 6ffbb7b02..f9fdea764 100644 --- a/src/Database/LSMTree.hs +++ b/src/Database/LSMTree.hs @@ -113,13 +113,12 @@ module Database.LSMTree ( ), defaultTableConfig, MergePolicy (LazyLevelling), + MergeSchedule (..), SizeRatio (Four), WriteBufferAlloc (AllocNumEntries), BloomFilterAlloc (AllocFixed, AllocRequestFPR), - defaultBloomFilterAlloc, FencePointerIndexType (OrdinaryIndex, CompactIndex), DiskCachePolicy (..), - MergeSchedule (..), -- ** Table Configuration Overrides #table_configuration_overrides# OverrideDiskCachePolicy (..), @@ -156,12 +155,6 @@ module Database.LSMTree ( resolveValidOutput, resolveAssociativity, - -- * Tracer - Tracer, - LSMTreeTrace (..), - TableTrace (..), - CursorTrace (..), - -- * Errors #errors# SessionDirDoesNotExistError (..), SessionDirLockedError (..), @@ -178,6 +171,24 @@ module Database.LSMTree ( BlobRefInvalidError (..), CursorClosedError (..), InvalidSnapshotNameError (..), + + -- * Traces #traces# + Tracer, + LSMTreeTrace (..), + TableTrace (..), + CursorTrace (..), + MergeTrace (..), + CursorId (..), + TableId (..), + AtLevel (..), + LevelNo (..), + NumEntries (..), + RunNumber (..), + MergePolicyForLevel (..), + LevelMergeType (..), + RunParams (..), + RunDataCaching (..), + IndexType (..), ) where import Control.Concurrent.Class.MonadMVar.Strict (MonadMVar) @@ -203,17 +214,24 @@ import qualified Database.LSMTree.Internal.BlobRef as Internal import 
Database.LSMTree.Internal.Config (BloomFilterAlloc (AllocFixed, AllocRequestFPR), DiskCachePolicy (..), FencePointerIndexType (..), - MergePolicy (..), MergeSchedule (..), SizeRatio (..), - TableConfig (..), WriteBufferAlloc (..), - defaultBloomFilterAlloc, defaultTableConfig, - serialiseKeyMinimalSize) + LevelNo (..), MergePolicy (..), MergeSchedule (..), + SizeRatio (..), TableConfig (..), WriteBufferAlloc (..), + defaultTableConfig, serialiseKeyMinimalSize) import Database.LSMTree.Internal.Config.Override (OverrideDiskCachePolicy (..)) +import Database.LSMTree.Internal.Entry (NumEntries (..)) import qualified Database.LSMTree.Internal.Entry as Entry +import Database.LSMTree.Internal.Merge (LevelMergeType (..)) +import Database.LSMTree.Internal.MergeSchedule (AtLevel (..), + MergePolicyForLevel (..), MergeTrace (..)) import Database.LSMTree.Internal.Paths (SnapshotName, isValidSnapshotName, toSnapshotName) import Database.LSMTree.Internal.Range (Range (..)) import Database.LSMTree.Internal.RawBytes (RawBytes (..)) +import Database.LSMTree.Internal.RunBuilder (IndexType (..), + RunDataCaching (..), RunParams (..)) +import Database.LSMTree.Internal.RunNumber (CursorId (..), + RunNumber (..), TableId (..)) import qualified Database.LSMTree.Internal.Serialise as Internal import Database.LSMTree.Internal.Serialise.Class (SerialiseKey (..), SerialiseKeyOrderPreserving, SerialiseValue (..), diff --git a/src/Database/LSMTree/Internal/Config.hs b/src/Database/LSMTree/Internal/Config.hs index 48f864875..4a86cce96 100644 --- a/src/Database/LSMTree/Internal/Config.hs +++ b/src/Database/LSMTree/Internal/Config.hs @@ -16,7 +16,6 @@ module Database.LSMTree.Internal.Config ( , WriteBufferAlloc (..) -- * Bloom filter allocation , BloomFilterAlloc (..) - , defaultBloomFilterAlloc , bloomFilterAllocForLevel -- * Fence pointer index , FencePointerIndexType (..) @@ -27,7 +26,6 @@ module Database.LSMTree.Internal.Config ( , diskCachePolicyForLevel -- * Merge schedule , MergeSchedule (..) - , defaultMergeSchedule ) where import Control.DeepSeq (NFData (..)) @@ -48,26 +46,57 @@ newtype LevelNo = LevelNo Int Table configuration -------------------------------------------------------------------------------} --- | Table configuration parameters, including LSM tree tuning parameters. --- --- Some config options are fixed (for now): --- --- * Merge policy: Tiering --- --- * Size ratio: 4 +{- | +A collection of configuration parameters for tables, which can be used to tune the performance of the table. +To construct a 'TableConfig', modify the 'defaultTableConfig', which defines reasonable defaults for all parameters. + +For a detailed discussion of fine-tuning the table configuration, see [Fine-tuning Table Configuration](../#fine_tuning). + +[@confMergePolicy :: t'MergePolicy'@] + The /merge policy/ balances the performance of lookups against the performance of updates. + Levelling favours lookups. + Tiering favours updates. + Lazy levelling strikes a middle ground between levelling and tiering, and moderately favours updates. + This parameter is explicitly referenced in the documentation of those operations it affects. + +[@confSizeRatio :: t'SizeRatio'@] + The /size ratio/ pushes the effects of the merge policy to the extreme. + If the size ratio is higher, levelling favours lookups more, and tiering and lazy levelling favour updates more. + This parameter is referred to as \(T\) in the disk I\/O cost of operations. 
+ +[@confWriteBufferAlloc :: t'WriteBufferAlloc'@] + The /write buffer capacity/ balances the performance of lookups and updates against the in-memory size of the database. + If the write buffer is larger, it takes up more memory, but lookups and updates are more efficient. + This parameter is referred to as \(B\) in the disk I\/O cost of operations. + Irrespective of this parameter, the write buffer size cannot exceed 4GiB. + +[@confMergeSchedule :: t'MergeSchedule'@] + The /merge schedule/ balances the performance of lookups and updates against the consistency of updates. + The merge schedule does not affect the performance of table unions. + With the one-shot merge schedule, lookups and updates are more efficient overall, but some updates may take much longer than others. + With the incremental merge schedule, lookups and updates are less efficient overall, but each update does a similar amount of work. + This parameter is explicitly referenced in the documentation of those operations it affects. + +[@confBloomFilterAlloc :: t'BloomFilterAlloc'@] + The Bloom filter size balances the performance of lookups against the in-memory size of the database. + If the Bloom filters are larger, they take up more memory, but lookup operations are more efficient. + +[@confFencePointerIndex :: t'FencePointerIndexType'@] + The /fence-pointer index type/ supports two types of indexes. + The /ordinary/ indexes are designed to work with any key. + The /compact/ indexes are optimised for the case where the keys in the database are uniformly distributed, e.g., when the keys are hashes. + +[@confDiskCachePolicy :: t'DiskCachePolicy'@] + The /disk cache policy/ supports caching lookup operations using the OS page cache. + Caching may improve the performance of lookups if database access follows certain patterns. +-} data TableConfig = TableConfig { confMergePolicy :: !MergePolicy , confMergeSchedule :: !MergeSchedule - -- Size ratio between the capacities of adjacent levels. , confSizeRatio :: !SizeRatio - -- | Total number of bytes that the write buffer can use. - -- - -- The maximum is 4GiB, which should be more than enough for realistic - -- applications. , confWriteBufferAlloc :: !WriteBufferAlloc , confBloomFilterAlloc :: !BloomFilterAlloc , confFencePointerIndex :: !FencePointerIndexType - -- | The policy for caching key\/value data from disk in memory. , confDiskCachePolicy :: !DiskCachePolicy } deriving stock (Show, Eq) @@ -76,19 +105,31 @@ instance NFData TableConfig where rnf (TableConfig a b c d e f g) = rnf a `seq` rnf b `seq` rnf c `seq` rnf d `seq` rnf e `seq` rnf f `seq` rnf g --- | A reasonable default 'TableConfig'. +-- | The 'defaultTableConfig' defines reasonable defaults for all 'TableConfig' parameters. -- --- This uses a write buffer with up to 20,000 elements and a generous amount of --- memory for Bloom filters (FPR of 1%). 
+-- >>> confMergePolicy defaultTableConfig +-- LazyLevelling +-- >>> confMergeSchedule defaultTableConfig +-- Incremental +-- >>> confSizeRatio defaultTableConfig +-- Four +-- >>> confWriteBufferAlloc defaultTableConfig +-- AllocNumEntries 20000 +-- >>> confBloomFilterAlloc defaultTableConfig +-- AllocRequestFPR 1.0e-3 +-- >>> confFencePointerIndex defaultTableConfig +-- OrdinaryIndex +-- >>> confDiskCachePolicy defaultTableConfig +-- DiskCacheAll -- defaultTableConfig :: TableConfig defaultTableConfig = TableConfig { confMergePolicy = LazyLevelling - , confMergeSchedule = defaultMergeSchedule + , confMergeSchedule = Incremental , confSizeRatio = Four , confWriteBufferAlloc = AllocNumEntries 20_000 - , confBloomFilterAlloc = defaultBloomFilterAlloc + , confBloomFilterAlloc = AllocRequestFPR 1.0e-3 , confFencePointerIndex = OrdinaryIndex , confDiskCachePolicy = DiskCacheAll } @@ -107,12 +148,19 @@ runParamsForLevel conf@TableConfig {..} levelNo = Merge policy -------------------------------------------------------------------------------} +{- | +The /merge policy/ balances the performance of lookups against the performance of updates. +Levelling favours lookups. +Tiering favours updates. +Lazy levelling strikes a middle ground between levelling and tiering, and moderately favours updates. +This parameter is explicitly referenced in the documentation of those operations it affects. + +__NOTE:__ This package only supports lazy levelling. + +For a detailed discussion of the merge policy, see [Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size](../#fine_tuning_data_layout). +-} data MergePolicy = - -- | Use tiering on intermediate levels, and levelling on the last level. - -- This makes it easier for delete operations to disappear on the last - -- level. LazyLevelling - -- TODO: add other merge policies, like tiering and levelling. deriving stock (Eq, Show) instance NFData MergePolicy where @@ -122,6 +170,15 @@ instance NFData MergePolicy where Size ratio -------------------------------------------------------------------------------} +{- | +The /size ratio/ pushes the effects of the merge policy to the extreme. +If the size ratio is higher, levelling favours lookups more, and tiering and lazy levelling favour updates more. +This parameter is referred to as \(T\) in the disk I\/O cost of operations. + +__NOTE:__ This package only supports a size ratio of four. + +For a detailed discussion of the size ratio, see [Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size](../#fine_tuning_data_layout). +-} data SizeRatio = Four deriving stock (Eq, Show) @@ -135,53 +192,83 @@ sizeRatioInt = \case Four -> 4 Write buffer allocation -------------------------------------------------------------------------------} --- | Allocation method for the write buffer. +-- TODO: "If the sizes of values vary greatly, this can lead to unevenly sized runs on disk and unpredictable performance." + +{- | +The /write buffer capacity/ balances the performance of lookups and updates against the in-memory size of the table. +If the write buffer is larger, it takes up more memory, but lookups and updates are more efficient. +Irrespective of this parameter, the write buffer size cannot exceed 4GiB. + +For a detailed discussion of the size ratio, see [Fine-tuning: Merge Policy, Size Ratio, and Write Buffer Size](../#fine_tuning_data_layout). +-} data WriteBufferAlloc = - -- | Total number of key\/value pairs that can be present in the write - -- buffer before flushing the write buffer to disk. 
- -- - -- NOTE: if the sizes of values vary greatly, this can lead to wonky runs on - -- disk, and therefore unpredictable performance. + {- | + Allocate space for the in-memory write buffer to fit the requested number of entries. + This parameter is referred to as \(B\) in the disk I\/O cost of operations. + -} AllocNumEntries !Int deriving stock (Show, Eq) instance NFData WriteBufferAlloc where rnf (AllocNumEntries n) = rnf n +{------------------------------------------------------------------------------- + Merge schedule +-------------------------------------------------------------------------------} + +{- | +The /merge schedule/ balances the performance of lookups and updates against the consistency of updates. +The merge schedule does not affect the performance of table unions. +With the one-shot merge schedule, lookups and updates are more efficient overall, but some updates may take much longer than others. +With the incremental merge schedule, lookups and updates are less efficient overall, but each update does a similar amount of work. +This parameter is explicitly referenced in the documentation of those operations it affects. + +For a detailed discussion of the effect of the merge schedule, see [Fine-tuning: Merge Schedule](../#fine_tuning_merge_schedule). +-} +data MergeSchedule = + {- | + The 'OneShot' merge schedule causes the merging algorithm to complete merges immediately. + This is more efficient than the 'Incremental' merge schedule, but has an inconsistent workload. + Using the 'OneShot' merge schedule, the worst-case disk I\/O complexity of the update operations is /linear/ in the size of the table. + For real-time systems and other use cases where unresponsiveness is unacceptable, use the 'Incremental' merge schedule. + -} + OneShot + {- | + The 'Incremental' merge schedule spreads out the merging work over time. + This is less efficient than the 'OneShot' merge schedule, but has a consistent workload. + Using the 'Incremental' merge schedule, the worst-case disk I\/O complexity of the update operations is /logarithmic/ in the size of the table. + -} + | Incremental + deriving stock (Eq, Show) + +instance NFData MergeSchedule where + rnf OneShot = () + rnf Incremental = () + {------------------------------------------------------------------------------- Bloom filter allocation -------------------------------------------------------------------------------} --- | Allocation method for bloom filters. --- --- NOTE: a __physical__ database entry is a key\/operation pair that exists in a --- file, i.e., a run. Multiple physical entries that have the same key --- constitute a __logical__ database entry. --- --- There is a trade-off between bloom filter memory size, and the false --- positive rate. A higher false positive rate (FPR) leads to more unnecessary --- I\/O. As a guide, here are some points on the trade-off: --- --- * FPR of 1e-2 requires approximately 9.9 bits per element --- * FPR of 1e-3 requires approximately 15.8 bits per element --- * FPR of 1e-4 requires approximately 22.6 bits per element --- --- The policy can be specified either by fixing a FPR or by fixing the number --- of bits per entry. --- +{- | +The Bloom filter size balances the performance of lookups against the in-memory size of the table. +If the Bloom filters are larger, they take up more memory, but lookup operations are more efficient. + +For a detailed discussion of the Bloom filter size, see [Fine-tuning: Bloom Filter Size](../#fine_tuning_bloom_filter_size). 
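+
+As an illustrative sketch (the names below are hypothetical and not part of this package), the two allocation strategies can be written as follows, using values that fall within the recommended ranges documented on the constructors:
+
+@
+sixteenBitsPerEntry, onePercentFPR :: BloomFilterAlloc
+sixteenBitsPerEntry = AllocFixed 16
+onePercentFPR       = AllocRequestFPR 1.0e-2
+@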
+-}
 data BloomFilterAlloc =
-    -- | Allocate a fixed number of bits per physical entry in each bloom
-    -- filter. Non-integer values are legal. Once the number of entries is know,
-    -- the number of bits is rounded.
-    --
-    -- The value must strictly positive, 0 < x. Sane values are 2 .. 24.
-    --
+    {- |
+    Allocate the requested number of bits per entry in the table.
+
+    The value must be strictly positive, but fractional values are permitted.
+    The recommended range is \([2, 24]\).
+    -}
     AllocFixed !Double
-  | -- | Allocate as many bits as required per physical entry to get the requested
-    -- false-positive rate. Do this for each bloom filter.
-    --
-    -- The value must be in the range 0 < x < 1. Sane values are 1e-2 .. 1e-5.
-    --
+  | {- |
+    Allocate the required number of bits per entry to get the requested false-positive rate.
+
+    The value must be in the range \((0, 1)\).
+    The recommended range is \([1\mathrm{e}{ -5 },1\mathrm{e}{ -2 }]\).
+    -}
    AllocRequestFPR !Double
  deriving stock (Show, Eq)
 
@@ -189,9 +276,6 @@ instance NFData BloomFilterAlloc where
   rnf (AllocFixed n) = rnf n
   rnf (AllocRequestFPR fpr) = rnf fpr
 
-defaultBloomFilterAlloc :: BloomFilterAlloc
-defaultBloomFilterAlloc = AllocRequestFPR 1e-3
-
 bloomFilterAllocForLevel :: TableConfig -> RunLevelNo -> RunBloomFilterAlloc
 bloomFilterAllocForLevel conf _levelNo =
     case confBloomFilterAlloc conf of
@@ -202,27 +286,31 @@ bloomFilterAllocForLevel conf _levelNo =
   Fence pointer index
-------------------------------------------------------------------------------}
 
--- | Configure the type of fence pointer index.
+{- |
+The /fence-pointer index type/ supports two types of indexes.
+The /ordinary/ indexes are designed to work with any key.
+The /compact/ indexes are optimised for the case where the keys in the database are uniformly distributed, e.g., when the keys are hashes.
+
+For a detailed discussion of the fence-pointer index types, see [Fine-tuning: Fence-Pointer Index Type](../#fine_tuning_fence_pointer_index_type).
+-}
 data FencePointerIndexType =
-    -- | Use a compact fence pointer index.
-    --
-    -- Compact indexes are designed to work with keys that are large (for
-    -- example, 32 bytes long) cryptographic hashes.
-    --
-    -- When using a compact index, it is vital that the
-    -- 'Database.LSMTree.Internal.Serialise.Class.serialiseKey' function
-    -- satisfies the following law:
-    --
-    -- [Minimal size] @'Database.LSMTree.Internal.RawBytes.size'
-    -- ('Database.LSMTree.Internal.Serialise.Class.serialiseKey' x) >= 8@
-    --
-    -- Use 'serialiseKeyMinimalSize' to test this law.
+    {- |
+    Ordinary indexes are designed to work with any key.
+
+    When using an ordinary index, the 'Database.LSMTree.Internal.Serialise.Class.serialiseKey' function must not produce output of 64KiB or more.
+    -}
+    OrdinaryIndex
+  | {- |
+    Compact indexes are designed for the case where the keys in the database are uniformly distributed, e.g., when the keys are hashes.
+
+    When using a compact index, the 'Database.LSMTree.Internal.Serialise.Class.serialiseKey' function must satisfy the following additional law:
+
+    [Minimal size]
+      @'Database.LSMTree.Internal.RawBytes.size' ('Database.LSMTree.Internal.Serialise.Class.serialiseKey' x) >= 8@
+
+    Use 'serialiseKeyMinimalSize' to test this law.
+    -}
     CompactIndex
-    -- | Use an ordinary fence pointer index
-    --
-    -- Ordinary indexes do not have any constraints on keys other than that
-    -- their serialised forms may not be 64 KiB or more in size.
-  | OrdinaryIndex
   deriving stock (Eq, Show)
 
 instance NFData FencePointerIndexType where
@@ -241,48 +329,41 @@ serialiseKeyMinimalSize x = RB.size (serialiseKey x) >= 8
   Disk cache policy
-------------------------------------------------------------------------------}
 
--- | The policy for caching data from disk in memory (using the OS page cache).
---
--- Caching data in memory can improve performance if the access pattern has
--- good access locality or if the overall data size fits within memory. On the
--- other hand, caching is detrimental to performance and wastes memory if the
--- access pattern has poor spatial or temporal locality.
---
--- This implementation is designed to have good performance using a cacheless
--- policy, where main memory is used only to cache Bloom filters and indexes,
--- but none of the key\/value data itself. Nevertheless, some use cases will be
--- faster if some or all of the key\/value data is also cached in memory. This
--- implementation does not do any custom caching of key\/value data, relying
--- simply on the OS page cache. Thus caching is done in units of 4kb disk pages
--- (as opposed to individual key\/value pairs for example).
---
-data DiskCachePolicy =
+{- |
+The /disk cache policy/ determines if lookup operations use the OS page cache.
+Caching may improve the performance of lookups if database access follows certain patterns.
 
-    -- | Use the OS page cache to cache any\/all key\/value data in the
-    -- table.
-    --
-    -- Use this policy if the expected access pattern for the table
-    -- has a good spatial or temporal locality.
-    DiskCacheAll
-
-    -- | Use the OS page cache to cache data in all LSMT levels from 0 to
-    -- a given level number. For example, use 1 to cache the first level.
-    -- (The write buffer is considered to be level 0.)
-    --
-    -- Use this policy if the expected access pattern for the table
-    -- has good temporal locality for recently inserted keys.
-  | DiskCacheLevelOneTo !Int
-
-    --TODO: Add a policy based on size in bytes rather than internal details
-    -- like levels. An easy use policy would be to say: "cache the first 10
-    -- Mb" and have everything worked out from that.
-
-    -- | Do not cache any key\/value data in any level (except the write
-    -- buffer).
-    --
-    -- Use this policy if expected access pattern for the table has poor
-    -- spatial or temporal locality, such as uniform random access.
-  | DiskCacheNone
+For a detailed discussion of the disk cache policy, see [Fine-tuning: Disk Cache Policy](../#fine_tuning_disk_cache_policy).
+-}
+data DiskCachePolicy =
+    {- |
+    Cache all data in the table.
+
+    Use this policy if the database's access pattern has either good spatial locality or both good spatial and temporal locality.
+    -}
+    DiskCacheAll
+
+  | {- |
+    Cache the data in the freshest @l@ levels.
+
+    Use this policy if the database's access pattern only has good temporal locality.
+
+    The variable @l@ determines the number of levels that are cached.
+    For a description of levels, see [Merge Policy, Size Ratio, and Write Buffer Size](../#fine_tuning_data_layout).
+    With this setting, the database can be expected to cache up to \(\frac{k}{P}\) pages of memory,
+    where \(k\) refers to the number of entries that fit in levels \([1,l]\) and is defined as \(\sum_{i=1}^{l}BT^{i}\).
+    -}
+    -- TODO: Add a policy for caching based on size in bytes, rather than exposing internal details such as levels.
+    -- For instance, a policy that states "cache the freshest 10MiB".
+    DiskCacheLevelOneTo !Int
+
+  | {- |
+    Do not cache any table data.
+
+    Use this policy if the database's access pattern does not have good spatial or temporal locality,
+    for instance, if the access pattern is uniformly random.
+    -}
+    DiskCacheNone
   deriving stock (Show, Eq)
 
 instance NFData DiskCachePolicy where
@@ -303,40 +384,3 @@ diskCachePolicyForLevel policy levelNo =
       RegularLevel l
         | l <= LevelNo n -> CacheRunData
         | otherwise      -> NoCacheRunData
       UnionLevel         -> NoCacheRunData
-
-{-------------------------------------------------------------------------------
-  Merge schedule
--------------------------------------------------------------------------------}
-
--- | A configuration option that determines how merges are stepped to
--- completion. This does not affect the amount of work that is done by merges,
--- only how the work is spread out over time.
-data MergeSchedule =
-    -- | Complete merges immediately when started.
-    --
-    -- The 'OneShot' option will make the merging algorithm perform /big/ batches
-    -- of work in one go, so intermittent slow-downs can be expected. For use
-    -- cases where unresponsiveness is unacceptable, e.g. in real-time systems,
-    -- use 'Incremental' instead.
-    OneShot
-    -- | Schedule merges for incremental construction, and step the merge when
-    -- updates are performed on a table.
-    --
-    -- The 'Incremental' option spreads out merging work over time. More
-    -- specifically, updates to a table can cause a /small/ batch of merge work
-    -- to be performed. The scheduling of these batches is designed such that
-    -- merges are fully completed in time for when new merges are started on the
-    -- same level.
-  | Incremental
-  deriving stock (Eq, Show)
-
-instance NFData MergeSchedule where
-  rnf OneShot = ()
-  rnf Incremental = ()
-
--- | The default 'MergeSchedule'.
---
--- >>> defaultMergeSchedule
--- Incremental
-defaultMergeSchedule :: MergeSchedule
-defaultMergeSchedule = Incremental
diff --git a/src/Database/LSMTree/Internal/Config/Override.hs b/src/Database/LSMTree/Internal/Config/Override.hs
index 6eac28965..a2e7d5877 100644
--- a/src/Database/LSMTree/Internal/Config/Override.hs
+++ b/src/Database/LSMTree/Internal/Config/Override.hs
@@ -48,10 +48,13 @@ import Database.LSMTree.Internal.Snapshot
   Override disk cache policy
-------------------------------------------------------------------------------}
 
--- | Override the 'DiskCachePolicy'
+{- |
+The 'OverrideDiskCachePolicy' can be used to override the 'DiskCachePolicy'
+when opening a table from a snapshot.
+-}
 data OverrideDiskCachePolicy =
-    OverrideDiskCachePolicy DiskCachePolicy
-  | NoOverrideDiskCachePolicy
+    NoOverrideDiskCachePolicy
+  | OverrideDiskCachePolicy DiskCachePolicy
   deriving stock (Show, Eq)
 
 -- | Override the disk cache policy that is stored in snapshot metadata.
diff --git a/src/Database/LSMTree/Internal/MergeSchedule.hs b/src/Database/LSMTree/Internal/MergeSchedule.hs
index b515753e2..0690b7188 100644
--- a/src/Database/LSMTree/Internal/MergeSchedule.hs
+++ b/src/Database/LSMTree/Internal/MergeSchedule.hs
@@ -494,7 +494,7 @@ updatesWithInterleavedFlushes tr conf resolve hfs hbio root uc es reg tc = do
     (wb', es') <- addWriteBufferEntries hfs resolve wbblobs maxn wb es
     -- Supply credits before flushing, so that we complete merges in time. The
     -- number of supplied credits is based on the size increase of the write
-    -- buffer, not the the number of processed entries @length es' - length es@.
+    -- buffer, not the number of processed entries @length es' - length es@.
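+    -- For illustration (the numbers are made up): if the write buffer grows
+    -- from 10 to 25 entries, then 15 nominal credits are supplied, regardless
+    -- of how many input entries were processed to produce that growth.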
     let numAdded = unNumEntries (WB.numEntries wb') - unNumEntries (WB.numEntries wb)
     supplyCredits conf (NominalCredits numAdded) (tableLevels tc)
     let tc' = tc { tableWriteBuffer = wb' }
diff --git a/src/Database/LSMTree/Internal/Range.hs b/src/Database/LSMTree/Internal/Range.hs
index 44aed84db..27421f48a 100644
--- a/src/Database/LSMTree/Internal/Range.hs
+++ b/src/Database/LSMTree/Internal/Range.hs
@@ -13,9 +13,13 @@ import Control.DeepSeq (NFData (..))
 -- | A range of keys.
 data Range k =
-    -- | Inclusive lower bound, exclusive upper bound
+    {- |
+    @'FromToExcluding' i j@ is the range from @i@ (inclusive) to @j@ (exclusive).
+    -}
     FromToExcluding k k
-    -- | Inclusive lower bound, inclusive upper bound
+    {- |
+    @'FromToIncluding' i j@ is the range from @i@ (inclusive) to @j@ (inclusive).
+    -}
   | FromToIncluding k k
   deriving stock (Show, Eq, Functor)
 
diff --git a/src/Database/LSMTree/Internal/RawBytes.hs b/src/Database/LSMTree/Internal/RawBytes.hs
index abc827bb1..bc4aa412f 100644
--- a/src/Database/LSMTree/Internal/RawBytes.hs
+++ b/src/Database/LSMTree/Internal/RawBytes.hs
@@ -69,6 +69,7 @@ import Prelude hiding (drop, take)
 import GHC.Exts
 import GHC.Stack
 import GHC.Word
+import Text.Printf (printf)
 
 {- Note: [Export structure]
    ~~~~~~~~~~~~~~~~~~~~~~~
@@ -80,15 +81,30 @@ import GHC.Word
   Raw bytes
-------------------------------------------------------------------------------}
 
--- | Raw bytes with no alignment constraint (i.e. byte aligned), and no
--- guarantee of pinned or unpinned memory (i.e. could be either).
+{- |
+Raw bytes.
+
+This type imposes no alignment constraint and provides no guarantee of whether the memory is pinned or unpinned.
+-}
 newtype RawBytes = RawBytes (VP.Vector Word8)
-  deriving newtype (Show, NFData)
+  deriving newtype (NFData)
+
+-- TODO: Should we have a more well-behaved instance for 'Show'?
+-- For instance, an instance that prints the bytes as a hexadecimal string?
+deriving newtype instance Show RawBytes
+
+_showBytesAsHex :: RawBytes -> ShowS
+_showBytesAsHex (RawBytes bytes) = VP.foldr ((.) . showByte) id bytes
+  where
+    showByte :: Word8 -> ShowS
+    showByte = showString . printf "%02x"
 
 instance Eq RawBytes where
   bs1 == bs2 = compareBytes bs1 bs2 == EQ
 
--- | Lexicographical 'Ord' instance.
+{- |
+This instance uses lexicographic ordering.
+-}
 instance Ord RawBytes where
   compare = compareBytes
 
@@ -113,6 +129,11 @@ instance Hashable RawBytes where
   hash :: Word64 -> RawBytes -> Word64
   hash salt (RawBytes (VP.Vector off len ba)) = hashByteArray ba off len salt
 
+{- |
+@'fromList'@: \(O(n)\).
+
+@'toList'@: \(O(n)\).
+-}
 instance IsList RawBytes where
   type Item RawBytes = Word8
 
@@ -122,9 +143,13 @@ instance IsList RawBytes where
   toList :: RawBytes -> [Item RawBytes]
   toList = unpack
 
--- | Mostly to make test cases shorter to write.
+{- |
+@'fromString'@: \(O(n)\).
+
+__Warning:__ 'fromString' truncates multi-byte characters to octets, e.g., \"枯朶に烏のとまりけり秋の暮\" becomes \"�6k�nh~�Q��n�\".
+-}
 instance IsString RawBytes where
-  fromString = pack . map (fromIntegral . fromEnum)
+  fromString = fromByteString . fromString
 
 {-------------------------------------------------------------------------------
   Accessors
-------------------------------------------------------------------------------}
@@ -171,9 +196,19 @@ toWord64 x# = byteSwap64 (W64# x#)
   Construction
-------------------------------------------------------------------------------}
 
+{- |
+@('<>')@: \(O(n)\).
+
+@'Data.Semigroup.sconcat'@: \(O(n)\).
+-}
 instance Semigroup RawBytes where
   (<>) = coerce (VP.++)
 
+{- |
+@'mempty'@: \(O(1)\).
+
+@'mconcat'@: \(O(n)\).
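+
+For instance (an illustrative sketch using the 'IsList' instance defined above):
+
+@
+mconcat [fromList [0x00], fromList [0x01, 0x02]] == fromList [0x00, 0x01, 0x02]
+@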
+-} instance Monoid RawBytes where mempty = coerce VP.empty mconcat = coerce VP.concat diff --git a/src/Database/LSMTree/Internal/Serialise/Class.hs b/src/Database/LSMTree/Internal/Serialise/Class.hs index f83904bff..314cabffc 100644 --- a/src/Database/LSMTree/Internal/Serialise/Class.hs +++ b/src/Database/LSMTree/Internal/Serialise/Class.hs @@ -42,13 +42,15 @@ import Numeric (showInt) SerialiseKey -------------------------------------------------------------------------------} --- | Serialisation of keys. --- --- Instances should satisfy the following laws: --- --- [Identity] @'deserialiseKey' ('serialiseKey' x) == x@ --- [Identity up to slicing] @'deserialiseKey' ('packSlice' prefix ('serialiseKey' x) suffix) == x@ --- +{- | Serialisation of keys. + +Instances should satisfy the following laws: + +[Identity] + @'deserialiseKey' ('serialiseKey' x) == x@ +[Identity up to slicing] + @'deserialiseKey' ('packSlice' prefix ('serialiseKey' x) suffix) == x@ +-} class SerialiseKey k where serialiseKey :: k -> RawBytes -- TODO: 'deserialiseKey' is only strictly necessary for range queries. @@ -67,26 +69,27 @@ serialiseKeyIdentityUpToSlicing :: serialiseKeyIdentityUpToSlicing prefix x suffix = deserialiseKey (packSlice prefix (serialiseKey x) suffix) == x --- | Order-preserving serialisation of keys --- --- Internally, the library sorts key\/value pairs using the ordering of --- /serialised/ keys. Range lookups and cursor reads return key\/value according --- to this ordering. As such, if serialisation does not preserve the ordering of --- /unserialised/ keys, then range lookups and cursor reads will return --- /unserialised/ keys out of order. --- --- Instances that prevent keys from being returned out of order should satisfy --- the following law: --- --- [Ordering-preserving] @x \`'compare'\` y == 'serialiseKey' x \`'compare'\` 'serialiseKey' y@ --- --- Serialised keys (raw bytes) are lexicographically ordered, which means that --- keys should be serialised into big-endian formats to satisfy the --- __Ordering-preserving__ law, --- +{- | +Order-preserving serialisation of keys. + +Table data is sorted by /serialised/ keys. +Range lookups and cursors return entries in this order. +If serialisation does not preserve the ordering of /unserialised/ keys, +then range lookups and cursors return entries out of order. + +If the 'SerialiseKey' instance for a type preserves the ordering, +then it can safely be given an instance of 'SerialiseKeyOrderPreserving'. +These should satisfy the following law: + +[Order-preserving] + @x \`'compare'\` y == 'serialiseKey' x \`'compare'\` 'serialiseKey' y@ + +Serialised keys are lexicographically ordered. +To satisfy the __Order-preserving__ law, keys should be serialised into a big-endian format. +-} class SerialiseKey k => SerialiseKeyOrderPreserving k where --- | Test the __Ordering-preserving__ law for the 'SerialiseKeyOrderPreserving' class +-- | Test the __Order-preserving__ law for the 'SerialiseKeyOrderPreserving' class serialiseKeyPreservesOrdering :: (Ord k, SerialiseKey k) => k -> k -> Bool serialiseKeyPreservesOrdering x y = x `compare` y == serialiseKey x `compare` serialiseKey y @@ -94,12 +97,16 @@ serialiseKeyPreservesOrdering x y = x `compare` y == serialiseKey x `compare` se SerialiseValue -------------------------------------------------------------------------------} --- | Serialisation of values and blobs. 
--- --- Instances should satisfy the following laws: --- --- [Identity] @'deserialiseValue' ('serialiseValue' x) == x@ --- [Identity up to slicing] @'deserialiseValue' ('packSlice' prefix ('serialiseValue' x) suffix) == x@ +{- | Serialisation of values and blobs. + +Instances should satisfy the following laws: + +[Identity] + @'deserialiseValue' ('serialiseValue' x) == x@ + +[Identity up to slicing] + @'deserialiseValue' ('packSlice' prefix ('serialiseValue' x) suffix) == x@ +-} class SerialiseValue v where serialiseValue :: v -> RawBytes deserialiseValue :: RawBytes -> v @@ -147,60 +154,110 @@ requireBytesExactly tyName expected actual x Int -------------------------------------------------------------------------------} +{- | +@'serialiseKey'@: \(O(1)\). + +@'deserialiseKey'@: \(O(1)\). +-} instance SerialiseKey Int8 where serialiseKey x = RB.RawBytes $ byteVectorFromPrim x deserialiseKey (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Int8" 1 len $ indexInt8Array ba off +{- | +@'serialiseValue'@: \(O(1)\). + +@'deserialiseValue'@: \(O(1)\). +-} instance SerialiseValue Int8 where serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x deserialiseValue (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Int8" 1 len $ indexInt8Array ba off +{- | +@'serialiseKey'@: \(O(1)\). + +@'deserialiseKey'@: \(O(1)\). +-} instance SerialiseKey Int16 where serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapInt16 x deserialiseKey (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Int16" 2 len $ byteSwapInt16 (indexWord8ArrayAsInt16 ba off) +{- | +@'serialiseValue'@: \(O(1)\). + +@'deserialiseValue'@: \(O(1)\). +-} instance SerialiseValue Int16 where serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x deserialiseValue (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Int16" 2 len $ indexWord8ArrayAsInt16 ba off +{- | +@'serialiseKey'@: \(O(1)\). + +@'deserialiseKey'@: \(O(1)\). +-} instance SerialiseKey Int32 where serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapInt32 x deserialiseKey (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Int32" 4 len $ byteSwapInt32 (indexWord8ArrayAsInt32 ba off) +{- | +@'serialiseValue'@: \(O(1)\). + +@'deserialiseValue'@: \(O(1)\). +-} instance SerialiseValue Int32 where serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x deserialiseValue (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Int32" 4 len $ indexWord8ArrayAsInt32 ba off +{- | +@'serialiseKey'@: \(O(1)\). + +@'deserialiseKey'@: \(O(1)\). +-} instance SerialiseKey Int64 where serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapInt64 x deserialiseKey (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Int64" 8 len $ byteSwapInt64 (indexWord8ArrayAsInt64 ba off) +{- | +@'serialiseValue'@: \(O(1)\). + +@'deserialiseValue'@: \(O(1)\). +-} instance SerialiseValue Int64 where serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x deserialiseValue (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Int64" 8 len $ indexWord8ArrayAsInt64 ba off +{- | +@'serialiseKey'@: \(O(1)\). + +@'deserialiseKey'@: \(O(1)\). +-} instance SerialiseKey Int where serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapInt x deserialiseKey (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Int" 8 len $ byteSwapInt (indexWord8ArrayAsInt ba off) +{- | +@'serialiseValue'@: \(O(1)\). + +@'deserialiseValue'@: \(O(1)\). 
+-} instance SerialiseValue Int where serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x @@ -211,6 +268,11 @@ instance SerialiseValue Int where Word -------------------------------------------------------------------------------} +{- | +@'serialiseKey'@: \(O(1)\). + +@'deserialiseKey'@: \(O(1)\). +-} instance SerialiseKey Word8 where serialiseKey x = RB.RawBytes $ byteVectorFromPrim x @@ -219,13 +281,22 @@ instance SerialiseKey Word8 where instance SerialiseKeyOrderPreserving Word8 +{- | +@'serialiseValue'@: \(O(1)\). + +@'deserialiseValue'@: \(O(1)\). +-} instance SerialiseValue Word8 where serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x deserialiseValue (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Word8" 1 len $ indexWord8Array ba off +{- | +@'serialiseKey'@: \(O(1)\). +@'deserialiseKey'@: \(O(1)\). +-} instance SerialiseKey Word16 where serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapWord16 x @@ -234,12 +305,22 @@ instance SerialiseKey Word16 where instance SerialiseKeyOrderPreserving Word16 +{- | +@'serialiseValue'@: \(O(1)\). + +@'deserialiseValue'@: \(O(1)\). +-} instance SerialiseValue Word16 where serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x deserialiseValue (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Word16" 2 len $ indexWord8ArrayAsWord16 ba off +{- | +@'serialiseKey'@: \(O(1)\). + +@'deserialiseKey'@: \(O(1)\). +-} instance SerialiseKey Word32 where serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapWord32 x @@ -248,12 +329,22 @@ instance SerialiseKey Word32 where instance SerialiseKeyOrderPreserving Word32 +{- | +@'serialiseValue'@: \(O(1)\). + +@'deserialiseValue'@: \(O(1)\). +-} instance SerialiseValue Word32 where serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x deserialiseValue (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Word32" 4 len $ indexWord8ArrayAsWord32 ba off +{- | +@'serialiseKey'@: \(O(1)\). + +@'deserialiseKey'@: \(O(1)\). +-} instance SerialiseKey Word64 where serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapWord64 x @@ -262,12 +353,22 @@ instance SerialiseKey Word64 where instance SerialiseKeyOrderPreserving Word64 +{- | +@'serialiseValue'@: \(O(1)\). + +@'deserialiseValue'@: \(O(1)\). +-} instance SerialiseValue Word64 where serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x deserialiseValue (RawBytes (VP.Vector off len ba)) = requireBytesExactly "Word64" 8 len $ indexWord8ArrayAsWord64 ba off +{- | +@'serialiseKey'@: \(O(1)\). + +@'deserialiseKey'@: \(O(1)\). +-} instance SerialiseKey Word where serialiseKey x = RB.RawBytes $ byteVectorFromPrim $ byteSwapWord x @@ -276,6 +377,11 @@ instance SerialiseKey Word where instance SerialiseKeyOrderPreserving Word +{- | +@'serialiseValue'@: \(O(1)\). + +@'deserialiseValue'@: \(O(1)\). +-} instance SerialiseValue Word where serialiseValue x = RB.RawBytes $ byteVectorFromPrim $ x @@ -286,21 +392,29 @@ instance SerialiseValue Word where String -------------------------------------------------------------------------------} --- | \( O(n) \) (de-)serialisation, where \(n\) is the number of characters in --- the string. The string is encoded using UTF8. --- --- TODO: optimise, it's \( O(n) + O(n) \) where it could be \( O(n) \). +{- | +@'serialiseKey'@: \(O(n)\). + +@'deserialiseKey'@: \(O(n)\). + +The 'String' is (de)serialised as UTF-8. +-} instance SerialiseKey String where + -- TODO: Optimise. The performance is \(O(n) + O(n)\) but it could be \(O(n)\). serialiseKey = serialiseKey . 
UTF8.fromString
   deserialiseKey = UTF8.toString . deserialiseKey
 
 instance SerialiseKeyOrderPreserving String
 
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of characters in
--- the string.
---
--- TODO: optimise, it's \( O(n) + O(n) \) where it could be \( O(n) \).
+{- |
+@'serialiseValue'@: \(O(n)\).
+
+@'deserialiseValue'@: \(O(n)\).
+
+The 'String' is (de)serialised as UTF-8.
+-}
 instance SerialiseValue String where
+  -- TODO: Optimise. The performance is \(O(n) + O(n)\) but it could be \(O(n)\).
   serialiseValue = serialiseValue . UTF8.fromString
   deserialiseValue = UTF8.toString . deserialiseValue
 
@@ -308,42 +422,64 @@ instance SerialiseValue String where
   ByteString
-------------------------------------------------------------------------------}
 
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of bytes
---
--- TODO: optimise, it's \( O(n) + O(n) \) where it could be \( O(n) \).
+{- |
+@'serialiseKey'@: \(O(n)\).
+
+@'deserialiseKey'@: \(O(n)\).
+-}
 instance SerialiseKey LBS.ByteString where
+  -- TODO: Optimise. The performance is \(O(n) + O(n)\) but it could be \(O(n)\).
   serialiseKey = serialiseKey . LBS.toStrict
   deserialiseKey = B.toLazyByteString . RB.builder
 
 instance SerialiseKeyOrderPreserving LBS.ByteString
 
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of bytes
---
--- TODO: optimise, it's \( O(n) + O(n) \) where it could be \( O(n) \).
+{- |
+@'serialiseValue'@: \(O(n)\).
+
+@'deserialiseValue'@: \(O(n)\).
+-}
 instance SerialiseValue LBS.ByteString where
+  -- TODO: Optimise. The performance is \(O(n) + O(n)\) but it could be \(O(n)\).
   serialiseValue = serialiseValue . LBS.toStrict
   deserialiseValue = B.toLazyByteString . RB.builder
 
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of bytes
+{- |
+@'serialiseKey'@: \(O(n)\).
+
+@'deserialiseKey'@: \(O(n)\).
+-}
 instance SerialiseKey BS.ByteString where
   serialiseKey = serialiseKey . SBS.toShort
   deserialiseKey = SBS.fromShort . deserialiseKey
 
 instance SerialiseKeyOrderPreserving BS.ByteString
 
--- | \( O(n) \) (de-)serialisation, where \(n\) is the number of bytes
+{- |
+@'serialiseValue'@: \(O(n)\).
+
+@'deserialiseValue'@: \(O(n)\).
+-}
 instance SerialiseValue BS.ByteString where
   serialiseValue = serialiseValue . SBS.toShort
   deserialiseValue = SBS.fromShort . deserialiseValue
 
--- | \( O(1) \) serialisation, \( O(n) \) deserialisation
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(n)\).
+-}
 instance SerialiseKey SBS.ShortByteString where
   serialiseKey = RB.fromShortByteString
   deserialiseKey = byteArrayToSBS . RB.force
 
 instance SerialiseKeyOrderPreserving SBS.ShortByteString
 
--- | \( O(1) \) serialisation, \( O(n) \) deserialisation
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(n)\).
+-}
 instance SerialiseValue SBS.ShortByteString where
   serialiseValue = RB.fromShortByteString
   deserialiseValue = byteArrayToSBS . RB.force
 
@@ -352,12 +488,20 @@ instance SerialiseValue SBS.ShortByteString where
   ByteArray
-------------------------------------------------------------------------------}
 
--- | \( O(1) \) serialisation, \( O(n) \) deserialisation
+{- |
+@'serialiseKey'@: \(O(1)\).
+
+@'deserialiseKey'@: \(O(n)\).
+-}
 instance SerialiseKey P.ByteArray where
   serialiseKey ba = RB.fromByteArray 0 (P.sizeofByteArray ba) ba
   deserialiseKey = RB.force
 
--- | \( O(1) \) serialisation, \( O(n) \) deserialisation
+{- |
+@'serialiseValue'@: \(O(1)\).
+
+@'deserialiseValue'@: \(O(n)\).
+-}
 instance SerialiseValue P.ByteArray where
   serialiseValue ba = RB.fromByteArray 0 (P.sizeofByteArray ba) ba
   deserialiseValue = RB.force
@@ -366,22 +510,24 @@ instance SerialiseValue P.ByteArray where
   Void
-------------------------------------------------------------------------------}
 
--- | The 'deserialiseValue' of this instance throws. (as does e.g. 'Word64'
--- instance on invalid input.)
---
--- This instance is useful for tables without blobs.
+{- |
+This instance is intended for tables without blobs.
+
+The implementation of 'deserialiseValue' throws an exception.
+-}
 instance SerialiseValue Void where
   serialiseValue = absurd
-  deserialiseValue = error "deserialiseValue: Void can not be deserialised"
+  deserialiseValue = error "deserialiseValue: cannot deserialise into Void"
 
 {-------------------------------------------------------------------------------
   Sum
-------------------------------------------------------------------------------}
 
--- | An instance for 'Sum' which is transparent to the serialisation of @a@.
---
--- Note: If you want to serialize @Sum a@ differently than @a@, then you should
--- create another @newtype@ over 'Sum' and define your alternative serialization.
+{- |
+An instance for 'Sum' which is transparent to the serialisation of the value type.
+
+__NOTE:__ If you want to serialise @'Sum' a@ differently from @a@, you must use another newtype wrapper.
+-}
 instance SerialiseValue a => SerialiseValue (Sum a) where
   serialiseValue (Sum v) = serialiseValue v
 
diff --git a/src/Database/LSMTree/Internal/Snapshot.hs b/src/Database/LSMTree/Internal/Snapshot.hs
index 2b9665d44..d2de362ff 100644
--- a/src/Database/LSMTree/Internal/Snapshot.hs
+++ b/src/Database/LSMTree/Internal/Snapshot.hs
@@ -139,7 +139,7 @@ instance NFData r => NFData (SnapLevel r) where
 -- a bit subtle.
 --
 -- The nominal debt does not need to be stored because it can be derived based
--- on the table's write buffer size (which is stored in the snapshot's
+-- on the table's write buffer capacity (which is stored in the snapshot's
 -- TableConfig), and on the level number that the merge is at (which also known
 -- from the snapshot structure).
 --
diff --git a/src/Database/LSMTree/Internal/Unsafe.hs b/src/Database/LSMTree/Internal/Unsafe.hs
index 97ceedb4a..3679a41c2 100644
--- a/src/Database/LSMTree/Internal/Unsafe.hs
+++ b/src/Database/LSMTree/Internal/Unsafe.hs
@@ -1719,7 +1719,7 @@ supplyUnionCredits resolve t credits = do
     Union mt _ -> do
       let conf = tableConfig t
       let AllocNumEntries x = confWriteBufferAlloc conf
-      -- We simply use the write buffer size as merge credit threshold, as
+      -- We simply use the write buffer capacity as merge credit threshold, as
       -- the regular level merges also do.
      -- TODO: pick a more suitable threshold or make configurable?
      let thresh = MR.CreditThreshold (MR.UnspentCredits (MergeCredits x))
diff --git a/test/Test/Util/FS.hs b/test/Test/Util/FS.hs
index d941c235c..384fd9b00 100644
--- a/test/Test/Util/FS.hs
+++ b/test/Test/Util/FS.hs
@@ -264,7 +264,7 @@ assertNumOpenHandles fs m =
 --
 -- Equality is checked as follows:
 -- * Infinite streams are equal: any infinity is as good as another infinity
--- * Finite streams are are checked for pointwise equality on their elements.
+-- * Finite streams are checked for pointwise equality on their elements.
 -- * Other streams are trivially unequal: they do not have matching finiteness
 --
 -- This approximate equality satisfies the __Reflexivity__, __Symmetry__,