diff --git a/README.md b/README.md
index d4edc0129..74090afb4 100644
--- a/README.md
+++ b/README.md
@@ -104,28 +104,28 @@ The documentation provides two measures of complexity:
 The complexities are described in terms of the following variables and
 constants:
 
-- The variable *n* refers to the number of *physical* table entries. A
+- The variable $`n`$ refers to the number of *physical* table entries. A
   *physical* table entry is any key–operation pair, e.g., `Insert k v`
   or `Delete k`, whereas a *logical* table entry is determined by all
-  physical entries with the same key. If the variable *n* is used to
+  physical entries with the same key. If the variable $`n`$ is used to
   describe the complexity of an operation that involves multiple
   tables, it refers to the sum of all table entries.
 
-- The variable *o* refers to the number of open tables and cursors in
+- The variable $`o`$ refers to the number of open tables and cursors in
   the session.
 
-- The variable *s* refers to the number of snapshots in the session.
+- The variable $`s`$ refers to the number of snapshots in the session.
 
-- The variable *b* usually refers to the size of a batch of
+- The variable $`b`$ usually refers to the size of a batch of
   inputs/outputs. Its precise meaning is explained for each occurrence.
 
-- The constant *B* refers to the size of the write buffer, which is a
-  configuration parameter.
+- The constant $`B`$ refers to the size of the write buffer, which is
+  determined by the `TableConfig` parameter `confWriteBufferAlloc`.
 
-- The constant *T* refers to the size ratio of the table, which is a
-  configuration parameter.
+- The constant $`T`$ refers to the size ratio of the table, which is
+  determined by the `TableConfig` parameter `confSizeRatio`.
 
-- The constant *P* refers to the the average number of key–value pairs
+- The constant $`P`$ refers to the average number of key–value pairs
   that fit in a page of memory.
 
 #### Disk I/O cost of operations
@@ -134,7 +134,9 @@ The following table summarises the cost of the operations on LSM-trees
 measured in the number of disk I/O operations. If the cost depends on
 the merge policy or merge schedule, then the table contains one entry
 for each relevant combination. Otherwise, the merge policy and/or merge
-schedule is listed as N/A.
+schedule is listed as N/A. The merge policy and merge schedule are
+determined by the `TableConfig` parameters `confMergePolicy` and
+`confMergeSchedule`.
 
@@ -273,7 +275,7 @@ schedule is listed as N/A.
-(\*The variable *b* refers to the number of entries retrieved by the
+(\*The variable $`b`$ refers to the number of entries retrieved by the
 range lookup.)
 
 TODO: Document the average-case behaviour of lookups.
@@ -281,31 +283,31 @@ TODO: Document the average-case behaviour of lookups.
 
 #### In-memory size of tables
 
 The in-memory size of an LSM-tree is described in terms of the variable
-*n*, which refers to the number of *physical* database entries. A
+$`n`$, which refers to the number of *physical* database entries. A
 *physical* database entry is any key–operation pair, e.g., `Insert k v`
 or `Delete k`, whereas a *logical* database entry is determined by all
 physical entries with the same key.
 
-The worst-case in-memory size of an LSM-tree is *O*(*n*).
+The worst-case in-memory size of an LSM-tree is $`O(n)`$.
 
-- The worst-case in-memory size of the write buffer is *O*(*B*).
+- The worst-case in-memory size of the write buffer is $`O(B)`$.
 
-  The maximum size of the write buffer on the write buffer allocation
-  strategy, which is determined by the `confWriteBufferAlloc` field of
-  `TableConfig`. Regardless of write buffer allocation strategy, the
-  size of the write buffer may never exceed 4GiB.
+  The maximum size of the write buffer depends on the write buffer
+  allocation strategy, which is determined by the `TableConfig`
+  parameter `confWriteBufferAlloc`. Regardless of write buffer
+  allocation strategy, the size of the write buffer may never exceed
+  4GiB.
 
   `AllocNumEntries maxEntries`
   The maximum size of the write buffer is the maximum number of entries
   multiplied by the average size of a key–operation pair.
 
-- The worst-case in-memory size of the Bloom filters is *O*(*n*).
+- The worst-case in-memory size of the Bloom filters is $`O(n)`$.
 
   The total in-memory size of all Bloom filters is the number of bits
   per physical entry multiplied by the number of physical entries. The
   required number of bits per physical entry is determined by the Bloom
-  filter allocation strategy, which is determined by the
-  `confBloomFilterAlloc` field of `TableConfig`.
+  filter allocation strategy, which is determined by the `TableConfig`
+  parameter `confBloomFilterAlloc`.
 
   `AllocFixed bitsPerPhysicalEntry`
   The number of bits per physical entry is specified as
@@ -318,20 +320,20 @@ The worst-case in-memory size of an LSM-tree is *O*(*n*).
   The false-positive rate scales exponentially with the number of bits
   per entry:
 
-  | False-positive rate | Bits per entry |
-  |---------------------|----------------|
-  | 1 in 10 |  ≈ 4.77 |
-  | 1 in 100 |  ≈ 9.85 |
-  | 1 in 1, 000 |  ≈ 15.79 |
-  | 1 in 10, 000 |  ≈ 22.58 |
-  | 1 in 100, 000 |  ≈ 30.22 |
+  | False-positive rate       | Bits per entry     |
+  |---------------------------|--------------------|
+  | $`1\text{ in }10`$        | $`\approx 4.77 `$  |
+  | $`1\text{ in }100`$       | $`\approx 9.85 `$  |
+  | $`1\text{ in }1{,}000`$   | $`\approx 15.79 `$ |
+  | $`1\text{ in }10{,}000`$  | $`\approx 22.58 `$ |
+  | $`1\text{ in }100{,}000`$ | $`\approx 30.22 `$ |
 
-- The worst-case in-memory size of the indexes is *O*(*n*).
+- The worst-case in-memory size of the indexes is $`O(n)`$.
 
   The total in-memory size of all indexes depends on the index type,
-  which is determined by the `confFencePointerIndex` field of
-  `TableConfig`. The in-memory size of the various indexes is described
-  in reference to the size of the database in [*memory
+  which is determined by the `TableConfig` parameter
+  `confFencePointerIndex`. The in-memory size of the various indexes is
+  described in reference to the size of the database in [*memory
   pages*](https://en.wikipedia.org/wiki/Page_%28computer_memory%29
   "https://en.wikipedia.org/wiki/Page_%28computer_memory%29").
 
   `OrdinaryIndex`
@@ -346,10 +348,146 @@ The worst-case in-memory size of an LSM-tree is *O*(*n*).
   a negligible amount of memory for tie breakers. The total in-memory
   size of all indexes is approximately 66 bits per memory page.
 
-The total size of an LSM-tree must not exceed 241 physical
+The total size of an LSM-tree must not exceed $`2^{41}`$ physical
 entries. Violation of this condition *is* checked and will throw a
 `TableTooLargeError`.
 
+#### Fine-tuning Table Layout
+
+The configuration parameters `confMergePolicy`, `confMergeSchedule`,
+`confSizeRatio`, and `confWriteBufferAlloc` affect the way in which the
+table organises its data. To understand what effect these parameters
+have, one must have a basic understanding of how an LSM-tree stores its
+data. An LSM-tree stores key–operation pairs, which pair a key with an
+operation such as an `Insert` with a value or a `Delete`. These
+key–operation pairs are organised into *runs*, which are sequences of
+key–operation pairs sorted by their key. Runs are organised into
+*levels*, which are unordered sequences of runs. Levels are organised
+hierarchically. Level 0 is kept in memory, and is referred to as the
+*write buffer*. All subsequent levels are stored on disk, with each run
+stored in its own file. The following shows an example LSM-tree layout,
+with each run as a boxed sequence of keys and each level as a row.
+
+``` math
+
+\begin{array}{l:l}
+\text{Level}
+&
+\text{Data}
+\\
+0
+&
+\fbox{\(\texttt{4}\,\_\)}
+\\
+1
+&
+\fbox{\(\texttt{1}\,\texttt{3}\)}
+\quad
+\fbox{\(\texttt{2}\,\texttt{7}\)}
+\\
+2
+&
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+\end{array}
+```
+
+The data in an LSM-tree is *partially sorted*: only the key–operation
+pairs within each run are sorted and deduplicated. As a rule of thumb,
+keeping more of the data sorted means lookup operations are faster but
+update operations are slower.
+
+The configuration parameters `confMergePolicy`, `confSizeRatio`, and
+`confWriteBufferAlloc` directly affect the table layout. Let $`B`$ refer
+to the value of `confWriteBufferAlloc`. Let $`T`$ refer to the value of
+`confSizeRatio`. The write buffer can contain at most $`B`$ entries.
+The size ratio $`T`$ determines the ratio between the maximum number of
+entries in adjacent levels. For instance, if $`B = 2`$ and $`T = 2`$, then
+
+``` math
+
+\begin{array}{l:l}
+\text{Level} & \text{Maximum Size}
+\\
+0 & B \cdot T^0 = 2
+\\
+1 & B \cdot T^1 = 4
+\\
+2 & B \cdot T^2 = 8
+\\
+\ell & B \cdot T^\ell
+\end{array}
+```
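+
+As a quick check on the table above, these capacities can be computed
+directly. The following is a minimal sketch; `levelCapacity` is a
+hypothetical helper, not part of this package's API:
+
+``` haskell
+-- Maximum number of physical entries in level l, for a write buffer
+-- that holds b entries and a size ratio of t.
+levelCapacity :: Int -> Int -> Int -> Int
+levelCapacity b t l = b * t ^ l
+
+-- levelCapacity 2 2 0 == 2
+-- levelCapacity 2 2 1 == 4
+-- levelCapacity 2 2 2 == 8
+```
+
+The merge policy `confMergePolicy` determines the number of runs per
+level. In a *tiering* LSM-tree, each level contains $`T`$ runs. In a
+*levelling* LSM-tree, each level contains one single run. The *lazy
+levelling* policy uses levelling only for the last level and uses
+tiering for all preceding levels. The previous example used lazy
+levelling. The following examples illustrate the different merge
+policies using the same data.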
+
+``` math
+
+\begin{array}{l:l:l:l}
+\text{Level}
+&
+\text{Tiering}
+&
+\text{Levelling}
+&
+\text{Lazy Levelling}
+\\
+0
+&
+\fbox{\(\texttt{4}\,\_\)}
+&
+\fbox{\(\texttt{4}\,\_\)}
+&
+\fbox{\(\texttt{4}\,\_\)}
+\\
+1
+&
+\fbox{\(\texttt{1}\,\texttt{3}\)}
+\quad
+\fbox{\(\texttt{2}\,\texttt{7}\)}
+&
+\fbox{\(\texttt{1}\,\texttt{2}\,\texttt{3}\,\texttt{7}\)}
+&
+\fbox{\(\texttt{1}\,\texttt{3}\)}
+\quad
+\fbox{\(\texttt{2}\,\texttt{7}\)}
+\\
+2
+&
+\fbox{\(\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\)}
+\quad
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{9}\)}
+&
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+&
+\fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+\end{array}
+```
+
+Tiering favours the performance of updates. Levelling favours the
+performance of lookups. Lazy levelling strikes a middle ground between
+tiering and levelling. It favours the performance of lookup operations
+for the oldest data and enables more deduplication, without the impact
+that full levelling has on update operations.
+
+Finally, `confMergeSchedule` affects the operation of *merges*. When the
+write buffer fills up, its contents are flushed to disk as a run and
+added to level 1. When some level fills up, its contents are flushed
+down to the next level. Eventually, as data is flushed down, runs must
+be merged. The `confMergeSchedule` determines whether these merged runs
+are fully sorted and deduplicated *immediately* (`OneShot`) or
+*incrementally* (`Incremental`). Using `Incremental` merges favours a
+consistent workload, at the cost of making lookup operations slightly
+more expensive, as they may be forced to do some merging work. With
+`OneShot` merges, lookup operations do not incur this cost; the
+trade-off is an inconsistent workload for update operations, as merges
+may cascade all the way down the LSM-tree.
+
 ### Implementation
 
 The implementation of LSM-trees in this package draws inspiration from:
diff --git a/bench/macro/lsm-tree-bench-wp8.hs b/bench/macro/lsm-tree-bench-wp8.hs
index 98517c8ac..71e99a357 100644
--- a/bench/macro/lsm-tree-bench-wp8.hs
+++ b/bench/macro/lsm-tree-bench-wp8.hs
@@ -227,7 +227,7 @@ cmdP = O.subparser $ mconcat
 setupOptsP :: O.Parser SetupOpts
 setupOptsP = pure SetupOpts
-  <*> O.option O.auto (O.long "bloom-filter-alloc" <> O.value LSM.defaultBloomFilterAlloc <> O.showDefault <> O.help "Bloom filter allocation method [AllocFixed n | AllocRequestFPR d]")
+  <*> O.option O.auto (O.long "bloom-filter-alloc" <> O.value (LSM.confBloomFilterAlloc LSM.defaultTableConfig) <> O.showDefault <> O.help "Bloom filter allocation method [AllocFixed n | AllocRequestFPR d]")
 
 runOptsP :: O.Parser RunOpts
 runOptsP = pure RunOpts
diff --git a/lsm-tree.cabal b/lsm-tree.cabal
index 7b1423d43..e94dfa642 100644
--- a/lsm-tree.cabal
+++ b/lsm-tree.cabal
@@ -71,8 +71,10 @@ description:
   * The variable \(s\) refers to the number of snapshots in the session.
   * The variable \(b\) usually refers to the size of a batch of inputs\/outputs. Its precise meaning is explained for each occurrence.
-  * The constant \(B\) refers to the size of the write buffer, which is a configuration parameter.
-  * The constant \(T\) refers to the size ratio of the table, which is a configuration parameter.
+  * The constant \(B\) refers to the size of the write buffer,
+  which is determined by the @TableConfig@ parameter @confWriteBufferAlloc@.
+  * The constant \(T\) refers to the size ratio of the table,
+  which is determined by the @TableConfig@ parameter @confSizeRatio@.
-  * The constant \(P\) refers to the the average number of key–value pairs that fit in a page of memory.
+  * The constant \(P\) refers to the average number of key–value pairs that fit in a page of memory.
 
   === Disk I\/O cost of operations #performance_time#
@@ -80,6 +82,7 @@ description:
   The following table summarises the cost of the operations on LSM-trees measured in the number of disk I\/O operations.
   If the cost depends on the merge policy or merge schedule, then the table contains one entry for each relevant combination.
   Otherwise, the merge policy and\/or merge schedule is listed as N\/A.
+  The merge policy and merge schedule are determined by the @TableConfig@ parameters @confMergePolicy@ and @confMergeSchedule@.
 
   +----------+------------------------+-----------------+-----------------+------------------------------------------------+
   | Resource | Operation              | Merge policy    | Merge schedule  | Cost in disk I\/O operations                   |
@@ -132,7 +135,8 @@ description:
 
   * The worst-case in-memory size of the write buffer is \(O(B)\).
 
-    The maximum size of the write buffer on the write buffer allocation strategy, which is determined by the @confWriteBufferAlloc@ field of @TableConfig@.
+    The maximum size of the write buffer depends on the write buffer allocation strategy,
+    which is determined by the @TableConfig@ parameter @confWriteBufferAlloc@.
     Regardless of write buffer allocation strategy, the size of the write buffer may never exceed 4GiB.
 
   [@AllocNumEntries maxEntries@]:
     The maximum size of the write buffer is the maximum number of entries multiplied by the average size of a key–operation pair.
@@ -141,7 +145,8 @@ description:
   * The worst-case in-memory size of the Bloom filters is \(O(n)\).
 
     The total in-memory size of all Bloom filters is the number of bits per physical entry multiplied by the number of physical entries.
-    The required number of bits per physical entry is determined by the Bloom filter allocation strategy, which is determined by the @confBloomFilterAlloc@ field of @TableConfig@.
+    The required number of bits per physical entry is determined by the Bloom filter allocation strategy,
+    which is determined by the @TableConfig@ parameter @confBloomFilterAlloc@.
 
   [@AllocFixed bitsPerPhysicalEntry@]:
     The number of bits per physical entry is specified as @bitsPerPhysicalEntry@.
@@ -166,7 +171,8 @@ description:
 
   * The worst-case in-memory size of the indexes is \(O(n)\).
 
-    The total in-memory size of all indexes depends on the index type, which is determined by the @confFencePointerIndex@ field of @TableConfig@.
+    The total in-memory size of all indexes depends on the index type,
+    which is determined by the @TableConfig@ parameter @confFencePointerIndex@.
     The in-memory size of the various indexes is described in reference to the size of the database in [/memory pages/](https://en.wikipedia.org/wiki/Page_%28computer_memory%29).
 
   [@OrdinaryIndex@]:
@@ -179,6 +185,127 @@ description:
 
   The total size of an LSM-tree must not exceed \(2^{41}\) physical entries.
   Violation of this condition /is/ checked and will throw a 'TableTooLargeError'.
 
+  === Fine-tuning Table Layout #fine_tuning#
+
+  The configuration parameters @confMergePolicy@, @confMergeSchedule@, @confSizeRatio@, and @confWriteBufferAlloc@ affect the way in which the table organises its data.
+  To understand what effect these parameters have, one must have a basic understanding of how an LSM-tree stores its data.
+  An LSM-tree stores key–operation pairs, which pair a key with an operation such as an @Insert@ with a value or a @Delete@.
+  These key–operation pairs are organised into /runs/, which are sequences of key–operation pairs sorted by their key.
+  Runs are organised into /levels/, which are unordered sequences of runs.
+  Levels are organised hierarchically.
+  Level 0 is kept in memory, and is referred to as the /write buffer/.
+  All subsequent levels are stored on disk, with each run stored in its own file.
+  The following shows an example LSM-tree layout, with each run as a boxed sequence of keys and each level as a row.
+
+  \[
+  \begin{array}{l:l}
+  \text{Level}
+  &
+  \text{Data}
+  \\
+  0
+  &
+  \fbox{\(\texttt{4}\,\_\)}
+  \\
+  1
+  &
+  \fbox{\(\texttt{1}\,\texttt{3}\)}
+  \quad
+  \fbox{\(\texttt{2}\,\texttt{7}\)}
+  \\
+  2
+  &
+  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+  \end{array}
+  \]
+
+  The data in an LSM-tree is /partially sorted/: only the key–operation pairs within each run are sorted and deduplicated.
+  As a rule of thumb, keeping more of the data sorted means lookup operations are faster but update operations are slower.
+
+  The configuration parameters @confMergePolicy@, @confSizeRatio@, and @confWriteBufferAlloc@ directly affect the table layout.
+  Let \(B\) refer to the value of @confWriteBufferAlloc@.
+  Let \(T\) refer to the value of @confSizeRatio@.
+  The write buffer can contain at most \(B\) entries.
+  The size ratio \(T\) determines the ratio between the maximum number of entries in adjacent levels.
+  For instance, if \(B = 2\) and \(T = 2\), then
+
+  \[
+  \begin{array}{l:l}
+  \text{Level} & \text{Maximum Size}
+  \\
+  0 & B \cdot T^0 = 2
+  \\
+  1 & B \cdot T^1 = 4
+  \\
+  2 & B \cdot T^2 = 8
+  \\
+  \ell & B \cdot T^\ell
+  \end{array}
+  \]
+
+  The merge policy @confMergePolicy@ determines the number of runs per level.
+  In a /tiering/ LSM-tree, each level contains \(T\) runs.
+  In a /levelling/ LSM-tree, each level contains one single run.
+  The /lazy levelling/ policy uses levelling only for the last level and uses tiering for all preceding levels.
+  The previous example used lazy levelling.
+  The following examples illustrate the different merge policies using the same data.
+
+  \[
+  \begin{array}{l:l:l:l}
+  \text{Level}
+  &
+  \text{Tiering}
+  &
+  \text{Levelling}
+  &
+  \text{Lazy Levelling}
+  \\
+  0
+  &
+  \fbox{\(\texttt{4}\,\_\)}
+  &
+  \fbox{\(\texttt{4}\,\_\)}
+  &
+  \fbox{\(\texttt{4}\,\_\)}
+  \\
+  1
+  &
+  \fbox{\(\texttt{1}\,\texttt{3}\)}
+  \quad
+  \fbox{\(\texttt{2}\,\texttt{7}\)}
+  &
+  \fbox{\(\texttt{1}\,\texttt{2}\,\texttt{3}\,\texttt{7}\)}
+  &
+  \fbox{\(\texttt{1}\,\texttt{3}\)}
+  \quad
+  \fbox{\(\texttt{2}\,\texttt{7}\)}
+  \\
+  2
+  &
+  \fbox{\(\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\)}
+  \quad
+  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{9}\)}
+  &
+  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+  &
+  \fbox{\(\texttt{0}\,\texttt{2}\,\texttt{3}\,\texttt{4}\,\texttt{5}\,\texttt{6}\,\texttt{8}\,\texttt{9}\)}
+  \end{array}
+  \]
+
+  Tiering favours the performance of updates.
+  Levelling favours the performance of lookups.
+  Lazy levelling strikes a middle ground between tiering and levelling.
+  It favours the performance of lookup operations for the oldest data and enables more deduplication,
+  without the impact that full levelling has on update operations.
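+
+  The layout parameters above can be combined by overriding the defaults.
+  The following is a sketch, not a recommendation; all names are defined in this package, and the buffer size is arbitrary.
+  It requests lazy levelling, a size ratio of four, and a larger write buffer:
+
+  > layoutConfig :: TableConfig
+  > layoutConfig = defaultTableConfig
+  >   { confMergePolicy      = LazyLevelling
+  >   , confSizeRatio        = Four
+  >   , confWriteBufferAlloc = AllocNumEntries 50000
+  >   }
+
+  Finally, @confMergeSchedule@ affects the operation of /merges/.
+  When the write buffer fills up, its contents are flushed to disk as a run and added to level 1.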
+  When some level fills up, its contents are flushed down to the next level.
+  Eventually, as data is flushed down, runs must be merged.
+  The @confMergeSchedule@ determines whether these merged runs are fully sorted and deduplicated /immediately/ (@OneShot@) or /incrementally/ (@Incremental@).
+  Using @Incremental@ merges favours a consistent workload, at the cost of making lookup operations slightly more expensive, as they may be forced to do some merging work.
+  With @OneShot@ merges, lookup operations do not incur this cost; the trade-off is an inconsistent workload for update operations, as merges may cascade all the way down the LSM-tree.
 
   == Implementation
 
 The implementation of LSM-trees in this package draws inspiration from:
diff --git a/scripts/generate-readme.hs b/scripts/generate-readme.hs
index 743203064..4fdec09fb 100755
--- a/scripts/generate-readme.hs
+++ b/scripts/generate-readme.hs
@@ -7,7 +7,8 @@ build-depends:
     , pandoc ^>=3.6.4
     , text >=2.1
 -}
-{-# LANGUAGE LambdaCase #-}
+{-# LANGUAGE LambdaCase        #-}
+{-# LANGUAGE OverloadedStrings #-}
 
 module Main (main) where
 
@@ -22,7 +23,7 @@ import qualified Distribution.Types.PackageDescription as PackageDescription
 import           Distribution.Utils.ShortText (fromShortText)
 import           System.IO (hPutStrLn, stderr)
 import           Text.Pandoc (runIOorExplode)
-import           Text.Pandoc.Extensions (githubMarkdownExtensions)
+import           Text.Pandoc.Extensions (getDefaultExtensions)
 import           Text.Pandoc.Options (ReaderOptions (..), WriterOptions (..), def)
 import           Text.Pandoc.Readers (readHaddock)
 
@@ -45,6 +46,6 @@ main = do
   runIOorExplode $ do
     doc1 <- readHaddock def description
     let doc2 = headerShift 1 doc1
-    writeMarkdown def{writerExtensions = githubMarkdownExtensions} doc2
+    writeMarkdown def{writerExtensions = getDefaultExtensions "gfm"} doc2
   let readme = T.unlines [readmeHeaderContent, body]
   TIO.writeFile "README.md" readme
diff --git a/src/Database/LSMTree.hs b/src/Database/LSMTree.hs
index f881e3d82..272a9a766 100644
--- a/src/Database/LSMTree.hs
+++ b/src/Database/LSMTree.hs
@@ -113,13 +113,12 @@ module Database.LSMTree (
     ),
     defaultTableConfig,
     MergePolicy (LazyLevelling),
+    MergeSchedule (..),
     SizeRatio (Four),
     WriteBufferAlloc (AllocNumEntries),
     BloomFilterAlloc (AllocFixed, AllocRequestFPR),
-    defaultBloomFilterAlloc,
     FencePointerIndexType (OrdinaryIndex, CompactIndex),
     DiskCachePolicy (..),
-    MergeSchedule (..),
 
     -- ** Table Configuration Overrides #table_configuration_overrides#
     OverrideDiskCachePolicy (..),
@@ -205,8 +204,7 @@ import           Database.LSMTree.Internal.Config
                      (DiskCachePolicy (..), FencePointerIndexType (..),
                      MergePolicy (..), MergeSchedule (..), SizeRatio (..),
                      TableConfig (..), WriteBufferAlloc (..),
-                     defaultBloomFilterAlloc, defaultTableConfig,
-                     serialiseKeyMinimalSize)
+                     defaultTableConfig, serialiseKeyMinimalSize)
 import           Database.LSMTree.Internal.Config.Override
                      (OverrideDiskCachePolicy (..))
 import qualified Database.LSMTree.Internal.Entry as Entry
diff --git a/src/Database/LSMTree/Internal/Config.hs b/src/Database/LSMTree/Internal/Config.hs
index 29a405e8c..07dd3f2d4 100644
--- a/src/Database/LSMTree/Internal/Config.hs
+++ b/src/Database/LSMTree/Internal/Config.hs
@@ -16,7 +16,6 @@ module Database.LSMTree.Internal.Config (
   , WriteBufferAlloc (..)
     -- * Bloom filter allocation
   , BloomFilterAlloc (..)
-  , defaultBloomFilterAlloc
   , bloomFilterAllocForLevel
     -- * Fence pointer index
   , FencePointerIndexType (..)
@@ -27,7 +26,6 @@ module Database.LSMTree.Internal.Config (
   , diskCachePolicyForLevel
     -- * Merge schedule
   , MergeSchedule (..)
-  , defaultMergeSchedule
   ) where
 
 import           Control.DeepSeq (NFData (..))
@@ -49,26 +47,47 @@ newtype LevelNo = LevelNo Int
   Table configuration
-------------------------------------------------------------------------------}
 
--- | Table configuration parameters, including LSM tree tuning parameters.
---
--- Some config options are fixed (for now):
---
--- * Merge policy: Tiering
---
--- * Size ratio: 4
+{- |
+A collection of configuration parameters for tables, which can be used to tune the performance of a table.
+To construct a 'TableConfig', modify the 'defaultTableConfig', which defines reasonable defaults for all parameters.
+For an overview of the performance implications of the table configuration, see the [Performance](../#performance) section in the package description.
+
+Each configuration parameter is associated with its own type.
+Detailed discussion of the use of each parameter can be found in the documentation for its associated type.
+
+[@confMergePolicy :: t'MergePolicy'@]
+    The merge policy determines how the table manages its data and affects the disk I\/O cost of some operations.
+    This parameter is explicitly referenced in the documentation of those operations it affects.
+[@confMergeSchedule :: t'MergeSchedule'@]
+    The merge schedule determines how the table manages its data and affects the disk I\/O cost of some operations.
+    This parameter is explicitly referenced in the documentation of those operations it affects.
+[@confSizeRatio :: t'SizeRatio'@]
+    The size ratio determines how the table manages its data and affects the disk I\/O cost of some operations.
+    This parameter is referred to as \(T\) in the disk I\/O cost of operations.
+[@confWriteBufferAlloc :: t'WriteBufferAlloc'@]
+    The write buffer allocation strategy determines the maximum size of the in-memory write buffer and affects the disk I\/O cost of some operations.
+    This parameter is referred to as \(B\) in the disk I\/O cost of operations.
+    Irrespective of this parameter, the write buffer size cannot exceed 4GiB.
+[@confBloomFilterAlloc :: t'BloomFilterAlloc'@]
+    The Bloom filter allocation strategy determines the number of bits per physical entry allocated for the Bloom filters.
+    This affects the in-memory size of tables.
+    See [In-memory size of tables](../#performance_size).
+[@confFencePointerIndex :: t'FencePointerIndexType'@]
+    The fence pointer index type determines the type of the fence pointer indexes.
+    This affects the in-memory size of tables.
+    See [In-memory size of tables](../#performance_size).
+    Some values may impose additional constraints on the type of table keys.
+[@confDiskCachePolicy :: t'DiskCachePolicy'@]
+    The disk cache policy determines the policy for caching data from disk in memory.
+    This may affect the performance of lookup operations.
+-}
 data TableConfig = TableConfig {
     confMergePolicy       :: !MergePolicy
   , confMergeSchedule     :: !MergeSchedule
-    -- Size ratio between the capacities of adjacent levels.
   , confSizeRatio         :: !SizeRatio
-    -- | Total number of bytes that the write buffer can use.
-    --
-    -- The maximum is 4GiB, which should be more than enough for realistic
-    -- applications.
   , confWriteBufferAlloc  :: !WriteBufferAlloc
   , confBloomFilterAlloc  :: !BloomFilterAlloc
   , confFencePointerIndex :: !FencePointerIndexType
-    -- | The policy for caching key\/value data from disk in memory.
   , confDiskCachePolicy   :: !DiskCachePolicy
   }
   deriving stock (Show, Eq)
@@ -77,19 +96,31 @@ instance NFData TableConfig where
   rnf (TableConfig a b c d e f g) =
       rnf a `seq` rnf b `seq` rnf c `seq` rnf d `seq` rnf e `seq` rnf f `seq` rnf g
 
--- | A reasonable default 'TableConfig'.
+-- | The 'defaultTableConfig' defines reasonable defaults for all 'TableConfig' parameters.
 --
--- This uses a write buffer with up to 20,000 elements and a generous amount of
--- memory for Bloom filters (FPR of 1%).
+-- >>> confMergePolicy defaultTableConfig
+-- LazyLevelling
+-- >>> confMergeSchedule defaultTableConfig
+-- Incremental
+-- >>> confSizeRatio defaultTableConfig
+-- Four
+-- >>> confWriteBufferAlloc defaultTableConfig
+-- AllocNumEntries 20000
+-- >>> confBloomFilterAlloc defaultTableConfig
+-- AllocFixed 10
+-- >>> confFencePointerIndex defaultTableConfig
+-- OrdinaryIndex
+-- >>> confDiskCachePolicy defaultTableConfig
+-- DiskCacheAll
 --
 defaultTableConfig :: TableConfig
 defaultTableConfig =
     TableConfig {
       confMergePolicy       = LazyLevelling
-    , confMergeSchedule     = defaultMergeSchedule
+    , confMergeSchedule     = Incremental
     , confSizeRatio         = Four
     , confWriteBufferAlloc  = AllocNumEntries 20_000
-    , confBloomFilterAlloc  = defaultBloomFilterAlloc
+    , confBloomFilterAlloc  = AllocFixed 10
     , confFencePointerIndex = OrdinaryIndex
     , confDiskCachePolicy   = DiskCacheAll
     }
@@ -108,17 +139,62 @@ runParamsForLevel conf@TableConfig {..} levelNo =
   Merge policy
-------------------------------------------------------------------------------}
 
+{- |
+An LSM-tree stores its data in sorted chunks of various sizes.
+The merge policy determines /when/ those chunks are merged.
+This affects the performance of lookups and updates.
+If chunks are merged earlier, then more of the data is sorted, which means lookups are faster.
+However, merging chunks earlier requires updates to do more work.
+
+Commonly, the two merge policies supported by LSM-trees are /tiering/ and /levelling/.
+Tiering keeps less of the table sorted, which leads to more efficient updates.
+Levelling keeps more of the table sorted, which leads to more efficient lookups.
+
+Currently, this package only supports the /lazy levelling/ merge policy.
+See v'LazyLevelling'.
+-}
 data MergePolicy =
-    -- | Use tiering on intermediate levels, and levelling on the last level.
-    -- This makes it easier for delete operations to disappear on the last
-    -- level.
+    {- |
+    The /lazy levelling/ merge policy uses tiering on the freshest half of the data and levelling on the oldest half of the data.
+    -}
     LazyLevelling
-    -- TODO: add other merge policies, like tiering and levelling.
   deriving stock (Eq, Show)
 
 instance NFData MergePolicy where
   rnf LazyLevelling = ()
 
+{-------------------------------------------------------------------------------
+  Merge schedule
+-------------------------------------------------------------------------------}
+
+-- | A configuration option that determines how merges are stepped to
+-- completion. This does not affect the amount of work that is done by merges,
+-- only how the work is spread out over time.
+data MergeSchedule =
+    -- | Complete merges immediately when started.
+    --
+    -- The 'OneShot' option will make the merging algorithm perform /big/ batches
+    -- of work in one go, so intermittent slow-downs can be expected. For use
+    -- cases where unresponsiveness is unacceptable, e.g. in real-time systems,
+    -- use 'Incremental' instead.
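+    --
+    -- For example, the following sketch (all names below are defined in this
+    -- module) configures a table to complete merges immediately:
+    --
+    -- > oneShotConfig :: TableConfig
+    -- > oneShotConfig = defaultTableConfig { confMergeSchedule = OneShot }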
+    OneShot
+    -- | Schedule merges for incremental construction, and step the merge when
+    -- updates are performed on a table.
+    --
+    -- The 'Incremental' option spreads out merging work over time. More
+    -- specifically, updates to a table can cause a /small/ batch of merge work
+    -- to be performed. The scheduling of these batches is designed such that
+    -- merges are fully completed in time for when new merges are started on the
+    -- same level.
+  | Incremental
+  deriving stock (Eq, Show)
+
+instance NFData MergeSchedule where
+  rnf OneShot     = ()
+  rnf Incremental = ()
+
 {-------------------------------------------------------------------------------
   Size ratio
-------------------------------------------------------------------------------}
@@ -173,9 +249,6 @@ instance NFData BloomFilterAlloc where
   rnf (AllocFixed n)        = rnf n
   rnf (AllocRequestFPR fpr) = rnf fpr
 
-defaultBloomFilterAlloc :: BloomFilterAlloc
-defaultBloomFilterAlloc = AllocFixed 10
-
 bloomFilterAllocForLevel :: TableConfig -> RunLevelNo -> RunBloomFilterAlloc
 bloomFilterAllocForLevel conf _levelNo =
     case confBloomFilterAlloc conf of
@@ -287,40 +360,3 @@ diskCachePolicyForLevel policy levelNo =
     RegularLevel l | l <= LevelNo n -> CacheRunData
                    | otherwise     -> NoCacheRunData
     UnionLevel                     -> NoCacheRunData
-
-{-------------------------------------------------------------------------------
-  Merge schedule
--------------------------------------------------------------------------------}
-
--- | A configuration option that determines how merges are stepped to
--- completion. This does not affect the amount of work that is done by merges,
--- only how the work is spread out over time.
-data MergeSchedule =
-    -- | Complete merges immediately when started.
-    --
-    -- The 'OneShot' option will make the merging algorithm perform /big/ batches
-    -- of work in one go, so intermittent slow-downs can be expected. For use
-    -- cases where unresponsiveness is unacceptable, e.g. in real-time systems,
-    -- use 'Incremental' instead.
-    OneShot
-    -- | Schedule merges for incremental construction, and step the merge when
-    -- updates are performed on a table.
-    --
-    -- The 'Incremental' option spreads out merging work over time. More
-    -- specifically, updates to a table can cause a /small/ batch of merge work
-    -- to be performed. The scheduling of these batches is designed such that
-    -- merges are fully completed in time for when new merges are started on the
-    -- same level.
-  | Incremental
-  deriving stock (Eq, Show)
-
-instance NFData MergeSchedule where
-  rnf OneShot     = ()
-  rnf Incremental = ()
-
--- | The default 'MergeSchedule'.
---
--- >>> defaultMergeSchedule
--- Incremental
-defaultMergeSchedule :: MergeSchedule
-defaultMergeSchedule = Incremental