
Final report #773


Open: wants to merge 21 commits into main from dcoutts/final-report

Conversation

@dcoutts (Collaborator) commented Jul 1, 2025

The final write-up for the project.

We should probably also include PDFs of some of the references (the preceding reports), and it'd probably be helpful to include a generated PDF of the final report and the integration notes.

@jorisdral force-pushed the dcoutts/final-report branch from e23d0e6 to 09e573a on July 1, 2025 at 10:45
@jeltsch force-pushed the dcoutts/final-report branch from ba1d4ed to a4b291e on July 1, 2025 at 15:34
@jorisdral (Collaborator) left a comment

Looks very good! I have a bunch of comments, but they're mostly just clarifications.

large amount of noise.

Overall, our judgement is that this mitigation is practically sufficient, but
it merits a securit review from others who may make a different judgement. It

"securit" -> "security"

Comment on lines +186 to +189
Note that a per-run or per-table hash salt would incur non-trivial costs,
because it would reduce the sharing available in bulk Bloom filter lookups
(looking up N keys in M filters). The Bloom filter lookup is a performance
sensitive part of the overall database implementation.

Just to make it extra clear

Suggested change
- Note that a per-run or per-table hash salt would incur non-trivial costs,
- because it would reduce the sharing available in bulk Bloom filter lookups
- (looking up N keys in M filters). The Bloom filter lookup is a performance
- sensitive part of the overall database implementation.
+ Note that a per-run or per-table hash salt would incur non-trivial costs,
+ because it would reduce the sharing available in bulk Bloom filter lookups
+ (looking up N keys in M filters). Hence, we store a hash salt on a per-session basis. The Bloom filter lookup is a performance
+ sensitive part of the overall database implementation.
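For intuition, here is a minimal sketch of the sharing at stake. All names and the hash function are hypothetical stand-ins, not the library's actual API: with one session-wide salt, each of the N keys is hashed once and the result is reused across all M filters, whereas per-run or per-table salts would force a fresh hash per key-filter pair.

```haskell
import Data.Bits (xor)
import Data.Word (Word64)

-- Hypothetical stand-ins; the real lsm-tree types and hash differ.
type Salt = Word64
type Key  = Word64
data Filter = Filter  -- contents elided

hashKey :: Salt -> Key -> Word64
hashKey salt k = (k `xor` salt) * 0x9e3779b97f4a7c15  -- toy mixer

probe :: Filter -> Word64 -> Bool
probe _ _ = True  -- placeholder membership test

-- One session-wide salt: the N keys are hashed once (N hash
-- computations) and each hash is reused across all M filters.
bulkLookupShared :: Salt -> [Filter] -> [Key] -> [[Bool]]
bulkLookupShared salt fs ks =
  [ [ probe f h | f <- fs ] | h <- map (hashKey salt) ks ]

-- Per-filter salts: every (key, filter) pair needs its own hash,
-- i.e. N * M hash computations; this is the lost sharing.
bulkLookupPerFilter :: [(Salt, Filter)] -> [Key] -> [[Bool]]
bulkLookupPerFilter sfs ks =
  [ [ probe f (hashKey s k) | (s, f) <- sfs ] | k <- ks ]
```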

issued: 2021
title: "Storing the Cardano ledger state on disk: analysis and design options"
type: report
URL: https://github.com/IntersectMBO/lsm-tree/blob/master/doc/final-report/references/utxo-db.pdf

Suggested change
- URL: https://github.com/IntersectMBO/lsm-tree/blob/master/doc/final-report/references/utxo-db.pdf
+ URL: https://github.com/IntersectMBO/lsm-tree/blob/main/doc/final-report/references/utxo-db.pdf


Ah, I spotted this as well, independently of you. It's already fixed in the meantime. 🙂

issued: 2021
title: "Storing the Cardano ledger state on disk: API design concepts"
type: report
URL: https://github.com/IntersectMBO/lsm-tree/blob/master/doc/final-report/references/utxo-db-api.pdf

Suggested change
- URL: https://github.com/IntersectMBO/lsm-tree/blob/master/doc/final-report/references/utxo-db-api.pdf
+ URL: https://github.com/IntersectMBO/lsm-tree/blob/main/doc/final-report/references/utxo-db-api.pdf


Ditto.

issued: 2023
title: "Storing the Cardano ledger state on disk: requirements for high performance backend"
type: report
URL: https://github.com/IntersectMBO/lsm-tree/blob/master/doc/final-report/references/utxo-db-lsm.pdf

Suggested change
- URL: https://github.com/IntersectMBO/lsm-tree/blob/master/doc/final-report/references/utxo-db-lsm.pdf
+ URL: https://github.com/IntersectMBO/lsm-tree/blob/main/doc/final-report/references/utxo-db-lsm.pdf


Ditto.

Comment on lines 1388 to 1389
These benchmarks are implemented using `criterion`, which performs multiple
runs and combines the results in a sound statistical manner. The reported

Should we include an invocation for readers to run the benchmark themselves?
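If we do, something as small as the following would suffice. This is only a sketch with made-up names: `sumTo` is a dummy stand-in for the code under test, and the benchmark target name in the invocation is hypothetical; the `criterion` API (`defaultMain`, `bench`, `whnf`) and the `cabal bench` command are standard.

```haskell
import Criterion.Main (bench, defaultMain, whnf)
import Data.List (foldl')

-- Dummy stand-in for the operation under test.
sumTo :: Int -> Int
sumTo n = foldl' (+) 0 [1 .. n]

main :: IO ()
main = defaultMain
  [ bench "sumTo/10k"  (whnf sumTo 10000)
  , bench "sumTo/100k" (whnf sumTo 100000)
  ]

-- Hypothetical invocation, assuming a benchmark target named "micro":
--   cabal bench micro --benchmark-options="--output report.html"
```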

Comment on lines +1248 to +1255
* The i7i.xlarge machine has excellent performance with one core, exceeding the
100 k target already in this setting. We know it has higher one-core
performance than i8g.xlarge, with an advantage of approximately 40 % in the
Bloom filter micro-benchmark. Nevertheless, it is probably limited by CPU,
not by SSD, since its one-core IOPS value is so high (350 k). We know its
IOPS value scales negatively when going to two cores, down to 210 k
aggregated across both cores. This is probably the cause of its poor speedup:
the machine goes from being limited by CPU to being limited by SSD.

The last sentence confused me. It's not clear to me from reading the rest of the report that we get worse performance when going from 1 core to 2 cores because of SSD limits. That seems to be a new claim that we didn't really talk about in section 4.3. For reference, there we say:

The IOPS scores scale negatively when adding more cores. This is actually quite bizarre: the
more cores used to submit I/O, the lower the aggregate IOPS that is measured. Apparently,
this is not the result of a measurement artefact but shows a real effect, and it is opposite
to what happens with physical hardware. Running fio on the i8g.xlarge machine with
4 cores results in 175 k IOPS (which is near to the rated 150 k IOPS), showing that the
negative scaling continues beyond two cores.

This only says that IOPS go down if we add more cores, but not why we think that happened.

Comment on lines +1340 to +1343
* During the *run* phase, the actual benchmark is performed based on the
database created during the setup phase. The user can decide on the number of
batches, the disk cache policy and the benchmark mode: serial (the default),
pipelined or lookup only.

Same here: there are more things that are configurable. Maybe we should say: "amongst other options, the user can decide on ..."

BTW, lookup-only and serial vs. pipelined can be set independently, but the code ignores lookup-only in the pipelined case. Maybe we should make these three modes part of the same option instead of separate flags, as in the sketch below.
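Something along these lines, perhaps; a sketch using optparse-applicative with hypothetical flag and type names, not the benchmark's current CLI:

```haskell
import Options.Applicative

-- One closed set of modes instead of independent --pipelined and
-- --lookup-only flags that can contradict each other.
data BenchmarkMode = Serial | Pipelined | LookupOnly
  deriving (Show, Eq)

benchmarkMode :: Parser BenchmarkMode
benchmarkMode =
      flag' Pipelined  (long "pipelined"   <> help "Overlap batch execution")
  <|> flag' LookupOnly (long "lookup-only" <> help "Perform only lookups")
  <|> pure Serial      -- the default, as in the report

main :: IO ()
main = do
  mode <- execParser (info (benchmarkMode <**> helper) fullDesc)
  print mode
```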

ops/sec, much more than the circa 86 k ops/sec we get with the 100 M entry
table. The reason is that for smaller tables there is less merging work to do.

### Reproducing the results
@jorisdral (Collaborator) commented Jul 2, 2025

Do we have a commit / tag that we used for our benchmarks, like the alpha tag for the functional requirements? That would help with reproducibility, and we can tell users to check out that commit/tag if they want to run the benchmarks themselves.

Should we say lsm-tree-0.1.0.0 once it is released?

reasonable running times, we use 10 k batches of operations for the serial mode
and 100 k batches for the pipelined mode.

### Serial and pipelined execution

If I were reading this for the first time without prior knowledge about pipelining, I think I would understand the general idea of overlapping operations, but I fear I wouldn't fully understand the design as it is realised in our benchmark, because the current description is still rather high-level. Maybe we can direct readers to the utxo-db-api document, which has a more formal explanation of pipelining and the dataflow; that might be helpful. Or maybe it's fine if the user has only a general intuition about pipelining, but then it might still be helpful to refer to the more formal description in the utxo-db-api doc.
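For what it's worth, the general intuition does fit in a few lines. The toy sketch below (using async; emphatically not the benchmark's actual dataflow, which the utxo-db-api document specifies precisely) just shows batch n+1's lookups being issued while batch n's results are still being processed:

```haskell
import Control.Concurrent.Async (withAsync, wait)

-- Toy two-stage pipeline: overlap issuing the next batch's lookups
-- with processing the current batch's results. Types and staging are
-- simplified; see the utxo-db-api document for the real design.
pipeline :: (batch -> IO results)   -- issue lookups for one batch
         -> (results -> IO ())      -- process results, apply updates
         -> [batch] -> IO ()
pipeline issue process batches = case batches of
  []     -> pure ()
  b : bs -> issue b >>= go bs
  where
    go []        r = process r
    go (b : bs') r = withAsync (issue b) $ \next -> do
      process r            -- runs concurrently with the next issue
      wait next >>= go bs'
```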


* 64 bit as the size of keys and values
* 80,000 elements (generated using a PRNG)
* Addition as the update operation

"update" is too overloaded a term. Combined or resolved is better
