Final report #773

Open
wants to merge 21 commits into
base: main
273d0e3
Add (hopefully) final draft of the final report
dcoutts Jul 1, 2025
c8060a3
Apply the same formatting to the integration notes as final report
dcoutts Jul 1, 2025
09e573a
section in integration notes on security of hash based data structures
dcoutts Jul 1, 2025
a4b291e
Add the previous reports as references
dcoutts Jul 1, 2025
dcc18b8
Add references to the specification items with the targets
jeltsch Jul 1, 2025
4db6e3b
Polish a bit
jeltsch Jul 1, 2025
90fe4be
Revise the part on meeting the memory targets
jeltsch Jul 1, 2025
cea37c3
Elaborate on interactions with Mithril in integration notes
dcoutts Jul 2, 2025
a5f543b
Tweak final report title, and add subtitle
dcoutts Jul 2, 2025
f77131f
Minor edits to final report introduction
dcoutts Jul 2, 2025
592a11e
Final report: add a changelog
jorisdral Jul 2, 2025
d45fd47
Revise the part on the upsert benchmarks
jeltsch Jul 2, 2025
3a43516
Restore 80-columns layout for paragraphs with citations
jeltsch Jul 3, 2025
ed87915
Improve the formatting of the metadata source
jeltsch Jul 3, 2025
1918726
Add @dcoutts to references as integration notes co-author
jeltsch Jul 3, 2025
ad91509
Restore spaces dropped by bibliography style
jeltsch Jul 3, 2025
25e2c7d
Change `master` to `main` in GitHub URLs
jeltsch Jul 3, 2025
7e96f24
Fix the URL of the API documentation
jeltsch Jul 3, 2025
d0b1533
Add (no-break) spaces before citation references
jeltsch Jul 3, 2025
1201acc
Slightly improve the beginning of the introduction
jeltsch Jul 3, 2025
9e93f86
Add integration notes section on possible file system incompatibility…
dcoutts Jul 4, 2025
1,478 changes: 1,478 additions & 0 deletions doc/final-report/final-report.md

Large diffs are not rendered by default.

17 changes: 17 additions & 0 deletions doc/final-report/ieee-software.csl
@@ -0,0 +1,17 @@
<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" default-locale="en-US">
<!-- Generated with https://github.com/citation-style-language/utilities/tree/master/generate_dependent_styles/data/ieee -->
<info>
<title>IEEE Software</title>
<id>http://www.zotero.org/styles/ieee-software</id>
<link href="http://www.zotero.org/styles/ieee-software" rel="self"/>
<link href="http://www.zotero.org/styles/ieee" rel="independent-parent"/>
<link href="http://ieeexplore.ieee.org/servlet/opac?punumber=52" rel="documentation"/>
<category citation-format="numeric"/>
<category field="engineering"/>
<category field="communications"/>
<issn>0740-7459</issn>
<updated>2014-05-15T02:20:32+00:00</updated>
<rights license="http://creativecommons.org/licenses/by-sa/3.0/">This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License</rights>
</info>
</style>
117 changes: 107 additions & 10 deletions doc/final-report/integration-notes.md
@@ -1,9 +1,25 @@
# Storing the Cardano ledger state on disk: integration notes for high-performance backend

Authors: Joris Dral, Wolfgang Jeltsch
Date: May 2025

## Sessions
---
title: "Storing the Cardano ledger state on disk:
integration notes for high-performance backend"
author:
- Duncan Coutts
- Joris Dral
- Wolfgang Jeltsch
date: May 2025

toc: true
numbersections: true
classoption:
- 11pt
- a4paper
geometry:
- margin=2.5cm
header-includes:
- \usepackage{microtype}
- \usepackage{mathpazo}
---

# Sessions

Creating new empty tables or opening tables from snapshots requires a `Session`.
The session can be created using `openSession`, which has to be done in the
@@ -15,7 +31,7 @@ Closing the session will automatically close all tables, but this is only
intended to be a backup functionality: ideally the user closes all tables
manually.

## The compact index
# The compact index

The compact index is a memory-efficient data structure that maintains serialised
keys. Rather than storing full keys, it only stores the first 64 bits of each
@@ -60,7 +76,7 @@ keys is as good as any other total ordering. However, the consensus layer will
face the situation where a range lookup or a cursor read returns key–value pairs
slightly out of order. Currently, we do not expect this to cause problems.

## Snapshots
# Snapshots

Snapshots currently require support for hard links. This means that on Windows
the library only works when using NTFS. Support for other file systems could be
@@ -84,7 +100,7 @@ a cheaper non-SSD drive. This feature was unfortunately not anticipated in the
project specification and so is not currently included. As discussed above, it
could be added with some additional work.

## Value resolving
# Value resolving

When instantiating the `ResolveValue` class, it is usually advisable to
implement `resolveValue` such that it works directly on the serialised values.
@@ -94,7 +110,7 @@ function is intended to work like `(+)`, then `resolveValue` could add the raw
bytes of the serialised values and would likely achieve better performance this
way.
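
As a minimal sketch of this idea, assume that values are fixed-width unsigned
integers serialised in little-endian byte order (an assumption made only for
illustration; the notes do not prescribe an encoding). The hypothetical helper
`addSerialisedLE`, which is not part of `lsm-tree`, adds two such values byte
by byte with carry, without first decoding them into a Haskell number. A
`resolveValue` implementation could delegate to a function of this shape.

```haskell
import qualified Data.ByteString as BS
import Data.Bits (shiftR, (.&.))
import Data.List (mapAccumL)
import Data.Word (Word8)

-- | Hypothetical helper (not part of lsm-tree): add two equally long,
-- little-endian serialised unsigned integers directly on their bytes.
-- Overflow wraps around, like (+) on a fixed-width word type.
-- If the inputs differ in length, the result is truncated to the shorter one.
addSerialisedLE :: BS.ByteString -> BS.ByteString -> BS.ByteString
addSerialisedLE x y = BS.pack (snd (mapAccumL step 0 (BS.zip x y)))
  where
    -- The accumulator is the carry (0 or 1) into the next, more significant
    -- byte; the final carry is dropped, which gives the wrap-around.
    step :: Int -> (Word8, Word8) -> (Int, Word8)
    step carry (a, b) =
      let s = fromIntegral a + fromIntegral b + carry
       in (s `shiftR` 8, fromIntegral (s .&. 0xff))
```

For example, adding the two-byte encodings of 255 and 1 yields the encoding of
256: `addSerialisedLE (BS.pack [255, 0]) (BS.pack [1, 0])` equals
`BS.pack [0, 1]`.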

## `io-classes` incompatibility
# `io-classes` incompatibility

At the time of writing, various packages in the `cardano-node` stack depend on
`io-classes-1.5` and the 1.5-versions of its daughter packages, like
@@ -124,3 +140,84 @@ It is known to us that the `ouroboros-consensus` stack has not been updated to
https://github.com/IntersectMBO/ouroboros-network/pull/4951. We would advise
fixing this Nix-related bug rather than downgrading `lsm-tree`’s dependency on
`io-classes` to version 1.5.

# Security of hash-based data structures

Data structures based on hashing have to be considered carefully when they may
be used with untrusted data. If an attacker can control the keys in a hash
table, for example, they may be able to arrange for all their keys to collide,
which can degrade operations on the table from expected constant time towards
linear time. This is why the Haskell Cardano node implementation does not use
hash tables and uses ordering-based containers (such as `Data.Map`) instead.

The Bloom filters in an LSM tree are hash-based data structures. For performance
they do not use cryptographic hashes. So in principle it would be possible for
an attacker to arrange that all their keys hash to a common set of bits. This
would be a potential problem for the UTxO and other stake-related tables in
Cardano, since it is the users that get to pick (with only modest grinding
difficulty) their UTxO keys (TxIn) and stake keys (verification key hashes). It
would be even more serious if an attacker could grind their set of malicious
keys locally, in the knowledge that the same set of keys will hash the same way
on all other Cardano nodes.

This issue was not considered in the original project specification, but we
have since analysed it and included a mitigation: on the initial creation of an
`lsm-tree` session, a random salt is generated (from `/dev/random`) and stored
persistently as part of the session. This salt is then used as part of the
Bloom filter hashing for all runs in all tables in the session.
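
To illustrate how a salt can enter Bloom filter hashing, here is a minimal
sketch. It uses FNV-1a only because it is short; it is not the hash function
that `lsm-tree` uses, and the names `saltedHash` and `bitPositions` are
hypothetical. The salt is passed in as a parameter rather than read from
`/dev/random`.

```haskell
import Data.Bits (xor)
import Data.List (foldl')
import Data.Word (Word64, Word8)

-- | FNV-1a over a list of bytes, starting from the given state.
-- Illustration only: NOT the hash function used by lsm-tree.
fnv1a :: Word64 -> [Word8] -> Word64
fnv1a = foldl' step
  where
    step h b = (h `xor` fromIntegral b) * 0x100000001b3

-- | Mix a per-session salt into the initial hash state.
saltedHash :: Word64 -> [Word8] -> Word64
saltedHash salt = fnv1a (0xcbf29ce484222325 `xor` salt)

-- | The bit positions that a key occupies in a Bloom filter with 'numBits'
-- bits and 'numHashes' probes, derived by double hashing.
bitPositions :: Word64 -> Int -> Int -> [Word8] -> [Int]
bitPositions salt numHashes numBits key =
  [ fromIntegral ((h1 + fromIntegral i * h2) `mod` fromIntegral numBits)
  | i <- [0 .. numHashes - 1]
  ]
  where
    h1 = saltedHash salt key
    h2 = fnv1a h1 key -- second hash, seeded with the first
```

In this toy hash, the same key always hashes to different values under
different salts, because every step is a bijection on the 64-bit state. A
production design additionally needs a hash whose collisions do not carry over
from one salt to another.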

The result is that while it is in principle still possible to produce hash
collisions in the Bloom filter, this now depends on knowing the salt, and every
node has a different salt. So a system-wide attack becomes impossible; instead
it is only plausible to target individual nodes. Discovering a node's salt
would also be impractically difficult. In principle there is a timing side
channel, in that collisions will cause more I/O and thus take longer. An
attacker would need to get upstream of a victim node, supply a valid block and
measure the timing of receiving the block downstream. There is, however, a
large amount of noise in such a measurement.

Overall, our judgement is that this mitigation is practically sufficient, but
it merits a security review from others who may make a different judgement. It

is also worth noting that this issue may occur in other LSM-trees used in other
Cardano and non-Cardano implementations. In particular, RocksDB does not appear
to use a salt at all.

Note that a per-run or per-table hash salt would incur non-trivial costs,
because it would reduce the sharing available in bulk Bloom filter lookups
(looking up N keys in M filters). The Bloom filter lookup is a
performance-sensitive part of the overall database implementation.
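
The cost difference can be sketched with a toy Bloom filter. Everything below
is hypothetical and for illustration only: the filter is a 1024-bit set stored
in an `Integer`, and `hashKey` is a simple multiplicative hash rather than the
one used by `lsm-tree`. With one salt per session, each of the N keys is hashed
once and the hash is reused for all M filters; with one salt per filter, every
(key, filter) pair needs its own hash.

```haskell
import Data.Bits (setBit, shiftR, testBit, xor)
import Data.List (foldl')
import Data.Word (Word64)

-- | A toy Bloom filter: a 1024-bit set stored in an 'Integer'.
type Filter = Integer

-- | Toy salted hash on 64-bit keys (illustration only).
hashKey :: Word64 -> Word64 -> Word64
hashKey salt key = (key `xor` salt) * 0x9e3779b97f4a7c15

-- | Two probe positions taken from different parts of one hash value.
probes :: Word64 -> [Int]
probes h =
  [fromIntegral (h `shiftR` 54), fromIntegral ((h `shiftR` 32) `mod` 1024)]

insertHash :: Filter -> Word64 -> Filter
insertHash f h = foldl' setBit f (probes h)

memberHash :: Word64 -> Filter -> Bool
memberHash h f = all (testBit f) (probes h)

-- | Session-wide salt: N hash computations, each reused across M filters.
bulkLookupShared :: Word64 -> [Word64] -> [Filter] -> [[Bool]]
bulkLookupShared salt keys filters =
  [[memberHash h f | f <- filters] | h <- map (hashKey salt) keys]

-- | Per-filter salts: one hash per (key, filter) pair, N * M in total.
bulkLookupPerFilter :: [Word64] -> [(Word64, Filter)] -> [[Bool]]
bulkLookupPerFilter keys salted =
  [[memberHash (hashKey s k) f | (s, f) <- salted] | k <- keys]
```

Both functions return one row per key and one column per filter; they differ
only in how many times `hashKey` is evaluated.
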
Comment on lines +186 to +189

Just to make it extra clear

Suggested change
Note that a per-run or per-table hash salt would incur non-trivial costs,
because it would reduce the sharing available in bulk Bloom filter lookups
(looking up N keys in M filters). Hence, we store a hash salt on a per-session
basis. The Bloom filter lookup is a performance-sensitive part of the overall
database implementation.


In the Cardano context, a downside of a per-session (and thus per-node) Bloom
filter salt is that it may interact poorly with sharing of pre-created
databases. While it will work to copy a whole database session (since this
includes the salt), it means the salt is then shared between the nodes. If SPOs
share databases widely with each other (to avoid syncing the entire chain), then
the salt diversity is lost. This would be especially acute with Mithril, which
shares a single copy of the database. For proper Mithril support, it may be
necessary to add a re-salting operation and to perform it after cloning a
Mithril snapshot. Re-salting would involve re-creating the Bloom filter for each
table run: reading the run, inserting its keys into a new Bloom filter, and
writing out the new Bloom filter. This would of course be additional development
work, but the infrastructure needed is already present.
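
The re-salting steps can be sketched in memory as follows. This is a
hypothetical model, not `lsm-tree` code: a `Run` here is just a list of keys
with a toy 1024-bit filter, and `probe` is a stand-in for the real Bloom filter
hashing. A real implementation would read each run from disk and write the new
filter back out.

```haskell
import Data.Bits (setBit, shiftR, testBit, xor)
import Data.List (foldl')
import Data.Word (Word64)

-- | A toy run: its keys plus a Bloom filter built with some salt.
data Run = Run {runKeys :: [Word64], runFilter :: Integer}

-- | Toy salted probe position in a 1024-bit filter (illustration only).
probe :: Word64 -> Word64 -> Int
probe salt key =
  fromIntegral (((key `xor` salt) * 0x9e3779b97f4a7c15) `shiftR` 54)

-- | Build a fresh filter for the given keys under the given salt.
buildFilter :: Word64 -> [Word64] -> Integer
buildFilter salt = foldl' (\f k -> setBit f (probe salt k)) 0

-- | Query a run's filter; the salt must match the one used to build it.
mayContain :: Word64 -> Run -> Word64 -> Bool
mayContain salt run key = testBit (runFilter run) (probe salt key)

-- | Re-salt a session: rebuild every run's filter from the run's keys.
resalt :: Word64 -> [Run] -> [Run]
resalt newSalt runs =
  [r {runFilter = buildFilter newSalt (runKeys r)} | r <- runs]
```

After `resalt newSalt runs`, every key of a run is still reported as possibly
present when queried with `newSalt`, since Bloom filters have no false
negatives.
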
Comment on lines +191 to +203

I think we should advise against distributing lsm-tree folders as part of Mithril snapshots, and advise to use a canonical ledger state snapshot format instead


# Possible file system incompatibility with XFS

The authors have seen at least one platform environment where there was a
failure when using a table configuration with no disk caching (i.e.
`DiskCacheNone`). It is unconfirmed, but the suspicion is that some versions of
the Linux XFS file system (and at least the version on the default AWS Amazon
Linux 2023 AMI) do not support the system call that underlies [`fileSetCaching`]
from the `unix` package. This is an `fcntl` call, used to set the file status
flag `O_DIRECT`. XFS certainly supports `O_DIRECT`, but it may support it only
when the file is opened using this flag, and not when trying to set the flag on
an already open file.
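
The suspected failure can be probed with a small program. The sketch below
assumes the `unix` version linked from these notes (2.8.7.0), where `openFd`
takes the path, open mode, and flags; the function name
`canDisableCachingAfterOpen` is hypothetical. It opens a file normally and then
asks `fileSetCaching` to turn caching off, reporting whether the call
succeeded.

```haskell
import Control.Exception (IOException, bracket, try)
import System.Posix.Fcntl (fileSetCaching)
import System.Posix.IO (OpenMode (ReadOnly), closeFd, defaultFileFlags, openFd)

-- | Try to disable OS-level caching on an already open file, as
-- 'fileSetCaching' does via @fcntl@. Returns 'False' if the call fails
-- with an I/O error, which is the behaviour suspected on some XFS versions.
canDisableCachingAfterOpen :: FilePath -> IO Bool
canDisableCachingAfterOpen path =
  bracket (openFd path ReadOnly defaultFileFlags) closeFd $ \fd -> do
    result <- try (fileSetCaching fd False) :: IO (Either IOException ())
    pure (either (const False) (const True) result)
```

Running this against a file on the file system in question gives a quick
indication of whether `DiskCacheNone` is likely to work there; the result
depends on the kernel and file system, so it is only a diagnostic aid.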

A workaround is to use the EXT4 file system, or to use `DiskCacheAll` for the
table configuration (at the cost of using more memory and putting pressure on
the page cache). If this issue is confirmed to be a widespread problem, it may
become necessary to extend the `unix` package to allow setting the `O_DIRECT`
flag when opening a file.

[`fileSetCaching`]: https://hackage-content.haskell.org/package/unix-2.8.7.0/docs/System-Posix-Fcntl.html#v:fileSetCaching
Binary file added doc/final-report/pipelining.pdf
Binary file not shown.
Binary file added doc/final-report/references/utxo-db-api.pdf
Binary file not shown.
Binary file added doc/final-report/references/utxo-db-lsm.pdf
Binary file not shown.
Binary file added doc/final-report/references/utxo-db.pdf
Binary file not shown.