lazily intialize iterators (#14772)#14772
Conversation
|
@stephpontikes has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104904298. |
a88500a to
fa31550
Compare
Summary: Updated the iterator creation scheme to happen lazily (on request) as oppsed to eagerly. this allows us to prune the iterator tree structure at the time of requesting iterator preparation as opposed to creation, and allows pruning to become an implementation detail. Version now skips non-overlapping SST levels and files before adding children to the iterator tree, returns direct table iterators when a level has a single matching file, and uses pruned LevelIterator instances when multiple files in one non-L0 level match. The overload no longer prepares iterators during creation; callers that need prepared multiscan execution still call Prepare explicitly after construction, and MultiScan does that itself. Benchmark: ran `db_bench` in opt mode for the base revision and this diff, with `fillseq,compact,levelstats,multiscanrandom`, `--num=1000000`, `--reads=10000000`, single thread, fixed seeds, `--multiscan_use_async_io=false`, and `--use_multiscan=true`. Both A and B had exactly one SST file and no memtable/L0 data (`L0: 0 files`, `L1: 1 file, 61 MB`). `multiscanrandom` creates `MultiScanArgs` and calls `NewMultiScan(...)`, which reaches the new `NewIterator(..., scan_opts)` pruning path in this diff. ``` seed base A pruning B delta 424242 21824.333 17693.333 -18.9% 424243 24042.014 19424.056 -19.2% 424244 22424.974 17636.910 -21.4% 424245 22404.213 18612.840 -16.9% ``` Average: base `22673.9 us/op`, pruning `18341.8 us/op`, about `19.1%` faster. Differential Revision: D104904298
✅ clang-tidy: No findings on changed linesCompleted in 553.5s. |
fa31550 to
b180438
Compare
Summary: Updated the iterator creation scheme to happen lazily (on request) as oppsed to eagerly. this allows us to prune the iterator tree structure at the time of requesting iterator preparation as opposed to creation, and allows pruning to become an implementation detail. Version now skips non-overlapping SST levels and files before adding children to the iterator tree, returns direct table iterators when a level has a single matching file, and uses pruned LevelIterator instances when multiple files in one non-L0 level match. The overload no longer prepares iterators during creation; callers that need prepared multiscan execution still call Prepare explicitly after construction, and MultiScan does that itself. Benchmark: ran `db_bench` in opt mode for the base revision and this diff, with `fillseq,compact,levelstats,multiscanrandom`, `--num=1000000`, `--reads=10000000`, single thread, fixed seeds, `--multiscan_use_async_io=false`, and `--use_multiscan=true`. Both A and B had exactly one SST file and no memtable/L0 data (`L0: 0 files`, `L1: 1 file, 61 MB`). `multiscanrandom` creates `MultiScanArgs` and calls `NewMultiScan(...)`, which reaches the new `NewIterator(..., scan_opts)` pruning path in this diff. ``` seed base A pruning B delta 424242 21824.333 17693.333 -18.9% 424243 24042.014 19424.056 -19.2% 424244 22424.974 17636.910 -21.4% 424245 22404.213 18612.840 -16.9% ``` Average: base `22673.9 us/op`, pruning `18341.8 us/op`, about `19.1%` faster. Differential Revision: D104904298
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit b180438 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit b180438 SummaryPR #14772 introduces lazy iterator initialization and MultiScan-based pruning for RocksDB iterators. The design is sound -- deferring internal iterator tree construction to first use enables pruning non-overlapping memtables/files/levels, yielding ~19% improvement for single-SST MultiScan workloads. The SuperVersion lifecycle, error handling, and conservative pruning for range deletions are handled correctly. A few medium-severity issues remain around wasteful Refresh behavior, missing branch hints on hot paths, and test coverage gaps. High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🔴 HIGHNo high-severity findings after cross-agent debate and verification. 🟡 MEDIUMM1. Wasteful Refresh() on uninitialized iterator --
|
| Context | Executes? | Safe? | Notes |
|---|---|---|---|
| WritePreparedTxnDB | YES | YES | read_callback preserved through lazy init |
| ReadOnly DB | YES | YES | allow_refresh=false |
| SstFileReader | NO | N/A | Uses direct Init() |
| CF deletion before use | YES | YES | SuperVersion holds CFD ref; new tests verify |
| User-defined timestamps | YES | YES | CompareWithoutTimestamp used correctly |
| Prefix seek | YES | YES | Pruning uses separate ReadOptions |
Positive Observations
- Correct SuperVersion lifecycle: Both happy path and destruction-without-init properly managed.
- Conservative pruning: Memtables with range deletions never pruned (
NumRangeDeletion() > 0 => return true). - Good test coverage: 8 new tests cover L0/level/memtable pruning, CF lifecycle, dedup, edge cases.
- Clean factoring: Well-extracted helpers reduce duplication.
- Solid benchmark: ~19% improvement across 4 seeds.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
Summary: Updated the iterator creation scheme to happen lazily (on request) as oppsed to eagerly. this allows us to prune the iterator tree structure at the time of requesting iterator preparation as opposed to creation, and allows pruning to become an implementation detail. Version now skips non-overlapping SST levels and files before adding children to the iterator tree, returns direct table iterators when a level has a single matching file, and uses pruned LevelIterator instances when multiple files in one non-L0 level match. The overload no longer prepares iterators during creation; callers that need prepared multiscan execution still call Prepare explicitly after construction, and MultiScan does that itself. Benchmark: ran `db_bench` in opt mode for the base revision and this diff, with `fillseq,compact,levelstats,multiscanrandom`, `--num=1000000`, `--reads=10000000`, single thread, fixed seeds, `--multiscan_use_async_io=false`, and `--use_multiscan=true`. Both A and B had exactly one SST file and no memtable/L0 data (`L0: 0 files`, `L1: 1 file, 61 MB`). `multiscanrandom` creates `MultiScanArgs` and calls `NewMultiScan(...)`, which reaches the new `NewIterator(..., scan_opts)` pruning path in this diff. ``` seed base A pruning B delta 424242 21824.333 17693.333 -18.9% 424243 24042.014 19424.056 -19.2% 424244 22424.974 17636.910 -21.4% 424245 22404.213 18612.840 -16.9% ``` Average: base `22673.9 us/op`, pruning `18341.8 us/op`, about `19.1%` faster. Differential Revision: D104904298
b180438 to
8b6da52
Compare
Summary: Updated the iterator creation scheme to happen lazily (on request) as oppsed to eagerly. this allows us to prune the iterator tree structure at the time of requesting iterator preparation as opposed to creation, and allows pruning to become an implementation detail. Version now skips non-overlapping SST levels and files before adding children to the iterator tree, returns direct table iterators when a level has a single matching file, and uses pruned LevelIterator instances when multiple files in one non-L0 level match. The overload no longer prepares iterators during creation; callers that need prepared multiscan execution still call Prepare explicitly after construction, and MultiScan does that itself. Benchmark: ran `db_bench` in opt mode for the base revision and this diff, with `fillseq,compact,levelstats,multiscanrandom`, `--num=1000000`, `--reads=10000000`, single thread, fixed seeds, `--multiscan_use_async_io=false`, and `--use_multiscan=true`. Both A and B had exactly one SST file and no memtable/L0 data (`L0: 0 files`, `L1: 1 file, 61 MB`). `multiscanrandom` creates `MultiScanArgs` and calls `NewMultiScan(...)`, which reaches the new `NewIterator(..., scan_opts)` pruning path in this diff. ``` seed base A pruning B delta 424242 21824.333 17693.333 -18.9% 424243 24042.014 19424.056 -19.2% 424244 22424.974 17636.910 -21.4% 424245 22404.213 18612.840 -16.9% ``` Average: base `22673.9 us/op`, pruning `18341.8 us/op`, about `19.1%` faster. Differential Revision: D104904298
8b6da52 to
58b5825
Compare
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 58b5825 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 58b5825 SummarySolid performance optimization (~19% for MultiScan) with careful lifecycle management. The lazy initialization design is sound: SuperVersion references are properly maintained through deferred state, and sequence numbers are correctly captured at creation time. Several medium-severity issues found around missing branch prediction hints, a potential L0 pruning edge case, and test coverage gaps. High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🟡 MEDIUMM1. Missing LIKELY/UNLIKELY on hot-path lazy-init check --
|
| Context | Affected? | Assessment |
|---|---|---|
| WritePreparedTxnDB | YES - overrides NewIterator | Safe: read_callback_ is set during Init() before lazy init; sequence captured at creation |
| ReadOnly DB | YES - uses NewArenaWrappedDbIterator | Safe: allow_refresh=false, deferred init works the same |
| SecondaryInstance | YES - uses NewArenaWrappedDbIterator | Safe: same deferred pattern applies |
| User-defined timestamps | YES - pruning uses CompareWithoutTimestamp | Safe: MultiScanInternalKey correctly handles timestamp_size > 0 |
| BlobDB | YES - allow_blob_write_path_fallback_ extracted from CFH |
Safe: pre-extracted at DBIter construction time |
| FIFO/Universal compaction | Not directly affected | Safe: pruning operates on file metadata, agnostic to compaction style |
| Concurrent writers | Not directly affected | Safe: SV reference prevents concurrent cleanup |
| Old snapshots | YES - child_read_options_.snapshot = nullptr |
Safe: sequence_ carries the correct seqno, snapshot pointer not dereferenced |
Assumption stress-test results:
- Claim: "Pruning is safe" - CONFIRMED.
HasBoundedScanRanges()returns false for unbounded ranges. Memtables with range tombstones are conservatively kept (explicitNumRangeDeletion() > 0check). Scan ranges are validated as sorted/non-overlapping before pruning. - Claim: "SuperVersion is properly managed" - CONFIRMED.
GetReferencedSuperVersionadds a ref.CleanupDeferredSuperVersioncallsCleanupIteratorSuperVersionwhich doesUnref(). After lazy init, SV ownership transfers to theSuperVersionHandlecleanup registered on the internal iterator. - Claim: "CFD is safe" - CONFIRMED. SV holds a reference to CFD. As long as
deferred_sv_is alive, CFD cannot be freed. TestsIteratorSeekAfterCfDeleteandIteratorSeekAfterCfDropverify this. - Claim: "Setting snapshot to nullptr is safe" - CONFIRMED.
LevelIteratornow uses explicitread_seqparameter.MemTableListVersion::AddIteratorsuses explicitread_seq. The snapshot pointer is not needed.
Positive Observations
- Clean lifecycle management: The
DestroyDBIter/CleanupDeferredSuperVersion/DestroyDBIterAndArenadecomposition is well-structured and handles all cleanup paths correctly. - Conservative memtable pruning: The check
NumRangeDeletion() > 0correctly prevents pruning memtables that might have cross-range tombstones. - Good test coverage: New tests cover L0 pruning, level pruning, memtable pruning, multi-range dedup, CF deletion/drop before seek, and repeated Prepare rejection.
- Refactoring of range_del_iter ownership: Moving from raw
new/deletetostd::unique_ptr<FragmentedRangeTombstoneIterator>in memtable_list.cc is a good cleanup. - DBIter CFH decoupling: Replacing
cfh_with pre-extracted fields (trace_db_,trace_cf_id_,ingest_sst_lock_, etc.) is a sound design that removes the DBIter's dependency on CFH lifetime. - Sequence number parameter threading: Passing
read_seqexplicitly throughAddIterators->AddIteratorsForLevel->LevelIteratorconstructor is more robust than relying onread_options.snapshotwhich may be released.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
Summary:
Updated the iterator creation scheme to happen lazily (on request) as oppsed to eagerly. this allows us to prune the iterator tree structure at the time of requesting iterator preparation as opposed to creation, and allows pruning to become an implementation detail. Version now skips non-overlapping SST levels and files before adding children to the iterator tree, returns direct table iterators when a level has a single matching file, and uses pruned LevelIterator instances when multiple files in one non-L0 level match. The overload no longer prepares iterators during creation; callers that need prepared multiscan execution still call Prepare explicitly after construction, and MultiScan does that itself.
Benchmark: ran
db_benchin opt mode for the base revision and this diff, withfillseq,compact,levelstats,multiscanrandom,--num=1000000,--reads=10000000, single thread, fixed seeds,--multiscan_use_async_io=false, and--use_multiscan=true. Both A and B had exactly one SST file and no memtable/L0 data (L0: 0 files,L1: 1 file, 61 MB).multiscanrandomcreatesMultiScanArgsand callsNewMultiScan(...), which reaches the newNewIterator(..., scan_opts)pruning path in this diff.Average: base
22673.9 us/op, pruning18341.8 us/op, about19.1%faster.Differential Revision: D104904298