# TryCatchUpWithPrimary() returns stale data in two scenarios: stale active memtable after flush, stale immutable memtable after primary reopen #14444
## Description
We observed two distinct scenarios where TryCatchUpWithPrimary() returns success but the secondary returns stale values. A standalone Golang reproducer is attached (rocksdb-reproduce-secondary-staled.zip).
Environment: RocksDB 10.10.1, Linux, C API via Go bindings (github.com/linxGnu/grocksdb v1.10.4).
Options used:

Primary:
- `WALTtlSeconds`: 86400
- `Compression`: LZ4
- `BottommostCompression`: ZSTD
- `PipelinedWrite`: enabled
- `Statistics`: enabled (level: All)

Secondary:
- `MaxOpenFiles`: -1

All other options: RocksDB defaults.
## Scenario 1: Stale read after primary flush

### Observation
```
[1] Open primary and secondary
[2] Primary: Put("key1", "v1") -- WAL only, no flush
[3] Secondary: TryCatchUpWithPrimary()
[4] Secondary: Get("key1") → "v1" -- correct
[5] Primary: Put("key1", "v2") -- WAL only
[6] Primary: Flush() -- v2 written to SST; old WAL archived, new empty WAL created
[7] Secondary: TryCatchUpWithPrimary()
[8] Secondary: Get("key1") → "v1" -- STALE (expected "v2")
```
Expected: "v2". Actual: "v1". `TryCatchUpWithPrimary` returns no error.
Additional observation: the staleness is per column family. After [8], writing to a different CF on the primary does not fix the stale read; writing to the same CF does (Case 2 in the reproducer):
```
[9]  Primary: Put("key2", "x") in CF "others" -- different CF
[10] Secondary: TryCatchUpWithPrimary()
[11] Secondary: Get("key1") → "v1" -- STILL STALE
[12] Primary: Put("key3", "v3") in CF "default" -- same CF
[13] Secondary: TryCatchUpWithPrimary()
[14] Secondary: Get("key1") → "v2" -- CORRECT
```
### Suspected root cause
When `TryCatchUpWithPrimary` runs after the flush:

- The archived WAL containing `key1=v2` is skipped by the sequence check in `db_impl_secondary.cc:254–264` (`seq_of_batch <= SST.largest_seqno`; the data is correctly in the SST). Because the batch is skipped, the memtable sealing path in `db_impl_secondary.cc:265–282` (`!mem->IsEmpty() && curr_log_num != log_number`) is never reached.
- The new WAL created by flush is empty, so its replay loop body never executes.
- `RemoveOldMemTables` (`memtable_list.cc:1030–1055`) only iterates the immutable memtable list and never touches the active memtable.
The secondary's active memtable retains the stale `key1=v1` from its initial WAL replay. Since `GetImpl` checks the active memtable first and short-circuits on the first match, the correct value in the SST is never reached.
The active memtable is only replaced when processing new data from a different WAL file (`db_impl_secondary.cc:265–282`), which is why a new write to the same CF on the primary (step [12]) resolves it on the next catchup.
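The replay decision above can be sketched as a pure-Go toy model. All names (`batch`, `replay`, parameters) are made up for illustration; this is not the actual RocksDB replay code, just the skip-vs-seal logic it describes:

```go
package main

import "fmt"

// batch models one WAL write batch seen during the secondary's catch-up.
type batch struct {
	logNum uint64 // WAL file the batch came from
	seq    uint64 // sequence number of the batch
}

// replay reports whether the stale active memtable would be sealed.
// A batch already covered by an SST is skipped entirely, so the sealing
// path (non-empty memtable + a different WAL number) is never reached.
func replay(batches []batch, largestSSTSeq, currLogNum uint64, memEmpty bool) bool {
	sealed := false
	for _, b := range batches {
		if b.seq <= largestSSTSeq {
			continue // data is already in an SST: skip, sealing never runs
		}
		if !memEmpty && b.logNum != currLogNum {
			sealed = true // seal the active memtable before applying new-WAL data
		}
	}
	return sealed
}

func main() {
	// Scenario 1 after the primary flush: the archived WAL's batch (seq 2)
	// is covered by the SST (largest seqno 2) and the new WAL is empty.
	fmt.Println(replay([]batch{{logNum: 4, seq: 2}}, 2, 3, false)) // false: never sealed
	// A later same-CF write (step [12]) arrives with a higher seq in a new WAL.
	fmt.Println(replay([]batch{{logNum: 4, seq: 2}, {logNum: 5, seq: 3}}, 2, 3, false)) // true
}
```

Under this model, a WAL whose every batch is covered by an SST contributes nothing to the loop body, which matches the observed "flush leaves the secondary's active memtable untouched" behavior.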
## Scenario 2: Stale read after primary close + reopen

### Observation
```
[1]  Open primary and secondary
[2]  Primary: Put("key1", "v1") -- WAL only
[3]  Secondary: TryCatchUpWithPrimary()
[4]  Secondary: Get("key1") → "v1" -- correct
[5]  Primary: Put("key1", "v2")
[6]  Primary: Flush() + Close() -- v2 in SST, WAL empty
[7]  Primary: Reopen() + Put("key2", "v1") -- write to same CF to avoid Scenario 1
[8]  Secondary: TryCatchUpWithPrimary()
[9]  Primary: Put("key2", "v2")
[10] Secondary: TryCatchUpWithPrimary()
[11] Secondary: Get("key1") → "v1" -- STALE (expected "v2")
[12] Secondary: Get("key2") → "v2" -- correct (new key, no stale entry)
```
Expected: "v2" for key1. Actual: "v1". key2 reads correctly, showing that `TryCatchUpWithPrimary` is working for newly written keys; only the overwritten key is stale.
The stale read self-corrects when the reopened primary flushes (Case 4 in the reproducer):
```
[13] Primary: Flush()
[14] Secondary: TryCatchUpWithPrimary()
[15] Secondary: Get("key1") → "v2" -- CORRECT
```
### Suspected root cause
Primary reopen creates a WAL number gap. RocksDB uses a single monotonically increasing counter (`next_file_number`) for SSTs, WALs, and other files. During recovery, `RecoverLogFiles` flushes the old WAL (e.g., 4) to a new SST, which consumes several file numbers (5, 6, …). After recovery completes, `SetLogNumber(max_wal + 1 = 5)` is written to the MANIFEST (`db_impl_open.cc:1821–1831`), meaning "all data from WALs ≤ 4 is now in SSTs". The new active WAL is then created with the next available file number after all recovery allocations, e.g., 9. There are no WAL files 5–8; those numbers were consumed by other file types during recovery.
So the MANIFEST records `log_number = 5` (a pre-reopen high-water mark), while the actual new active WAL is 9. This gap is the root of the issue.
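The gap arithmetic can be sketched with a toy shared-counter model (the allocator, `reopenNumbers`, and the count of consumed numbers are illustrative assumptions, not RocksDB code):

```go
package main

import "fmt"

// allocator models RocksDB's single next_file_number counter that is
// shared by every file type (SSTs, WALs, MANIFESTs, ...).
type allocator struct{ next uint64 }

func (a *allocator) alloc() uint64 { n := a.next; a.next++; return n }

// reopenNumbers models the recovery path: the old WAL is flushed, recovery
// consumes `consumed` file numbers for non-WAL files, then the new active
// WAL is created. Returns the MANIFEST log_number and the new WAL number.
func reopenNumbers(oldWAL, consumed uint64) (logNumber, newWAL uint64) {
	a := &allocator{next: oldWAL + 1}
	for i := uint64(0); i < consumed; i++ {
		a.alloc() // numbers eaten by the recovery SST, MANIFEST edits, etc.
	}
	// SetLogNumber(max_wal + 1); the new WAL gets the next free number.
	return oldWAL + 1, a.alloc()
}

func main() {
	ln, nw := reopenNumbers(4, 4)
	fmt.Println(ln, nw) // 5 9: WAL numbers 5..8 never exist as WAL files
}
```

The exact size of the gap depends on how many files recovery creates; the point is only that `log_number` and the new WAL's number diverge whenever recovery allocates anything in between.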
When the secondary processes WAL 9 (step [7]), the sealing condition fires: the old active memtable (containing `key1=v1`) is sealed into an immutable memtable with `next_log_number = 9` (`db_impl_secondary.cc:265–282`). Then `RemoveOldMemTables(log_number=5)` checks `9 > 5`? YES → KEEP (`memtable_list.cc:1030–1055`). `GetImpl` then finds `key1=v1` in that immutable memtable and short-circuits before reaching the SST.
The self-correction at step [13] works because flushing the reopened primary writes a new `SetLogNumber(≥ 9)` to the MANIFEST (`flush_job.cc:203–206`), so on the next catchup `RemoveOldMemTables` sees `9 <= new_log_number` → REMOVE.
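The keep/remove decision reduces to a single comparison; a minimal sketch (the helper name is hypothetical, not the RocksDB API):

```go
package main

import "fmt"

// keepImmutable models the RemoveOldMemTables check on the secondary:
// an immutable memtable survives while its next_log_number is above the
// MANIFEST's log_number high-water mark (illustrative toy model).
func keepImmutable(nextLogNumber, manifestLogNumber uint64) bool {
	return nextLogNumber > manifestLogNumber
}

func main() {
	// After the reopen: memtable sealed with next_log_number = 9, but the
	// MANIFEST still says log_number = 5, so the stale memtable is kept.
	fmt.Println(keepImmutable(9, 5)) // true: kept, stale read persists
	// After the reopened primary flushes, log_number rises to >= 9.
	fmt.Println(keepImmutable(9, 9)) // false: removed, read self-corrects
}
```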
## Summary

| | Scenario 1 | Scenario 2 |
|---|---|---|
| Stale memtable | Active | Immutable |
| Trigger | Primary flush (WAL archived, new empty WAL) | Primary close + reopen (WAL number gap) |
| Why not cleaned | Sealing never runs: no new WAL data passes the sequence check | `next_log_number` (9) > `log_number` (5): `RemoveOldMemTables` keeps it |
| Self-corrects | Next same-CF write on primary | Next flush on reopened primary |
| Key code | `db_impl_secondary.cc:254–264, 265–282` | `db_impl_open.cc:1821–1831`, `memtable_list.cc:1030–1055` |
Both scenarios share the same fundamental issue: `GetImpl` short-circuits on the first memtable match. A stale memtable entry silently shadows correct SST data, with no error returned from `TryCatchUpWithPrimary`.
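The shadowing can be illustrated with a toy read path that consults sources in memtable-then-SST order and stops at the first hit (pure Go, illustrative only; real `GetImpl` is far more involved):

```go
package main

import "fmt"

// get models the lookup order: active memtable, immutable memtables,
// then SSTs. The first source holding the key wins; later sources are
// never consulted, so a stale memtable entry shadows newer SST data.
func get(key string, sources []map[string]string) (string, bool) {
	for _, src := range sources {
		if v, ok := src[key]; ok {
			return v, true // short-circuit on the first match
		}
	}
	return "", false
}

func main() {
	active := map[string]string{"key1": "v1"} // stale entry from initial WAL replay
	sst := map[string]string{"key1": "v2"}    // correct, newer value
	v, _ := get("key1", []map[string]string{active, sst})
	fmt.Println(v) // v1: the stale memtable shadows the SST
}
```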
## Questions

- Are these behaviors expected? If so, should the documentation or `TryCatchUpWithPrimary`'s return value reflect that a successful call does not guarantee the secondary is up to date?
- Is there a recommended pattern for secondary users who need reliable reads after a primary flush?
## Reproducer

Attached: `rocksdb-reproduce-secondary-staled.zip`, a standalone Go program.
Expected output:

```
Issue 1: Stale active memtable after flush
Case 1: Stale active memtable after flush FAIL
Case 2: Same CF write recovers stale memtable PASS
Issue 2: Stale immutable memtable after close+reopen
Case 3: Stale immutable memtable after reopen FAIL
Case 4: Flush on reopened primary self-corrects PASS
```