
TryCatchUpWithPrimary() returns stale data in two scenarios: stale active memtable after flush, stale immutable memtable after primary reopen #14444

@kinolollipop

Description


We observed two distinct scenarios in which TryCatchUpWithPrimary() returns success yet the secondary serves stale values. A standalone Go reproducer is attached (rocksdb-reproduce-secondary-staled.zip).

Environment: RocksDB 10.10.1, Linux, C API via Go bindings (github.com/linxGnu/grocksdb v1.10.4).

Options used:
Primary:

WALTtlSeconds:         86400
Compression:           LZ4
BottommostCompression: ZSTD
PipelinedWrite:        enabled
Statistics:            enabled (level: All)

Secondary:

MaxOpenFiles:          -1
# all other options: RocksDB defaults

Scenario 1: Stale read after primary flush

Observation

[1] Open primary and secondary
[2] Primary: Put("key1", "v1")           -- WAL only, no flush
[3] Secondary: TryCatchUpWithPrimary()
[4] Secondary: Get("key1") → "v1"       -- correct
[5] Primary: Put("key1", "v2")           -- WAL only
[6] Primary: Flush()                     -- v2 written to SST; old WAL archived, new empty WAL created
[7] Secondary: TryCatchUpWithPrimary()
[8] Secondary: Get("key1") → "v1"       -- STALE (expected "v2")

Expected: "v2". Actual: "v1". TryCatchUpWithPrimary returns no error.

Additional observation — the staleness is per column family. After [8], writing to a different CF on the primary does not fix the stale read; writing to the same CF does (Case 2 in the reproducer):

[9]  Primary: Put("key2", "x")  in CF "others"  -- different CF
[10] Secondary: TryCatchUpWithPrimary()
[11] Secondary: Get("key1") → "v1"              -- STILL STALE
[12] Primary: Put("key3", "v3") in CF "default" -- same CF
[13] Secondary: TryCatchUpWithPrimary()
[14] Secondary: Get("key1") → "v2"              -- CORRECT

Suspected root cause

When TryCatchUpWithPrimary runs after the flush:

  • The archived WAL containing key1=v2 is skipped by the sequence check in db_impl_secondary.cc:254–264 (seq_of_batch <= SST.largest_seqno — the data is correctly in the SST). Because the batch is skipped, the memtable sealing path in db_impl_secondary.cc:265–282 (!mem->IsEmpty() && curr_log_num != log_number) is never reached.
  • The new WAL created by flush is empty, so its replay loop body never executes.
  • RemoveOldMemTables (memtable_list.cc:1030–1055) only iterates the immutable memtable list and does not touch the active memtable.

The secondary's active memtable retains the stale key1=v1 from its initial WAL replay. Since GetImpl checks the active memtable first and short-circuits on the first match, the correct value in the SST is never reached.
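The short-circuit lookup order can be sketched as a minimal simulation (the `lookup` function and its map-based stores are hypothetical illustrations, not RocksDB's actual types):

```go
package main

import "fmt"

// lookup mimics the order in which GetImpl consults data sources:
// active memtable first, then immutable memtables, then SSTs. The first
// hit wins, so a stale memtable entry shadows a newer value in an SST.
func lookup(key string, active, immutable, sst map[string]string) (string, string) {
	if v, ok := active[key]; ok {
		return v, "active memtable"
	}
	if v, ok := immutable[key]; ok {
		return v, "immutable memtable"
	}
	if v, ok := sst[key]; ok {
		return v, "sst"
	}
	return "", "not found"
}

func main() {
	// Scenario 1 state on the secondary after the primary's flush:
	// the active memtable still holds key1=v1 from the initial WAL
	// replay, while the SST already holds the newer key1=v2.
	active := map[string]string{"key1": "v1"}
	sst := map[string]string{"key1": "v2"}
	v, src := lookup("key1", active, nil, sst)
	fmt.Printf("Get(key1) = %q from %s\n", v, src) // stale v1 shadows v2
}
```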

The active memtable is only replaced when processing new data from a different WAL file (db_impl_secondary.cc:265–282) — which is why a new write to the same CF on the primary (step [12]) resolves it on the next catchup.
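The decision path above can be condensed into a small simulation (simplified and hypothetical; `catchUp`, `walBatch`, and the parameters are illustrations of the described logic, not RocksDB's real code):

```go
package main

import "fmt"

// walBatch stands in for a write batch read from a WAL during catch-up.
type walBatch struct {
	logNumber int
	seq       uint64
}

// catchUp sketches the replay loop: batches whose sequence number is
// already covered by an SST are skipped, and the active memtable is only
// sealed when a batch from a *new* WAL is actually applied.
func catchUp(batches []walBatch, sstLargestSeq uint64, currLogNum int) (applied int, sealed bool) {
	for _, b := range batches {
		if b.seq <= sstLargestSeq {
			continue // data already in an SST; the sealing path never runs
		}
		if b.logNumber != currLogNum {
			sealed = true // new WAL with live data: seal old active memtable
			currLogNum = b.logNumber
		}
		applied++
	}
	return
}

func main() {
	// After the primary's flush: the archived WAL's batch (seq 2) is
	// covered by the new SST (largest seq 2), and the fresh WAL is empty.
	applied, sealed := catchUp([]walBatch{{logNumber: 4, seq: 2}}, 2, 4)
	fmt.Println(applied, sealed) // 0 false: nothing applied, nothing sealed
}
```

A later same-CF write lands in the new WAL with a higher sequence number, which is exactly the case that flips `sealed` to true on the next catch-up.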


Scenario 2: Stale read after primary close + reopen

Observation

[1]  Open primary and secondary
[2]  Primary: Put("key1", "v1")              -- WAL only
[3]  Secondary: TryCatchUpWithPrimary()
[4]  Secondary: Get("key1") → "v1"          -- correct
[5]  Primary: Put("key1", "v2")
[6]  Primary: Flush() + Close()              -- v2 in SST, WAL empty
[7]  Primary: Reopen() + Put("key2", "v1")  -- write to same CF to avoid Scenario 1
[8]  Secondary: TryCatchUpWithPrimary()
[9]  Primary: Put("key2", "v2")
[10] Secondary: TryCatchUpWithPrimary()
[11] Secondary: Get("key1") → "v1"          -- STALE (expected "v2")
[12] Secondary: Get("key2") → "v2"          -- correct (new key, no stale entry)

Expected: "v2" for key1. Actual: "v1". key2 reads correctly, proving that TryCatchUpWithPrimary is working for newly written keys — only the overwritten key is stale.

The stale read self-corrects when the reopened primary flushes (Case 4 in the reproducer):

[13] Primary: Flush()
[14] Secondary: TryCatchUpWithPrimary()
[15] Secondary: Get("key1") → "v2"          -- CORRECT

Suspected root cause

Primary reopen creates a WAL number gap. RocksDB uses a single monotonically increasing counter (next_file_number) for SSTs, WALs, etc. During recovery, RecoverLogFiles flushes the old WAL (e.g., 4) to a new SST, which consumes several file numbers (5, 6, …). After recovery completes, SetLogNumber(max_wal + 1 = 5) is written to MANIFEST (db_impl_open.cc:1821–1831) — meaning "all data from WALs ≤ 4 is now in SSTs". The new active WAL is then created with the next available file number after all recovery allocations — e.g., 9. There are no WAL files 5–8; those numbers were consumed by other file types during recovery.

So MANIFEST records log_number = 5 (a pre-reopen high-water mark), while the actual new active WAL is 9. This gap is the root of the issue.
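The file-number arithmetic can be sketched as a toy simulation of the single shared counter (illustrative numbers; `recoverGap` and its parameters are hypothetical, not RocksDB APIs):

```go
package main

import "fmt"

// recoverGap simulates the shared file-number counter across a primary
// reopen: recovery consumes several numbers (the flushed SST and other
// files) between the recorded log_number and the new active WAL.
func recoverGap(maxWAL, recoveryAllocs int) (logNumber, newWAL int) {
	next := maxWAL + 1
	alloc := func() int { n := next; next++; return n }
	for i := 0; i < recoveryAllocs; i++ {
		alloc() // e.g. the new SST and other files created during recovery
	}
	logNumber = maxWAL + 1 // SetLogNumber(max_wal + 1) written to MANIFEST
	newWAL = alloc()       // new active WAL takes the next free number
	return
}

func main() {
	logNumber, newWAL := recoverGap(4, 4)
	fmt.Printf("MANIFEST log_number=%d, active WAL=%d\n", logNumber, newWAL)
	// WAL files 5..8 never exist; that gap is what the secondary trips over.
}
```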

When the secondary processes WAL 9 (step [7]), the sealing condition fires: the old active memtable (containing key1=v1) is sealed into an immutable memtable with next_log_number = 9 (db_impl_secondary.cc:265–282). Then RemoveOldMemTables(log_number=5) checks 9 > 5? YES → KEEP (memtable_list.cc:1030–1055). GetImpl then finds key1=v1 in that immutable memtable and short-circuits before reaching the SST.

The self-correction at step [13] works because flushing the reopened primary writes a new SetLogNumber(≥9) to MANIFEST (flush_job.cc:203–206), so on the next catchup RemoveOldMemTables sees 9 <= new_log_number → REMOVE.
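The keep/remove decision for the stale immutable memtable reduces to a single comparison, sketched here (simplified; `shouldDrop` is a hypothetical stand-in for the RemoveOldMemTables condition described above):

```go
package main

import "fmt"

// shouldDrop mirrors the simplified RemoveOldMemTables condition: an
// immutable memtable is dropped only once the MANIFEST log_number has
// caught up to the memtable's next_log_number.
func shouldDrop(memNextLogNumber, manifestLogNumber int) bool {
	return memNextLogNumber <= manifestLogNumber
}

func main() {
	// Steps [7]-[11]: memtable sealed with next_log_number=9, MANIFEST says 5.
	fmt.Println(shouldDrop(9, 5)) // false: the stale memtable is kept
	// Step [13]: flush on the reopened primary records log_number >= 9.
	fmt.Println(shouldDrop(9, 10)) // true: the stale memtable is removed
}
```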


Summary

                 Scenario 1                                    Scenario 2
Stale memtable   Active                                        Immutable
Trigger          Primary flush (WAL archived, new empty WAL)   Primary close + reopen (WAL number gap)
Why not cleaned  Sealing never runs; no new WAL data passes    next_log_number (9) > log_number (5), so
                 the sequence check                            RemoveOldMemTables keeps it
Self-corrects    Next same-CF write on primary                 Next flush on reopened primary
Key code         db_impl_secondary.cc:254–264, 265–282         db_impl_open.cc:1821–1831, memtable_list.cc:1030–1055

Both scenarios share the fundamental issue: GetImpl short-circuits on the first memtable match. A stale memtable entry silently shadows correct SST data, with no error returned from TryCatchUpWithPrimary.


Questions

  1. Are these behaviors expected? If so, should the documentation or TryCatchUpWithPrimary's return value reflect that a successful call does not guarantee the secondary is up to date?
  2. Is there a recommended pattern for secondary users who need reliable reads after a primary flush?

Reproducer

Attached: rocksdb-reproduce-secondary-staled.zip — standalone Go program.

Expected output:

Issue 1: Stale active memtable after flush
  Case 1: Stale active memtable after flush               FAIL
  Case 2: Same CF write recovers stale memtable           PASS

Issue 2: Stale immutable memtable after close+reopen
  Case 3: Stale immutable memtable after reopen           FAIL
  Case 4: Flush on reopened primary self-corrects         PASS
