
TryCatchUpWithPrimary() returns stale data in two scenarios: stale active memtable after flush, stale immutable memtable after primary reopen #14444

@kinolollipop

Description


We observed two distinct scenarios in which TryCatchUpWithPrimary() returns success yet the secondary serves stale values. A standalone Go reproducer is attached (rocksdb-reproduce-secondary-staled.zip).

Environment: RocksDB 10.10.1, Linux, C API via Go bindings (github.com/linxGnu/grocksdb v1.10.4).

Options used:
Primary:

WALTtlSeconds:         86400
Compression:           LZ4
BottommostCompression: ZSTD
PipelinedWrite:        enabled
Statistics:            enabled (level: All)

Secondary:

MaxOpenFiles:          -1
# all other options: RocksDB defaults

Scenario 1: Stale read after primary flush

Observation

[1] Open primary and secondary
[2] Primary: Put("key1", "v1")           -- WAL only, no flush
[3] Secondary: TryCatchUpWithPrimary()
[4] Secondary: Get("key1") → "v1"       -- correct
[5] Primary: Put("key1", "v2")           -- WAL only
[6] Primary: Flush()                     -- v2 written to SST; old WAL archived, new empty WAL created
[7] Secondary: TryCatchUpWithPrimary()
[8] Secondary: Get("key1") → "v1"       -- STALE (expected "v2")

Expected: "v2". Actual: "v1". TryCatchUpWithPrimary returns no error.

Additional observation — the staleness is per column family. After [8], writing to a different CF on the primary does not fix the stale read; writing to the same CF does (Case 2 in the reproducer):

[9]  Primary: Put("key2", "x")  in CF "others"  -- different CF
[10] Secondary: TryCatchUpWithPrimary()
[11] Secondary: Get("key1") → "v1"              -- STILL STALE
[12] Primary: Put("key3", "v3") in CF "default" -- same CF
[13] Secondary: TryCatchUpWithPrimary()
[14] Secondary: Get("key1") → "v2"              -- CORRECT

Suspected root cause

When TryCatchUpWithPrimary runs after the flush:

  • The archived WAL containing key1=v2 is skipped by the sequence check in db_impl_secondary.cc:254–264 (seq_of_batch <= SST.largest_seqno — the data is correctly in the SST). Because the batch is skipped, the memtable sealing path in db_impl_secondary.cc:265–282 (!mem->IsEmpty() && curr_log_num != log_number) is never reached.
  • The new WAL created by flush is empty, so its replay loop body never executes.
  • RemoveOldMemTables (memtable_list.cc:1030–1055) only iterates the immutable memtable list and does not touch the active memtable.

The secondary's active memtable retains the stale key1=v1 from its initial WAL replay. Since GetImpl checks the active memtable first and short-circuits on the first match, the correct value in the SST is never reached.
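The short-circuit lookup order can be sketched as a minimal simulation (the `lookup` function and its map-based stores are hypothetical illustrations, not RocksDB's actual types):

```go
package main

import "fmt"

// lookup mimics the order in which GetImpl consults data sources:
// active memtable first, then immutable memtables, then SSTs. The first
// hit wins, so a stale memtable entry shadows a newer value in an SST.
func lookup(key string, active, immutable, sst map[string]string) (string, string) {
	if v, ok := active[key]; ok {
		return v, "active memtable"
	}
	if v, ok := immutable[key]; ok {
		return v, "immutable memtable"
	}
	if v, ok := sst[key]; ok {
		return v, "sst"
	}
	return "", "not found"
}

func main() {
	// Scenario 1 state on the secondary after the primary's flush:
	// the active memtable still holds key1=v1 from the initial WAL
	// replay, while the SST already holds the newer key1=v2.
	active := map[string]string{"key1": "v1"}
	sst := map[string]string{"key1": "v2"}
	v, src := lookup("key1", active, nil, sst)
	fmt.Printf("Get(key1) = %q from %s\n", v, src) // stale v1 shadows v2
}
```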

The active memtable is only replaced when processing new data from a different WAL file (db_impl_secondary.cc:265–282) — which is why a new write to the same CF on the primary (step [12]) resolves it on the next catchup.
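The decision path above can be condensed into a small simulation (simplified and hypothetical; `catchUp`, `walBatch`, and the parameters are illustrations of the described logic, not RocksDB's real code):

```go
package main

import "fmt"

// walBatch stands in for a write batch read from a WAL during catch-up.
type walBatch struct {
	logNumber int
	seq       uint64
}

// catchUp sketches the replay loop: batches whose sequence number is
// already covered by an SST are skipped, and the active memtable is only
// sealed when a batch from a *new* WAL is actually applied.
func catchUp(batches []walBatch, sstLargestSeq uint64, currLogNum int) (applied int, sealed bool) {
	for _, b := range batches {
		if b.seq <= sstLargestSeq {
			continue // data already in an SST; the sealing path never runs
		}
		if b.logNumber != currLogNum {
			sealed = true // new WAL with live data: seal old active memtable
			currLogNum = b.logNumber
		}
		applied++
	}
	return
}

func main() {
	// After the primary's flush: the archived WAL's batch (seq 2) is
	// covered by the new SST (largest seq 2), and the fresh WAL is empty.
	applied, sealed := catchUp([]walBatch{{logNumber: 4, seq: 2}}, 2, 4)
	fmt.Println(applied, sealed) // 0 false: nothing applied, nothing sealed
}
```

A later same-CF write lands in the new WAL with a higher sequence number, which is exactly the case that flips `sealed` to true on the next catch-up.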


Scenario 2: Stale read after primary close + reopen

Observation

[1]  Open primary and secondary
[2]  Primary: Put("key1", "v1")              -- WAL only
[3]  Secondary: TryCatchUpWithPrimary()
[4]  Secondary: Get("key1") → "v1"          -- correct
[5]  Primary: Put("key1", "v2")
[6]  Primary: Flush() + Close()              -- v2 in SST, WAL empty
[7]  Primary: Reopen() + Put("key2", "v1")  -- write to same CF to avoid Scenario 1
[8]  Secondary: TryCatchUpWithPrimary()
[9]  Primary: Put("key2", "v2")
[10] Secondary: TryCatchUpWithPrimary()
[11] Secondary: Get("key1") → "v1"          -- STALE (expected "v2")
[12] Secondary: Get("key2") → "v2"          -- correct (new key, no stale entry)

Expected: "v2" for key1. Actual: "v1". key2 reads correctly, proving that TryCatchUpWithPrimary is working for newly written keys — only the overwritten key is stale.

The stale read self-corrects when the reopened primary flushes (Case 4 in the reproducer):

[13] Primary: Flush()
[14] Secondary: TryCatchUpWithPrimary()
[15] Secondary: Get("key1") → "v2"          -- CORRECT

Suspected root cause

Primary reopen creates a WAL number gap. RocksDB uses a single monotonically increasing counter (next_file_number) for SSTs, WALs, etc. During recovery, RecoverLogFiles flushes the old WAL (e.g., 4) to a new SST, which consumes several file numbers (5, 6, …). After recovery completes, SetLogNumber(max_wal + 1 = 5) is written to MANIFEST (db_impl_open.cc:1821–1831) — meaning "all data from WALs ≤ 4 is now in SSTs". The new active WAL is then created with the next available file number after all recovery allocations — e.g., 9. There are no WAL files 5–8; those numbers were consumed by other file types during recovery.

So MANIFEST records log_number = 5 (a pre-reopen high-water mark), while the actual new active WAL is 9. This gap is the root of the issue.
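The file-number arithmetic can be sketched as a toy simulation of the single shared counter (illustrative numbers; `recoverGap` and its parameters are hypothetical, not RocksDB APIs):

```go
package main

import "fmt"

// recoverGap simulates the shared file-number counter across a primary
// reopen: recovery consumes several numbers (the flushed SST and other
// files) between the recorded log_number and the new active WAL.
func recoverGap(maxWAL, recoveryAllocs int) (logNumber, newWAL int) {
	next := maxWAL + 1
	alloc := func() int { n := next; next++; return n }
	for i := 0; i < recoveryAllocs; i++ {
		alloc() // e.g. the new SST and other files created during recovery
	}
	logNumber = maxWAL + 1 // SetLogNumber(max_wal + 1) written to MANIFEST
	newWAL = alloc()       // new active WAL takes the next free number
	return
}

func main() {
	logNumber, newWAL := recoverGap(4, 4)
	fmt.Printf("MANIFEST log_number=%d, active WAL=%d\n", logNumber, newWAL)
	// WAL files 5..8 never exist; that gap is what the secondary trips over.
}
```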

When the secondary processes WAL 9 (step [7]), the sealing condition fires: the old active memtable (containing key1=v1) is sealed into an immutable memtable with next_log_number = 9 (db_impl_secondary.cc:265–282). Then RemoveOldMemTables(log_number=5) checks 9 > 5? YES → KEEP (memtable_list.cc:1030–1055). GetImpl then finds key1=v1 in that immutable memtable and short-circuits before reaching the SST.

The self-correction at step [13] works because flushing the reopened primary writes a new SetLogNumber(≥9) to MANIFEST (flush_job.cc:203–206), so on the next catchup RemoveOldMemTables sees 9 <= new_log_number → REMOVE.
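The keep/remove decision for the stale immutable memtable reduces to a single comparison, sketched here (simplified; `shouldDrop` is a hypothetical stand-in for the RemoveOldMemTables condition described above):

```go
package main

import "fmt"

// shouldDrop mirrors the simplified RemoveOldMemTables condition: an
// immutable memtable is dropped only once the MANIFEST log_number has
// caught up to the memtable's next_log_number.
func shouldDrop(memNextLogNumber, manifestLogNumber int) bool {
	return memNextLogNumber <= manifestLogNumber
}

func main() {
	// Steps [7]-[11]: memtable sealed with next_log_number=9, MANIFEST says 5.
	fmt.Println(shouldDrop(9, 5)) // false: the stale memtable is kept
	// Step [13]: flush on the reopened primary records log_number >= 9.
	fmt.Println(shouldDrop(9, 10)) // true: the stale memtable is removed
}
```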


Summary

                 Scenario 1                                    Scenario 2
Stale memtable   Active                                        Immutable
Trigger          Primary flush (WAL archived, new empty WAL)   Primary close + reopen (WAL number gap)
Why not cleaned  Sealing never runs; no new WAL data passes    next_log_number (9) > log_number (5), so
                 the sequence check                            RemoveOldMemTables keeps it
Self-corrects    Next same-CF write on primary                 Next flush on reopened primary
Key code         db_impl_secondary.cc:254–264, 265–282         db_impl_open.cc:1821–1831, memtable_list.cc:1030–1055

Both scenarios share the fundamental issue: GetImpl short-circuits on the first memtable match. A stale memtable entry silently shadows correct SST data, with no error returned from TryCatchUpWithPrimary.


Questions

  1. Are these behaviors expected? If so, should the documentation or TryCatchUpWithPrimary's return value reflect that a successful call does not guarantee the secondary is up to date?
  2. Is there a recommended pattern for secondary users who need reliable reads after a primary flush?

Reproducer

Attached: rocksdb-reproduce-secondary-staled.zip — standalone Go program.

Expected output:

Issue 1: Stale active memtable after flush
  Case 1: Stale active memtable after flush               FAIL
  Case 2: Same CF write recovers stale memtable           PASS

Issue 2: Stale immutable memtable after close+reopen
  Case 3: Stale immutable memtable after reopen           FAIL
  Case 4: Flush on reopened primary self-corrects         PASS
