[Performance] Double-buffered async WAL flush to reduce write stalls in manual_wal_flush mode #14525
Description
Summary
Currently, when manual_wal_flush is enabled, FlushWAL() holds log_write_mutex_ for the entire duration of the disk write (WriteBuffer()). This blocks all concurrent WAL AddRecord() calls for the full I/O duration, creating a latency spike proportional to the amount of buffered data — which can be significant under high-throughput workloads since the whole point of manual_wal_flush is to batch and defer WAL writes.
This PR addresses the problem by implementing double-buffered asynchronous WAL flushing. FlushWAL() now swaps the write buffer under the mutex (a fast in-memory operation) and then writes to disk without holding log_write_mutex_, so that AddRecord() (i.e. user writes) can continue into a fresh buffer concurrently.
Key changes
- `ManualFlushWritableFileWriter` subclass: Extracts manual-flush double-buffering (`SwapBuffer`/`FlushSwappedBuffer`) and the unbounded-append-buffer policy out of `WritableFileWriter` into a dedicated subclass. The base class no longer carries `manual_flush_state`; instead, virtual methods (`AppendBufferSizeLimit()`, `ShouldImplicitFlushOnAppend()`) provide the polymorphic behavior.
- Double-buffered `FlushWAL()`: Under `log_write_mutex_`, `FlushWAL()` calls `SwapBuffer()` to move the full write buffer into a secondary flush buffer, then releases the mutex and writes the flush buffer to disk via `FlushSwappedBuffer()`. New `Append()` calls proceed into the fresh primary buffer without blocking.
- `LogWriterNumber::getting_flushed` state: A new per-log flag tracks whether a log's buffer is being flushed to disk. `SyncWAL()`, `SyncClosedLogs()`, and `FindObsoleteFiles()` now wait for both `IsSyncing()` and `IsFlushing()` to clear before proceeding, preventing data races between concurrent `Flush()`/`Append()` and `Sync()` on the same underlying file.
- `log_sync_cv_` → `wal_io_cv_`: The condition variable is renamed to reflect its broader role: it now signals completion of both sync and flush operations.
- `WritableFileWriter` refactoring: `WriteBuffered()` is split into a reusable `WriteToFile()` (core write loop with rate limiting, no buf_ side effects) and `WriteBuffered()` (calls `WriteToFile` then clears buf_). `buf_` and `max_buffer_size_` are moved to `protected` for subclass access. The destructor is made `virtual`.
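The class split described above can be illustrated with a minimal sketch. This is not the PR's code: `SketchWriter`, `SketchManualFlushWriter`, and `DemoPolicies` are hypothetical names, the "file" is an in-memory string, and rate limiting, error handling, and locking are omitted. It only shows how the two virtual hooks could switch off implicit flushing in the subclass, and how a swap step can be separated from the write step.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <utility>

// Hypothetical stand-in for the base writer (not RocksDB's WritableFileWriter).
class SketchWriter {
 public:
  explicit SketchWriter(std::size_t max_buffer_size)
      : max_buffer_size_(max_buffer_size) {}
  virtual ~SketchWriter() = default;

  void Append(const std::string& data) {
    buf_ += data;
    // The hooks decide whether a full buffer is written out right away.
    if (ShouldImplicitFlushOnAppend() && buf_.size() >= AppendBufferSizeLimit()) {
      WriteBuffered();
    }
  }

  // Writes buf_ to the "file", then clears it.
  void WriteBuffered() {
    WriteToFile(buf_);
    buf_.clear();
  }

  const std::string& file() const { return file_; }
  std::size_t buffered() const { return buf_.size(); }

 protected:
  // Core write step; it does not touch buf_.
  void WriteToFile(const std::string& data) { file_ += data; }

  virtual std::size_t AppendBufferSizeLimit() const { return max_buffer_size_; }
  virtual bool ShouldImplicitFlushOnAppend() const { return true; }

  std::string buf_;
  std::size_t max_buffer_size_;

 private:
  std::string file_;  // in-memory stand-in for the on-disk log file
};

// Hypothetical manual-flush subclass: unbounded buffer, explicit two-step flush.
class SketchManualFlushWriter : public SketchWriter {
 public:
  using SketchWriter::SketchWriter;

  // Fast in-memory step: move the filled buffer aside.
  void SwapBuffer() {
    assert(flush_buf_.empty());  // only one swapped buffer may be pending
    std::swap(buf_, flush_buf_);
  }

  // Slow step: write the swapped-out buffer; buf_ is left alone.
  void FlushSwappedBuffer() {
    WriteToFile(flush_buf_);
    flush_buf_.clear();
  }

 protected:
  std::size_t AppendBufferSizeLimit() const override {
    return std::numeric_limits<std::size_t>::max();
  }
  bool ShouldImplicitFlushOnAppend() const override { return false; }

 private:
  std::string flush_buf_;
};

// Returns true if both writers behave as the comments describe.
bool DemoPolicies() {
  SketchWriter base(4);
  base.Append("abcdef");  // over the limit, so it is written implicitly

  SketchManualFlushWriter manual(4);
  manual.Append("abcdef");  // stays buffered until an explicit flush
  bool before_flush = base.file() == "abcdef" && manual.file().empty() &&
                      manual.buffered() == 6;

  manual.SwapBuffer();
  manual.Append("gh");  // lands in the fresh primary buffer
  manual.FlushSwappedBuffer();
  return before_flush && manual.file() == "abcdef" && manual.buffered() == 2;
}
```

In this sketch the base class never checks a manual-flush flag; the subclass changes behavior only through the two overrides, which is the design the list above describes.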
Concurrency protocol
FlushWAL thread                     Writer threads
─────────────────                   ──────────────
lock(log_write_mutex_)
wait while IsFlushing()
PrepareForFlush()
SwapBuffer()  ← buf_ is now empty
unlock(log_write_mutex_)
                                    Append() into fresh buf_ (no blocking)
FlushSwappedBuffer()                ...
lock(log_write_mutex_)
FinishFlush()
signal(wal_io_cv_)
unlock(log_write_mutex_)
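The protocol above can be sketched with standard-library primitives. This is a simplified model under stated assumptions, not the PR's implementation: `SketchWal`, `Contents`, and `DemoConcurrentFlush` are hypothetical, the disk write is an append to an in-memory string, and the member names `log_write_mutex_`, `wal_io_cv_`, and `getting_flushed_` are borrowed from the description only for readability.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical WAL model; it mirrors the steps in the diagram, not RocksDB code.
class SketchWal {
 public:
  // Writer path: an in-memory append under the mutex, never blocked by I/O.
  void AddRecord(const std::string& rec) {
    std::lock_guard<std::mutex> lock(log_write_mutex_);
    buf_ += rec;
  }

  void FlushWAL() {
    std::string flush_buf;
    {
      std::unique_lock<std::mutex> lock(log_write_mutex_);
      // Wait while another flush is in flight.
      wal_io_cv_.wait(lock, [this] { return !getting_flushed_; });
      getting_flushed_ = true;     // PrepareForFlush()
      std::swap(buf_, flush_buf);  // SwapBuffer(): buf_ is now empty
    }
    // FlushSwappedBuffer(): the slow write runs without the mutex. Only one
    // flusher can be here at a time, so file_ has a single writer.
    file_ += flush_buf;
    {
      std::lock_guard<std::mutex> lock(log_write_mutex_);
      getting_flushed_ = false;  // FinishFlush()
      wal_io_cv_.notify_all();   // signal(wal_io_cv_)
    }
  }

  // Reader that waits for in-flight flushes, as the sync paths are said to do.
  std::string Contents() {
    std::unique_lock<std::mutex> lock(log_write_mutex_);
    wal_io_cv_.wait(lock, [this] { return !getting_flushed_; });
    return file_;
  }

 private:
  std::mutex log_write_mutex_;
  std::condition_variable wal_io_cv_;
  bool getting_flushed_ = false;
  std::string buf_;   // primary buffer, guarded by log_write_mutex_
  std::string file_;  // stand-in for the log file
};

// Runs writers and a flusher concurrently; returns the bytes that reached the file.
std::size_t DemoConcurrentFlush() {
  SketchWal wal;
  std::atomic<bool> done{false};
  std::thread flusher([&] {
    while (!done.load()) wal.FlushWAL();
  });
  std::vector<std::thread> writers;
  for (int t = 0; t < 4; ++t) {
    writers.emplace_back([&] {
      for (int i = 0; i < 1000; ++i) wal.AddRecord("x");
    });
  }
  for (auto& w : writers) w.join();
  done.store(true);
  flusher.join();
  wal.FlushWAL();  // drain whatever is still buffered
  std::string contents = wal.Contents();
  assert(contents.find_first_not_of('x') == std::string::npos);
  return contents.size();
}
```

Every record is appended exactly once and every swapped buffer is written exactly once, so `DemoConcurrentFlush()` should return 4000 regardless of how the threads interleave. The flag is what lets `Contents()` read `file_` safely even though the write itself happens outside the mutex.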
Motivation
In the current manual_wal_flush implementation, FlushWAL() holds log_write_mutex_ for the entire duration of the disk write (WriteBuffer()). This blocks all concurrent WAL AddRecord() calls, creating a latency spike proportional to the buffered data size. With double-buffering, the mutex is only held for the fast in-memory buffer swap, and the slow disk I/O happens concurrently with new writes.
Test plan
- Existing unit tests pass (`db_test`, `db_wal_test`, `db_flush_test`, `fault_injection_test`)
- `db_bench` with `--manual_wal_flush=1` shows reduced p99 write latency under concurrent load
- Stress test with `db_stress --manual_wal_flush=1` to verify no data loss or corruption
- Verify `SyncWAL()` and `SyncClosedLogs()` correctly wait for in-flight flushes before syncing