[Performance] Double-buffered async WAL flush to reduce write stalls in manual_wal_flush mode #14525

@wolfkdy

Summary

Currently, when manual_wal_flush is enabled, FlushWAL() holds log_write_mutex_ for the entire duration of the disk write (WriteBuffer()). This blocks all concurrent WAL AddRecord() calls for the full I/O duration, so the stall grows with the amount of buffered data. Under high-throughput workloads that amount can be large, because the whole point of manual_wal_flush is to batch and defer WAL writes.

This PR addresses the problem by implementing double-buffered asynchronous WAL flushing. FlushWAL() now swaps the write buffer under the mutex (a fast in-memory operation) and then writes to disk without holding log_write_mutex_, so that AddRecord() (i.e. user writes) can continue into a fresh buffer concurrently.

Key changes

  • ManualFlushWritableFileWriter subclass — Extracts manual-flush double-buffering (SwapBuffer / FlushSwappedBuffer) and unbounded-append-buffer policy out of WritableFileWriter into a dedicated subclass. The base class no longer carries manual_flush_ state; instead, virtual methods (AppendBufferSizeLimit(), ShouldImplicitFlushOnAppend()) provide the polymorphic behavior.
  • Double-buffered FlushWAL() — Under log_write_mutex_, FlushWAL() calls SwapBuffer() to move the full write buffer into a secondary flush buffer, then releases the mutex and writes the flush buffer to disk via FlushSwappedBuffer(). New Append() calls proceed into the fresh primary buffer without blocking.
  • LogWriterNumber::getting_flushed state — A new per-log flag tracks whether a log's buffer is being flushed to disk. SyncWAL(), SyncClosedLogs(), and FindObsoleteFiles() now wait for both IsSyncing() and IsFlushing() to clear before proceeding, preventing data races between concurrent Flush()/Append() and Sync() on the same underlying file.
  • log_sync_cv_ → wal_io_cv_ — The condition variable is renamed to reflect its broader role: it now signals completion of both sync and flush operations.
  • WritableFileWriter refactoring — WriteBuffered() is split into a reusable WriteToFile() (core write loop with rate limiting, no buf_ side effects) and WriteBuffered() (calls WriteToFile() then clears buf_). buf_ and max_buffer_size_ are moved to protected for subclass access. The destructor is made virtual.
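
The class split described above can be sketched roughly as follows. This is a toy illustration, not the PR's code: ToyFileWriter and ToyManualFlushWriter are hypothetical stand-ins, the "file" is an in-memory std::string, and rate limiting is omitted. Only the method names AppendBufferSizeLimit(), ShouldImplicitFlushOnAppend(), WriteToFile(), WriteBuffered(), SwapBuffer() and FlushSwappedBuffer() come from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical stand-in for the base writer; the "file" is just a string.
class ToyFileWriter {
 public:
  explicit ToyFileWriter(size_t max_buffer_size)
      : max_buffer_size_(max_buffer_size) {}
  virtual ~ToyFileWriter() = default;  // virtual, since we subclass

  void Append(const std::string& data) {
    // The base policy flushes implicitly when the buffer would overflow.
    if (ShouldImplicitFlushOnAppend() &&
        buf_.size() + data.size() > AppendBufferSizeLimit()) {
      WriteBuffered();
    }
    buf_ += data;
  }

  const std::string& file() const { return file_; }
  size_t buffered() const { return buf_.size(); }

 protected:
  // Policy hooks that replace a manual_flush_ flag in the base class.
  virtual size_t AppendBufferSizeLimit() const { return max_buffer_size_; }
  virtual bool ShouldImplicitFlushOnAppend() const { return true; }

  // Core write step with no buf_ side effects.
  void WriteToFile(const std::string& data) { file_ += data; }
  // Writes buf_ and then clears it.
  void WriteBuffered() {
    WriteToFile(buf_);
    buf_.clear();
  }

  std::string buf_;
  size_t max_buffer_size_;

 private:
  std::string file_;
};

// Hypothetical stand-in for the manual-flush subclass.
class ToyManualFlushWriter : public ToyFileWriter {
 public:
  using ToyFileWriter::ToyFileWriter;

  // Moves the primary buffer into the flush buffer (cheap, in-memory).
  void SwapBuffer() {
    assert(flush_buf_.empty());
    flush_buf_.swap(buf_);
  }
  // Writes the flush buffer; Append() may fill buf_ in the meantime.
  void FlushSwappedBuffer() {
    WriteToFile(flush_buf_);
    flush_buf_.clear();
  }

 protected:
  size_t AppendBufferSizeLimit() const override {
    return std::numeric_limits<size_t>::max();  // unbounded append buffer
  }
  bool ShouldImplicitFlushOnAppend() const override { return false; }

 private:
  std::string flush_buf_;
};
```

In this sketch the base writer flushes on its own once the limit is exceeded, while the subclass only writes when SwapBuffer() and FlushSwappedBuffer() are called explicitly. Thread safety is left to the caller, as it is with log_write_mutex_ in the description.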

Concurrency protocol

FlushWAL thread               Writer threads
─────────────────              ──────────────
lock(log_write_mutex_)
  wait while IsFlushing()
  PrepareForFlush()
  SwapBuffer()                 ← buf_ is now empty
unlock(log_write_mutex_)
                               Append() into fresh buf_ (no blocking)
FlushSwappedBuffer()           ...
lock(log_write_mutex_)
  FinishFlush()
  signal(wal_io_cv_)
unlock(log_write_mutex_)
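
The protocol above can be approximated with standard C++ primitives. The following is a minimal sketch under stated assumptions: ToyWal and RunDemo are hypothetical, the disk write is modeled as a string append, and PrepareForFlush() / FinishFlush() are reduced to setting and clearing a single getting_flushed_ flag. It is meant to show the lock scope, not the actual implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical miniature of the WAL; member names mirror the description.
class ToyWal {
 public:
  // Writer path: an in-memory append under the mutex.
  void AddRecord(const std::string& rec) {
    std::lock_guard<std::mutex> lock(log_write_mutex_);
    buf_ += rec;
  }

  void FlushWAL() {
    std::string flush_buf;
    {
      std::unique_lock<std::mutex> lock(log_write_mutex_);
      // Wait while another flush is in flight.
      wal_io_cv_.wait(lock, [this] { return !getting_flushed_; });
      getting_flushed_ = true;  // stands in for PrepareForFlush()
      flush_buf.swap(buf_);     // SwapBuffer(): buf_ is now empty
    }
    // FlushSwappedBuffer(): the slow write runs without the mutex.
    // Only one flusher can be here at a time because of the flag.
    file_ += flush_buf;
    {
      std::lock_guard<std::mutex> lock(log_write_mutex_);
      getting_flushed_ = false;  // stands in for FinishFlush()
      wal_io_cv_.notify_all();
    }
  }

  const std::string& file() const { return file_; }

 private:
  std::mutex log_write_mutex_;
  std::condition_variable wal_io_cv_;
  bool getting_flushed_ = false;
  std::string buf_;
  std::string file_;
};

// Runs several writers that also flush periodically, then returns the
// final file contents after one last flush.
std::string RunDemo(int writers, int records_per_writer) {
  ToyWal wal;
  std::vector<std::thread> threads;
  for (int w = 0; w < writers; ++w) {
    threads.emplace_back([&wal, records_per_writer] {
      for (int i = 0; i < records_per_writer; ++i) {
        wal.AddRecord("x");
        if (i % 16 == 0) wal.FlushWAL();
      }
    });
  }
  for (auto& t : threads) t.join();
  wal.FlushWAL();
  return wal.file();
}
```

With this structure, AddRecord() only contends for the mutex during the buffer swap and the flag updates, so RunDemo(4, 1000) should end with all 4000 one-byte records in the file.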

Motivation

As described in the Summary, the current manual_wal_flush implementation holds log_write_mutex_ across the whole disk write (WriteBuffer()), so concurrent WAL AddRecord() calls stall for a time proportional to the buffered data size. With double-buffering, the mutex is held only for the fast in-memory buffer swap, and the slow disk I/O overlaps with new writes.

Test plan

  • Existing unit tests pass (db_test, db_wal_test, db_flush_test, fault_injection_test)
  • db_bench with --manual_wal_flush=1 shows reduced p99 write latency under concurrent load
  • Stress test with db_stress --manual_wal_flush=1 to verify no data loss or corruption
  • Verify SyncWAL() and SyncClosedLogs() correctly wait for in-flight flushes before syncing
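
The last check in the list could be prototyped outside RocksDB along these lines. This is a hedged sketch only: ToyLogState and SyncAfterInFlightFlush are hypothetical, the flush is simulated with a short sleep, and the two flags stand in for the IsFlushing() and IsSyncing() states mentioned in the key changes.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical per-log state guarded by one mutex.
struct ToyLogState {
  std::mutex mu;
  std::condition_variable wal_io_cv;
  bool getting_flushed = false;
  bool getting_synced = false;
  int flushed_bytes = 0;
};

// Starts a sync while a flush is already in flight and returns how many
// flushed bytes the sync observed once it was allowed to proceed.
int SyncAfterInFlightFlush(int bytes) {
  ToyLogState log;
  {
    std::lock_guard<std::mutex> lock(log.mu);
    log.getting_flushed = true;  // the flush begins before the sync
  }
  std::thread flusher([&log, bytes] {
    // Simulated slow disk write, performed without the mutex.
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    std::lock_guard<std::mutex> lock(log.mu);
    log.flushed_bytes += bytes;
    log.getting_flushed = false;
    log.wal_io_cv.notify_all();
  });

  int covered = 0;
  {
    std::unique_lock<std::mutex> lock(log.mu);
    // The sync waits until neither a flush nor another sync is running.
    log.wal_io_cv.wait(lock, [&log] {
      return !log.getting_flushed && !log.getting_synced;
    });
    log.getting_synced = true;
    covered = log.flushed_bytes;  // what this sync would make durable
    log.getting_synced = false;
  }
  flusher.join();
  return covered;
}
```

Because the flag is set before the sync starts waiting, the sync in this sketch cannot proceed until the simulated flush has finished, so it always observes the full flushed byte count.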
