
[Store] Optimize uring file support for SSD offloading #1562

Merged
ykwd merged 37 commits into kvcache-ai:main from zhangzuo21:better-loadbuffer-management
Mar 12, 2026

Conversation

@zhangzuo21 (Collaborator)

Description

Type of Change

  • Types
    • Bug fix
    • [✅] New feature
      • Transfer Engine
      • [✅] Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Checklist

  • [✅] I have performed a self-review of my own code.
  • [✅] I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Summary

This PR contains three independent improvements to the offload (SSD-backed) path in Mooncake Store.

1. Per-thread io_uring ring (uring_file.cpp)

Previously each UringFile instance owned its own io_uring ring, which meant every file open/close incurred the overhead of io_uring_queue_init / io_uring_queue_exit system calls, and concurrent I/O across threads required mutex protection on the shared ring.

This PR replaces the per-file ring with a thread_local SharedUringRing singleton. Each thread initializes exactly one ring on first use and the ring lives for the lifetime of that thread. Key benefits:

  • Zero mutex contention: each ring is exclusively owned by its thread; no locking is needed for I/O operations.
  • Drastically reduced syscall overhead: ring init/exit no longer happens on every file open/close.
  • Better NVMe queue depth utilization: within a single thread, multiple SQEs can be batched and submitted together (up to QUEUE_DEPTH=32), exposing deeper device parallelism.
  • Fixed-buffer registration: the registered memory buffer is stored globally (g_buf) and lazily registered on each thread-local ring on first use, maintaining zero-copy I/O across all threads.

2. Early load buffer release via RPC (real_client.cpp, batch_get_into_offload_object_internal)

When an object is loaded from SSD and transferred from one client to another over the network, the sender holds a load buffer for that object until a GC lease TTL expires. Previously the buffer was not reclaimed until this timer fired.

This PR adds a fire-and-forget release_offload_buffer RPC call sent by the receiver immediately after the data transfer completes. The sender releases the load buffer upon receiving this RPC, reclaiming the memory much earlier than the GC deadline. If the RPC fails (e.g., network issue), the existing GC mechanism acts as a safety backstop.

3. Eviction protection for items in the offloading queue (master_service.cpp)

Items queued for offloading (SSD write) could previously be evicted from memory before the offloading task completed, leading to potential data loss or use-after-free bugs.

This PR increments the replica's reference count (inc_refcnt) when a replica is pushed onto the offloading queue, and decrements it (dec_refcnt) when the task completes, is revoked, or times out. The eviction logic already requires get_refcnt() == 0 before a replica can be evicted, so this change provides a clean, race-free protection window for all in-flight offloading tasks. The same refcount protection pattern is also applied symmetrically to COPY and MOVE replication tasks.

zhuxinjie-nz and others added 30 commits December 12, 2025 13:46
@gemini-code-assist (Contributor)

Summary of Changes

Hello @zhangzuo21, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Mooncake Store's SSD offloading path by introducing several optimizations. It improves I/O performance and concurrency through a new per-thread io_uring architecture, optimizes memory utilization by enabling early release of load buffers, and strengthens data integrity by protecting in-flight offloading tasks from eviction. These changes collectively aim to make the offloading mechanism more efficient, robust, and performant.

Highlights

  • Per-thread io_uring ring: Replaced per-file io_uring rings with a thread-local singleton SharedUringRing. This eliminates mutex contention, drastically reduces syscall overhead for file open/close, and improves NVMe queue depth utilization by batching I/O operations within a thread. It also enables fixed-buffer registration for zero-copy I/O.
  • Early load buffer release via RPC: Implemented a fire-and-forget RPC call (release_offload_buffer) that allows a receiver to notify the sender to release load buffers immediately after data transfer. This reclaims memory much earlier than the previous GC lease expiration mechanism, with GC acting as a safety fallback.
  • Eviction protection for offloading items: Added reference count protection for replicas in the offloading queue. The replica's reference count is incremented when it's queued for offloading and decremented upon task completion, revocation, or timeout, preventing premature eviction and potential data loss or use-after-free issues.


Changelog
  • mooncake-store/benchmarks/CMakeLists.txt
    • Added a new executable target for the file interface benchmark.
  • mooncake-store/benchmarks/file_interface_bench.cpp
    • Introduced a new benchmark utility to test various file I/O interfaces, including io_uring, O_DIRECT, and zero-copy options, with detailed performance statistics.
  • mooncake-store/include/aligned_client_buffer.hpp
    • Added a new AlignedClientBufferAllocator class, extending ClientBufferAllocator to provide 4096-byte aligned memory allocation for O_DIRECT I/O with io_uring.
  • mooncake-store/include/client_buffer.hpp
    • Modified ClientBufferAllocator to make its internal buffer management members protected, allowing derived classes like AlignedClientBufferAllocator to manage memory alignment.
  • mooncake-store/include/file_interface.h
    • Included mutex.h and liburing.h for io_uring support.
    • Declared the UringFile class, providing an io_uring-backed implementation of StorageFile with support for aligned I/O, batch reads, and buffer registration.
  • mooncake-store/include/file_storage.h
    • Defined BatchGetResult struct to include a batch_id and buffer pointers.
    • Updated FileStorage::BatchGet to return BatchGetResult.
    • Added ReleaseBuffer method to FileStorage for explicit buffer reclamation.
    • Changed BatchLoad parameter to a non-const reference.
    • Modified AllocatedBatch struct to include batch_id and updated client_buffer_allocated_batches_ to be an unordered_map keyed by batch_id.
  • mooncake-store/include/master_service.h
    • Defined OffloadingTask struct to track in-flight offloading operations.
    • Added an offloading_tasks map to MetadataShard to manage these tasks.
    • Changed PushOffloadingQueue parameter to a non-const reference.
  • mooncake-store/include/pyclient.h
    • Added release_offload_buffer method to ClientRequester for RPC-based buffer release.
  • mooncake-store/include/real_client.h
    • Added release_offload_buffer method to RealClient to handle RPC requests for releasing offload buffers.
  • mooncake-store/include/rpc_types.h
    • Modified BatchGetOffloadObjectResponse to include a batch_id.
  • mooncake-store/include/storage_backend.h
    • Added use_uring boolean flag to FileStorageConfig and StorageBackend to enable/disable io_uring.
    • Changed BatchLoad parameters in StorageBackendInterface, StorageBackendAdaptor, BucketStorageBackend, and OffsetAllocatorStorageBackend to non-const references.
    • Added a destructor, file cache, GetOrOpenFile method, and aligned I/O helper functions and buffer to BucketStorageBackend.
  • mooncake-store/src/CMakeLists.txt
    • Added logic to detect liburing and conditionally compile uring_file.cpp and define USE_URING.
    • Included aligned_client_buffer.cpp in the list of source files.
  • mooncake-store/src/aligned_client_buffer.cpp
    • Implemented the AlignedClientBufferAllocator class, providing aligned memory allocation using posix_memalign or huge pages, and ensuring proper cleanup.
  • mooncake-store/src/file_storage.cpp
    • Integrated AlignedClientBufferAllocator for the client_buffer_allocator_.
    • Registered the global client buffer with UringFile for io_uring fixed-buffer I/O if enabled.
    • Modified BatchGet to return BatchGetResult and adjust slice pointers for O_DIRECT aligned reads.
    • Updated ClientBufferGCThreadFunc to use the batch_id for logging and iterate over the unordered_map.
    • Implemented ReleaseBuffer to allow explicit, early release of allocated client buffers.
  • mooncake-store/src/master_service.cpp
    • Modified ClearInvalidHandles to also remove expired offloading tasks.
    • Updated PutEnd to increment replica reference counts and add offloading tasks to the offloading_tasks map when an object is queued for offloading.
    • Added logic to NotifyOffloadSuccess to decrement replica reference counts and clear corresponding offloading tasks.
    • Implemented logic in DiscardExpiredProcessingReplicas to decrement replica reference counts and remove expired offloading tasks.
  • mooncake-store/src/real_client.cpp
    • Implemented the release_offload_buffer RPC handler.
    • Modified batch_get_into_offload_object_internal to send a fire-and-forget release_offload_buffer RPC to the sender after data transfer, enabling early buffer reclamation.
  • mooncake-store/src/real_client_main.cpp
    • Registered the release_offload_buffer RPC handler with the server.
  • mooncake-store/src/storage_backend.cpp
    • Modified create_file to conditionally use O_DIRECT for reads and return UringFile instances when io_uring is enabled.
    • Updated StorageBackendAdaptor to pass the use_uring flag to the underlying StorageBackend.
    • Modified BucketStorageBackend::BatchLoad to utilize UringFile::read_aligned for zero-copy O_DIRECT reads and adjust slice pointers accordingly.
    • Updated BucketStorageBackend::WriteBucket to use UringFile::write_aligned, handle aligned buffer preparation, and perform datasync for durability.
    • Implemented GetOrOpenFile and ClearFileCache in BucketStorageBackend for efficient file handling.
    • Modified OffsetAllocatorStorageBackend::Init to use UringFile for its data file if io_uring is enabled.
  • mooncake-store/src/uring_file.cpp
    • Implemented the UringFile and SharedUringRing classes, providing a per-thread io_uring interface for file I/O.
    • Included support for fixed-buffer I/O, aligned reads/writes, batch reads, and datasync operations.
    • Implemented global buffer registration with madvise(MADV_NOHUGEPAGE) to ensure reliable pinning.
  • mooncake-store/tests/file_storage_test.cpp
    • Changed the batch_object parameter in FileStorageBatchLoad to a non-const reference to align with updated API.

@gemini-code-assist Bot left a comment
Code Review

This pull request introduces three significant optimizations for SSD offloading in Mooncake Store, including a shift to a per-thread io_uring model, early buffer release via RPC for improved memory efficiency, and eviction protection for offloading tasks. While these changes aim for performance and correctness, they also introduce several critical and high-severity security vulnerabilities. Specifically, a heap buffer overflow exists in the BatchGet path due to insufficient validation of user-supplied sizes and integer overflow during buffer allocation. Additionally, the new RPC mechanisms for early buffer release and offload notification lack proper authorization and ownership checks, leading to IDOR and replica injection vulnerabilities. These security issues must be addressed by implementing strict input validation, using non-predictable identifiers, and enforcing authorization checks for RPC callers. Furthermore, there are minor suggestions to improve code clarity in storage_backend.cpp related to std::unique_ptr usage.

Comment on lines 502 to +512

         assert(sizes[i] <= kMaxSliceSize);
    -    auto alloc_result = client_buffer_allocator_->allocate(sizes[i]);
    +
    +    // Allocate oversized buffer for O_DIRECT alignment:
    +    // +4096 for aligning the ptr to 4096 boundary
    +    // +4096 for aligned read tail padding (actual_offset may not be aligned)
    +    size_t data_size = static_cast<size_t>(sizes[i]);
    +    size_t alloc_size =
    +        align_up(data_size, kDirectIOAlignment) + 2 * kDirectIOAlignment;
    +
    +    auto alloc_result = client_buffer_allocator_->allocate(alloc_size);
Severity: critical (security)

The AllocateBatch function is vulnerable to a heap buffer overflow due to an integer overflow when calculating the allocation size. The sizes[i] parameter, which can be controlled by a remote attacker via the batch_get_offload_object RPC, is an int64_t. If a negative value (e.g., -1) is provided, the assert(sizes[i] <= kMaxSliceSize) on line 502 will pass (since -1 is less than any positive limit), but static_cast<size_t>(sizes[i]) will result in a very large value (e.g., 0xFFFFFFFFFFFFFFFF). The subsequent align_up call on line 510 will then overflow, resulting in a small alloc_size (e.g., 8192 bytes). However, the Slice object created on line 545 still uses the original huge data_size. When this slice is later used in BucketStorageBackend::BatchLoad (storage_backend.cpp:1478) with uring_file->read_aligned, the io_uring operation will attempt to read a massive amount of data into the small 8KB buffer, leading to a heap buffer overflow and potential remote code execution.

Comment on lines +581 to +593

    bool FileStorage::ReleaseBuffer(uint64_t batch_id) {
        MutexLocker locker(&client_buffer_mutex_);
        auto it = client_buffer_allocated_batches_.find(batch_id);
        if (it != client_buffer_allocated_batches_.end()) {
            VLOG(1) << "Releasing buffer for batch_id: " << batch_id
                    << " (transfer completed)";
            client_buffer_allocated_batches_.erase(it);
            return true;
        }
        VLOG(1) << "batch_id " << batch_id
                << " not found (may have been GC'd already)";
        return false;
    }
Severity: high (security)

The ReleaseBuffer function (and the corresponding release_offload_buffer RPC handler in real_client.cpp) is vulnerable to an Insecure Direct Object Reference (IDOR) flaw. The batch_id is a sequential, predictable integer generated using a simple counter (next_batch_id_). Since the RPC handler does not perform any ownership or authorization checks, any client can guess batch_id values and release buffers belonging to other users. This can be used to cause a Denial of Service (DoS) by failing ongoing data transfers, or potentially lead to memory corruption if the released memory is re-allocated while still being accessed by the transfer engine.

Comment on lines +1262 to +1265

        aligned_io_buffer_.reset(buf);
        // Update the deleter to use free
        aligned_io_buffer_ = std::unique_ptr<void, void (*)(void*)>(
            buf, [](void* p) { free(p); });
Severity: medium

The call to aligned_io_buffer_.reset(buf) on line 1262 is redundant because the unique_ptr is immediately reassigned on the next line. This reassignment correctly handles ownership and sets the custom deleter. Removing the unnecessary reset() call will make the code clearer.

        // Update the deleter to use free
        aligned_io_buffer_ = std::unique_ptr<void, void (*)(void*)>(
            buf, [](void* p) { free(p); });

Comment on lines +1976 to +1978

        temp_buffer.reset(buf);
        temp_buffer = std::unique_ptr<void, void (*)(void*)>(
            buf, [](void* p) { free(p); });
Severity: medium

The call to temp_buffer.reset(buf) on line 1976 is redundant. The unique_ptr is immediately reassigned on the next line, which correctly handles ownership and sets the custom deleter. To improve clarity, the reset() call can be removed.

            temp_buffer = std::unique_ptr<void, void (*)(void*)>(
                buf, [](void* p) { free(p); });

@codecov-commenter

codecov-commenter commented Feb 26, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 4.02542% with 453 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mooncake-store/src/uring_file.cpp 0.00% 366 Missing ⚠️
mooncake-store/src/file_storage.cpp 5.12% 37 Missing ⚠️
mooncake-store/src/master_service.cpp 37.14% 22 Missing ⚠️
mooncake-store/src/real_client.cpp 0.00% 19 Missing ⚠️
mooncake-store/include/rpc_types.h 0.00% 5 Missing ⚠️
mooncake-store/src/storage_backend.cpp 50.00% 3 Missing ⚠️
mooncake-store/src/real_client_main.cpp 0.00% 1 Missing ⚠️


@ykwd ykwd requested a review from zhuxinjie-nz February 28, 2026 06:14
@ykwd (Collaborator) left a comment

LGTM

@ykwd ykwd merged commit e7f6bad into kvcache-ai:main Mar 12, 2026
26 of 30 checks passed
whn09 pushed a commit to whn09/Mooncake that referenced this pull request Apr 4, 2026
Co-authored-by: zhuxinjie-nz <240190801+zhuxinjie-nz@users.noreply.github.com>