
[Store] Optimize uring file support for SSD offloading #1562

Merged
ykwd merged 37 commits into kvcache-ai:main from zhangzuo21:better-loadbuffer-management
Mar 12, 2026

Conversation

@zhangzuo21 (Collaborator)

Description

Type of Change

  • Types
    • Bug fix
    • [✅] New feature
      • Transfer Engine
      • [✅] Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

Checklist

  • [✅] I have performed a self-review of my own code.
  • [✅] I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Summary

This PR contains three independent improvements to the offload (SSD-backed) path in Mooncake Store.

1. Per-thread io_uring ring (uring_file.cpp)

Previously each UringFile instance owned its own io_uring ring, which meant every file open/close incurred the overhead of io_uring_queue_init / io_uring_queue_exit system calls, and concurrent I/O across threads required mutex protection on the shared ring.

This PR replaces the per-file ring with a thread_local SharedUringRing singleton. Each thread initializes exactly one ring on first use and the ring lives for the lifetime of that thread. Key benefits:

  • Zero mutex contention: each ring is exclusively owned by its thread; no locking is needed for I/O operations.
  • Drastically reduced syscall overhead: ring init/exit no longer happens on every file open/close.
  • Better NVMe queue depth utilization: within a single thread, multiple SQEs can be batched and submitted together (up to QUEUE_DEPTH=32), exposing deeper device parallelism.
  • Fixed-buffer registration: the registered memory buffer is stored globally (g_buf) and lazily registered on each thread-local ring on first use, maintaining zero-copy I/O across all threads.

2. Early load buffer release via RPC (real_client.cpp, batch_get_into_offload_object_internal)

When an object is loaded from SSD and transferred from one client to another over the network, the sender holds a load buffer for that object until a GC lease TTL expires. Previously the buffer was not reclaimed until this timer fired.

This PR adds a fire-and-forget release_offload_buffer RPC call sent by the receiver immediately after the data transfer completes. The sender releases the load buffer upon receiving this RPC, reclaiming the memory much earlier than the GC deadline. If the RPC fails (e.g., network issue), the existing GC mechanism acts as a safety backstop.

3. Eviction protection for items in the offloading queue (master_service.cpp)

Items queued for offloading (SSD write) could previously be evicted from memory before the offloading task completed, leading to potential data loss or use-after-free bugs.

This PR increments the replica's reference count (inc_refcnt) when a replica is pushed onto the offloading queue, and decrements it (dec_refcnt) when the task completes, is revoked, or times out. The eviction logic already requires get_refcnt() == 0 before a replica can be evicted, so this change provides a clean, race-free protection window for all in-flight offloading tasks. The same refcount protection pattern is also applied symmetrically to COPY and MOVE replication tasks.

zhuxinjie-nz and others added 30 commits December 12, 2025 13:46
@gemini-code-assist (Contributor)

Summary of Changes

Hello @zhangzuo21, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Mooncake Store's SSD offloading path by introducing several optimizations. It improves I/O performance and concurrency through a new per-thread io_uring architecture, optimizes memory utilization by enabling early release of load buffers, and strengthens data integrity by protecting in-flight offloading tasks from eviction. These changes collectively aim to make the offloading mechanism more efficient, robust, and performant.

Highlights

  • Per-thread io_uring ring: Replaced per-file io_uring rings with a thread-local singleton SharedUringRing. This eliminates mutex contention, drastically reduces syscall overhead for file open/close, and improves NVMe queue depth utilization by batching I/O operations within a thread. It also enables fixed-buffer registration for zero-copy I/O.
  • Early load buffer release via RPC: Implemented a fire-and-forget RPC call (release_offload_buffer) that allows a receiver to notify the sender to release load buffers immediately after data transfer. This reclaims memory much earlier than the previous GC lease expiration mechanism, with GC acting as a safety fallback.
  • Eviction protection for offloading items: Added reference count protection for replicas in the offloading queue. The replica's reference count is incremented when it's queued for offloading and decremented upon task completion, revocation, or timeout, preventing premature eviction and potential data loss or use-after-free issues.


Changelog
  • mooncake-store/benchmarks/CMakeLists.txt
    • Added a new executable target for the file interface benchmark.
  • mooncake-store/benchmarks/file_interface_bench.cpp
    • Introduced a new benchmark utility to test various file I/O interfaces, including io_uring, O_DIRECT, and zero-copy options, with detailed performance statistics.
  • mooncake-store/include/aligned_client_buffer.hpp
    • Added a new AlignedClientBufferAllocator class, extending ClientBufferAllocator to provide 4096-byte aligned memory allocation for O_DIRECT I/O with io_uring.
  • mooncake-store/include/client_buffer.hpp
    • Modified ClientBufferAllocator to make its internal buffer management members protected, allowing derived classes like AlignedClientBufferAllocator to manage memory alignment.
  • mooncake-store/include/file_interface.h
    • Included mutex.h and liburing.h for io_uring support.
    • Declared the UringFile class, providing an io_uring-backed implementation of StorageFile with support for aligned I/O, batch reads, and buffer registration.
  • mooncake-store/include/file_storage.h
    • Defined BatchGetResult struct to include a batch_id and buffer pointers.
    • Updated FileStorage::BatchGet to return BatchGetResult.
    • Added ReleaseBuffer method to FileStorage for explicit buffer reclamation.
    • Changed BatchLoad parameter to a non-const reference.
    • Modified AllocatedBatch struct to include batch_id and updated client_buffer_allocated_batches_ to be an unordered_map keyed by batch_id.
  • mooncake-store/include/master_service.h
    • Defined OffloadingTask struct to track in-flight offloading operations.
    • Added an offloading_tasks map to MetadataShard to manage these tasks.
    • Changed PushOffloadingQueue parameter to a non-const reference.
  • mooncake-store/include/pyclient.h
    • Added release_offload_buffer method to ClientRequester for RPC-based buffer release.
  • mooncake-store/include/real_client.h
    • Added release_offload_buffer method to RealClient to handle RPC requests for releasing offload buffers.
  • mooncake-store/include/rpc_types.h
    • Modified BatchGetOffloadObjectResponse to include a batch_id.
  • mooncake-store/include/storage_backend.h
    • Added use_uring boolean flag to FileStorageConfig and StorageBackend to enable/disable io_uring.
    • Changed BatchLoad parameters in StorageBackendInterface, StorageBackendAdaptor, BucketStorageBackend, and OffsetAllocatorStorageBackend to non-const references.
    • Added a destructor, file cache, GetOrOpenFile method, and aligned I/O helper functions and buffer to BucketStorageBackend.
  • mooncake-store/src/CMakeLists.txt
    • Added logic to detect liburing and conditionally compile uring_file.cpp and define USE_URING.
    • Included aligned_client_buffer.cpp in the list of source files.
  • mooncake-store/src/aligned_client_buffer.cpp
    • Implemented the AlignedClientBufferAllocator class, providing aligned memory allocation using posix_memalign or huge pages, and ensuring proper cleanup.
  • mooncake-store/src/file_storage.cpp
    • Integrated AlignedClientBufferAllocator for the client_buffer_allocator_.
    • Registered the global client buffer with UringFile for io_uring fixed-buffer I/O if enabled.
    • Modified BatchGet to return BatchGetResult and adjust slice pointers for O_DIRECT aligned reads.
    • Updated ClientBufferGCThreadFunc to use the batch_id for logging and iterate over the unordered_map.
    • Implemented ReleaseBuffer to allow explicit, early release of allocated client buffers.
  • mooncake-store/src/master_service.cpp
    • Modified ClearInvalidHandles to also remove expired offloading tasks.
    • Updated PutEnd to increment replica reference counts and add offloading tasks to the offloading_tasks map when an object is queued for offloading.
    • Added logic to NotifyOffloadSuccess to decrement replica reference counts and clear corresponding offloading tasks.
    • Implemented logic in DiscardExpiredProcessingReplicas to decrement replica reference counts and remove expired offloading tasks.
  • mooncake-store/src/real_client.cpp
    • Implemented the release_offload_buffer RPC handler.
    • Modified batch_get_into_offload_object_internal to send a fire-and-forget release_offload_buffer RPC to the sender after data transfer, enabling early buffer reclamation.
  • mooncake-store/src/real_client_main.cpp
    • Registered the release_offload_buffer RPC handler with the server.
  • mooncake-store/src/storage_backend.cpp
    • Modified create_file to conditionally use O_DIRECT for reads and return UringFile instances when io_uring is enabled.
    • Updated StorageBackendAdaptor to pass the use_uring flag to the underlying StorageBackend.
    • Modified BucketStorageBackend::BatchLoad to utilize UringFile::read_aligned for zero-copy O_DIRECT reads and adjust slice pointers accordingly.
    • Updated BucketStorageBackend::WriteBucket to use UringFile::write_aligned, handle aligned buffer preparation, and perform datasync for durability.
    • Implemented GetOrOpenFile and ClearFileCache in BucketStorageBackend for efficient file handling.
    • Modified OffsetAllocatorStorageBackend::Init to use UringFile for its data file if io_uring is enabled.
  • mooncake-store/src/uring_file.cpp
    • Implemented the UringFile and SharedUringRing classes, providing a per-thread io_uring interface for file I/O.
    • Included support for fixed-buffer I/O, aligned reads/writes, batch reads, and datasync operations.
    • Implemented global buffer registration with madvise(MADV_NOHUGEPAGE) to ensure reliable pinning.
  • mooncake-store/tests/file_storage_test.cpp
    • Changed the batch_object parameter in FileStorageBatchLoad to a non-const reference to align with updated API.

@gemini-code-assist Bot left a comment
Code Review

This pull request introduces three significant optimizations for SSD offloading in Mooncake Store, including a shift to a per-thread io_uring model, early buffer release via RPC for improved memory efficiency, and eviction protection for offloading tasks. While these changes aim for performance and correctness, they also introduce several critical and high-severity security vulnerabilities. Specifically, a heap buffer overflow exists in the BatchGet path due to insufficient validation of user-supplied sizes and integer overflow during buffer allocation. Additionally, the new RPC mechanisms for early buffer release and offload notification lack proper authorization and ownership checks, leading to IDOR and replica injection vulnerabilities. These security issues must be addressed by implementing strict input validation, using non-predictable identifiers, and enforcing authorization checks for RPC callers. Furthermore, there are minor suggestions to improve code clarity in storage_backend.cpp related to std::unique_ptr usage.

Comment on lines 502 to +512

         assert(sizes[i] <= kMaxSliceSize);
    -    auto alloc_result = client_buffer_allocator_->allocate(sizes[i]);
    +
    +    // Allocate oversized buffer for O_DIRECT alignment:
    +    // +4096 for aligning the ptr to 4096 boundary
    +    // +4096 for aligned read tail padding (actual_offset may not be aligned)
    +    size_t data_size = static_cast<size_t>(sizes[i]);
    +    size_t alloc_size =
    +        align_up(data_size, kDirectIOAlignment) + 2 * kDirectIOAlignment;
    +
    +    auto alloc_result = client_buffer_allocator_->allocate(alloc_size);
Severity: critical (security)

The AllocateBatch function is vulnerable to a heap buffer overflow due to an integer overflow when calculating the allocation size. The sizes[i] parameter, which can be controlled by a remote attacker via the batch_get_offload_object RPC, is an int64_t. If a negative value (e.g., -1) is provided, the assert(sizes[i] <= kMaxSliceSize) on line 502 will pass (since -1 is less than any positive limit), but static_cast<size_t>(sizes[i]) will result in a very large value (e.g., 0xFFFFFFFFFFFFFFFF). The subsequent align_up call on line 510 will then overflow, resulting in a small alloc_size (e.g., 8192 bytes). However, the Slice object created on line 545 still uses the original huge data_size. When this slice is later used in BucketStorageBackend::BatchLoad (storage_backend.cpp:1478) with uring_file->read_aligned, the io_uring operation will attempt to read a massive amount of data into the small 8KB buffer, leading to a heap buffer overflow and potential remote code execution.

Comment on lines +581 to +593

    bool FileStorage::ReleaseBuffer(uint64_t batch_id) {
        MutexLocker locker(&client_buffer_mutex_);
        auto it = client_buffer_allocated_batches_.find(batch_id);
        if (it != client_buffer_allocated_batches_.end()) {
            VLOG(1) << "Releasing buffer for batch_id: " << batch_id
                    << " (transfer completed)";
            client_buffer_allocated_batches_.erase(it);
            return true;
        }
        VLOG(1) << "batch_id " << batch_id
                << " not found (may have been GC'd already)";
        return false;
    }
Severity: high (security)

The ReleaseBuffer function (and the corresponding release_offload_buffer RPC handler in real_client.cpp) is vulnerable to an Insecure Direct Object Reference (IDOR) flaw. The batch_id is a sequential, predictable integer generated using a simple counter (next_batch_id_). Since the RPC handler does not perform any ownership or authorization checks, any client can guess batch_id values and release buffers belonging to other users. This can be used to cause a Denial of Service (DoS) by failing ongoing data transfers, or potentially lead to memory corruption if the released memory is re-allocated while still being accessed by the transfer engine.

Comment on lines +1262 to +1265

        aligned_io_buffer_.reset(buf);
        // Update the deleter to use free
        aligned_io_buffer_ = std::unique_ptr<void, void (*)(void*)>(
            buf, [](void* p) { free(p); });
Severity: medium

The call to aligned_io_buffer_.reset(buf) on line 1262 is redundant because the unique_ptr is immediately reassigned on the next line. This reassignment correctly handles ownership and sets the custom deleter. Removing the unnecessary reset() call will make the code clearer.

        // Update the deleter to use free
        aligned_io_buffer_ = std::unique_ptr<void, void (*)(void*)>(
            buf, [](void* p) { free(p); });

Comment on lines +1976 to +1978

        temp_buffer.reset(buf);
        temp_buffer = std::unique_ptr<void, void (*)(void*)>(
            buf, [](void* p) { free(p); });
Severity: medium

The call to temp_buffer.reset(buf) on line 1976 is redundant. The unique_ptr is immediately reassigned on the next line, which correctly handles ownership and sets the custom deleter. To improve clarity, the reset() call can be removed.

            temp_buffer = std::unique_ptr<void, void (*)(void*)>(
                buf, [](void* p) { free(p); });

@codecov-commenter

codecov-commenter commented Feb 26, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 4.02542% with 453 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mooncake-store/src/uring_file.cpp 0.00% 366 Missing ⚠️
mooncake-store/src/file_storage.cpp 5.12% 37 Missing ⚠️
mooncake-store/src/master_service.cpp 37.14% 22 Missing ⚠️
mooncake-store/src/real_client.cpp 0.00% 19 Missing ⚠️
mooncake-store/include/rpc_types.h 0.00% 5 Missing ⚠️
mooncake-store/src/storage_backend.cpp 50.00% 3 Missing ⚠️
mooncake-store/src/real_client_main.cpp 0.00% 1 Missing ⚠️


@ykwd ykwd requested a review from zhuxinjie-nz February 28, 2026 06:14
@ykwd (Collaborator) left a comment

LGTM

@ykwd ykwd merged commit e7f6bad into kvcache-ai:main Mar 12, 2026
26 of 30 checks passed
whn09 pushed a commit to whn09/Mooncake that referenced this pull request Apr 4, 2026
Co-authored-by: zhuxinjie-nz <240190801+zhuxinjie-nz@users.noreply.github.com>