[Store] Optimize uring file support for SSD offloading #1562
**Summary of Changes (Gemini Code Assist):** This pull request enhances the Mooncake Store's SSD offloading path with several optimizations. It improves I/O performance and concurrency through a new per-thread io_uring architecture, optimizes memory utilization by enabling early release of load buffers, and strengthens data integrity by protecting in-flight offloading tasks from eviction. Together, these changes aim to make the offloading mechanism more efficient, robust, and performant.
Code Review
This pull request introduces three significant optimizations for SSD offloading in Mooncake Store, including a shift to a per-thread io_uring model, early buffer release via RPC for improved memory efficiency, and eviction protection for offloading tasks. While these changes aim for performance and correctness, they also introduce several critical and high-severity security vulnerabilities. Specifically, a heap buffer overflow exists in the BatchGet path due to insufficient validation of user-supplied sizes and integer overflow during buffer allocation. Additionally, the new RPC mechanisms for early buffer release and offload notification lack proper authorization and ownership checks, leading to IDOR and replica injection vulnerabilities. These security issues must be addressed by implementing strict input validation, using non-predictable identifiers, and enforcing authorization checks for RPC callers. Furthermore, there are minor suggestions to improve code clarity in storage_backend.cpp related to std::unique_ptr usage.
```diff
     assert(sizes[i] <= kMaxSliceSize);
-    auto alloc_result = client_buffer_allocator_->allocate(sizes[i]);
+    // Allocate oversized buffer for O_DIRECT alignment:
+    // +4096 for aligning the ptr to 4096 boundary
+    // +4096 for aligned read tail padding (actual_offset may not be
+    // aligned)
+    size_t data_size = static_cast<size_t>(sizes[i]);
+    size_t alloc_size =
+        align_up(data_size, kDirectIOAlignment) + 2 * kDirectIOAlignment;
+    auto alloc_result = client_buffer_allocator_->allocate(alloc_size);
```
The AllocateBatch function is vulnerable to a heap buffer overflow due to an integer overflow when calculating the allocation size. The sizes[i] parameter, which can be controlled by a remote attacker via the batch_get_offload_object RPC, is an int64_t. If a negative value (e.g., -1) is provided, the assert(sizes[i] <= kMaxSliceSize) on line 502 will pass (since -1 is less than any positive limit), but static_cast<size_t>(sizes[i]) will result in a very large value (e.g., 0xFFFFFFFFFFFFFFFF). The subsequent align_up call on line 510 will then overflow, resulting in a small alloc_size (e.g., 8192 bytes). However, the Slice object created on line 545 still uses the original huge data_size. When this slice is later used in BucketStorageBackend::BatchLoad (storage_backend.cpp:1478) with uring_file->read_aligned, the io_uring operation will attempt to read a massive amount of data into the small 8KB buffer, leading to a heap buffer overflow and potential remote code execution.
```cpp
bool FileStorage::ReleaseBuffer(uint64_t batch_id) {
    MutexLocker locker(&client_buffer_mutex_);
    auto it = client_buffer_allocated_batches_.find(batch_id);
    if (it != client_buffer_allocated_batches_.end()) {
        VLOG(1) << "Releasing buffer for batch_id: " << batch_id
                << " (transfer completed)";
        client_buffer_allocated_batches_.erase(it);
        return true;
    }
    VLOG(1) << "batch_id " << batch_id
            << " not found (may have been GC'd already)";
    return false;
}
```
The ReleaseBuffer function (and the corresponding release_offload_buffer RPC handler in real_client.cpp) is vulnerable to an Insecure Direct Object Reference (IDOR) flaw. The batch_id is a sequential, predictable integer generated using a simple counter (next_batch_id_). Since the RPC handler does not perform any ownership or authorization checks, any client can guess batch_id values and release buffers belonging to other users. This can be used to cause a Denial of Service (DoS) by failing ongoing data transfers, or potentially lead to memory corruption if the released memory is re-allocated while still being accessed by the transfer engine.
```cpp
aligned_io_buffer_.reset(buf);
// Update the deleter to use free
aligned_io_buffer_ = std::unique_ptr<void, void (*)(void*)>(
    buf, [](void* p) { free(p); });
```
The call to aligned_io_buffer_.reset(buf) on line 1262 is redundant because the unique_ptr is immediately reassigned on the next line. This reassignment correctly handles ownership and sets the custom deleter. Removing the unnecessary reset() call will make the code clearer.
```cpp
// Update the deleter to use free
aligned_io_buffer_ = std::unique_ptr<void, void (*)(void*)>(
    buf, [](void* p) { free(p); });
```

```cpp
temp_buffer.reset(buf);
temp_buffer = std::unique_ptr<void, void (*)(void*)>(
    buf, [](void* p) { free(p); });
```
The call to temp_buffer.reset(buf) on line 1976 is redundant. The unique_ptr is immediately reassigned on the next line, which correctly handles ownership and sets the custom deleter. To improve clarity, the reset() call can be removed.
```cpp
temp_buffer = std::unique_ptr<void, void (*)(void*)>(
    buf, [](void* p) { free(p); });
```
Co-authored-by: zhuxinjie-nz <240190801+zhuxinjie-nz@users.noreply.github.com>
Description

Type of Change

How Has This Been Tested?

Checklist

- I have run `./scripts/code_format.sh` before submitting.

Summary
This PR contains three independent improvements to the offload (SSD-backed) path in Mooncake Store.
1. Per-thread io_uring ring (`uring_file.cpp`)

Previously each `UringFile` instance owned its own `io_uring` ring, which meant every file open/close incurred the overhead of `io_uring_queue_init`/`io_uring_queue_exit` system calls, and concurrent I/O across threads required mutex protection on the shared ring.

This PR replaces the per-file ring with a `thread_local SharedUringRing` singleton. Each thread initializes exactly one ring on first use and the ring lives for the lifetime of that thread. Key benefits:

- Each thread gets the full queue depth (`QUEUE_DEPTH=32`), exposing deeper device parallelism.
- The registered buffer (`g_buf`) is shared across threads and lazily registered on each thread-local ring on first use, maintaining zero-copy I/O across all threads.

2. Early load buffer release via RPC (`real_client.cpp` — `batch_get_into_offload_object_internal`)

When an object is loaded from SSD and transferred from one client to another over the network, the sender holds a load buffer for that object until a GC lease TTL expires. Previously the buffer was not reclaimed until this timer fired.

This PR adds a fire-and-forget `release_offload_buffer` RPC call sent by the receiver immediately after the data transfer completes. The sender releases the load buffer upon receiving this RPC, reclaiming the memory much earlier than the GC deadline. If the RPC fails (e.g., due to a network issue), the existing GC mechanism acts as a safety backstop.

3. Eviction protection for items in the offloading queue (`master_service.cpp`)

Items queued for offloading (SSD write) could previously be evicted from memory before the offloading task completed, leading to potential data loss or use-after-free bugs.

This PR increments the replica's reference count (`inc_refcnt`) when a replica is pushed onto the offloading queue, and decrements it (`dec_refcnt`) when the task completes, is revoked, or times out. The eviction logic already requires `get_refcnt() == 0` before a replica can be evicted, so this change provides a clean, race-free protection window for all in-flight offloading tasks. The same refcount protection pattern is also applied symmetrically to COPY and MOVE replication tasks.