Enhance Store and TransferEngine with health check, metrics, and NUMA support by XingSong-Sun · Pull Request #1 · XingSong-Sun/Mooncake

XingSong-Sun · 2026-03-13T07:15:28Z

Description

Module

Type of Change

How Has This Been Tested?

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before submitting.
I have updated the documentation.
I have added tests to prove my changes are effective.


#1597)

* [TransferEngine] Fix RDMA GID auto-discovery for IPv6 and reduce spurious errors

  Fixes #1593 and #1592

  Changes:
  1. Accept all RoCE v2 GIDs (both IPv4-mapped and pure IPv6) instead of only IPv4-mapped GIDs. This allows Transfer Engine to work in IPv6-only environments.
  2. Stop querying GID indices on the first failure instead of iterating through all 256 possible indices. This eliminates spurious 'Failed to query GID' error logs when devices have fewer GIDs.
  3. Allow user-specified GIDs without network devices. When MC_GID_INDEX is explicitly set, log a warning but continue initialization instead of failing. Users setting this variable know their configuration.

* update
* update
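The first two changes in the GID commit above can be sketched as a scan loop. This is only an illustration under stated assumptions: `GidEntry`, `collectRoceV2GidIndices`, and the `query` callback are hypothetical names standing in for whatever the Transfer Engine uses to query the device, and no real verbs call is made.

```cpp
#include <functional>
#include <vector>

// Hypothetical view of one GID table entry; the field names are illustrative.
struct GidEntry {
    int index;
    bool is_roce_v2;      // true for RoCE v2 entries, IPv4-mapped or pure IPv6
    bool is_ipv4_mapped;  // kept only to show that it no longer filters anything
};

// `query` is an assumed stand-in for the device query. It returns false when
// the device has no GID at the given index.
std::vector<int> collectRoceV2GidIndices(
    const std::function<bool(int, GidEntry&)>& query, int max_index = 256) {
    std::vector<int> accepted;
    for (int i = 0; i < max_index; ++i) {
        GidEntry entry{};
        if (!query(i, entry)) {
            break;  // stop on the first failure instead of probing all 256 slots
        }
        if (entry.is_roce_v2) {
            accepted.push_back(i);  // accept both IPv4-mapped and pure IPv6 GIDs
        }
    }
    return accepted;
}
```

With a device that exposes three GIDs, the callback is invoked four times (three hits and one miss), so the remaining indices produce no error logs.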


#1634)

* add /metrics and /metrics/summary HTTP endpoints to RealClient
* add integration tests for /metrics and /metrics/summary endpoints
* apply clang-format to metrics endpoint handlers
* add metrics data correctness test with put/get verification

  Merge the metrics endpoint test and a new transfer stats verification test into a single test case to avoid RealClient setup/teardown resource contention that caused segfaults with 8 sequential tests. The combined test verifies:
  - /metrics and /metrics/summary return 200 before any transfers
  - After put/get, Prometheus output contains write_bytes, read_bytes, put_latency_count, and get_latency_count
  - Summary output shows Put and Get sections

---------

Co-authored-by: haodedu <haodedu@tencent.com>


…as (#1626)

DiscardedReplicas iterates all replicas and calls get_memory_buffer_size(), which logs "Invalid replica type: DISK" for non-memory replicas. In L3/DFS scenarios where DISK replicas are routinely discarded, this produces continuous error log spam.

Guard the call with is_memory_replica() so only memory replicas contribute to the size counter. Functionally equivalent (non-memory types already return 0) but eliminates the misleading error logs.

Also add LOCAL_DISK to the ReplicaType operator<< string map. It was missing, causing LOCAL_DISK to print as "UNKNOWN" in diagnostics.

Fixes #1618
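A minimal sketch of the guard and the stream-operator fix described above. The `Replica` struct, the enumerator set, and `discardedMemoryBytes` are simplified assumptions for illustration; the real Store types carry more state, and the real operator uses a string map rather than a `switch`.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Assumed, simplified enumerators; only the names mentioned in the commit.
enum class ReplicaType { MEMORY, DISK, LOCAL_DISK };

std::ostream& operator<<(std::ostream& os, ReplicaType type) {
    switch (type) {
        case ReplicaType::MEMORY:     return os << "MEMORY";
        case ReplicaType::DISK:       return os << "DISK";
        case ReplicaType::LOCAL_DISK: return os << "LOCAL_DISK";  // the entry that was missing
    }
    return os << "UNKNOWN";
}

// Hypothetical replica with just enough state for the example.
struct Replica {
    ReplicaType type;
    std::uint64_t buffer_size;
    bool is_memory_replica() const { return type == ReplicaType::MEMORY; }
};

// Sums sizes of memory replicas only, so non-memory replicas never reach the
// size accessor that would log "Invalid replica type".
std::uint64_t discardedMemoryBytes(const std::vector<Replica>& replicas) {
    std::uint64_t total = 0;
    for (const Replica& replica : replicas) {
        if (!replica.is_memory_replica()) continue;  // the guard
        total += replica.buffer_size;
    }
    return total;
}

// Small helper so the operator<< output can be inspected as a string.
std::string replicaTypeName(ReplicaType type) {
    std::ostringstream os;
    os << type;
    return os.str();
}
```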


* Fix IPC offset corruption for sub-allocated GPU tensors in NVLink transport

  When framework caching allocators (PyTorch, etc.) sub-allocate tensors within larger cudaMalloc segments, cudaIpcGetMemHandle returns a handle for the entire segment, not the sub-allocation. The existing code stored the sub-allocation address as the buffer base, causing relocateSharedMemoryAddress to compute an incorrect offset on the target side. Small tensors (<1MB) sharing a segment would read from the wrong location.

  Fix: use cuMemGetAddressRange() to resolve the true cudaMalloc base address before registration. Register at segment granularity and skip duplicate registrations when multiple tensors share the same segment.

* Address review: fix unregister and insert-before-confirm
  - unregisterLocalMemory: resolve base address via cuMemGetAddressRange before erasing from tracking set and metadata (matches register path)
  - registerLocalMemory: insert into registered_base_addrs_ only after addLocalMemoryBuffer succeeds to avoid inconsistent state on failure

* Update mooncake-transfer-engine/src/transport/intranode_nvlink_transport/intranode_nvlink_transport.cpp

---------

Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>
Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>
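The registration logic described above (resolve the segment base, deduplicate, and record the base only after the buffer is added) can be sketched without CUDA. Everything here is hypothetical scaffolding: `SegmentRegistry`, `AddressRange`, and the two callbacks stand in for the transport class, the result of `cuMemGetAddressRange()`, and `addLocalMemoryBuffer` respectively.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>
#include <utility>

// Base address and size of the enclosing allocation (assumed shape).
struct AddressRange {
    std::uintptr_t base;
    std::size_t size;
};

class SegmentRegistry {
public:
    // resolve: stand-in for the base-address lookup; returns false on failure.
    // add: stand-in for publishing the buffer; returns false on failure.
    using Resolver = std::function<bool(std::uintptr_t, AddressRange&)>;
    using AddBuffer = std::function<bool(const AddressRange&)>;

    SegmentRegistry(Resolver resolve, AddBuffer add)
        : resolve_(std::move(resolve)), add_(std::move(add)) {}

    bool registerMemory(std::uintptr_t addr) {
        AddressRange range{};
        if (!resolve_(addr, range)) return false;
        if (registered_bases_.count(range.base) != 0) return true;  // segment already registered
        if (!add_(range)) return false;        // on failure the set stays untouched
        registered_bases_.insert(range.base);  // insert only after add succeeded
        return true;
    }

    bool unregisterMemory(std::uintptr_t addr) {
        AddressRange range{};
        if (!resolve_(addr, range)) return false;  // same resolution as the register path
        return registered_bases_.erase(range.base) > 0;
    }

    std::size_t registeredSegments() const { return registered_bases_.size(); }

private:
    Resolver resolve_;
    AddBuffer add_;
    std::set<std::uintptr_t> registered_bases_;
};
```

Two tensors that live in the same segment trigger a single `add` call, and the offset of each tensor is then measured from the segment base rather than from the tensor's own address.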

…NIC utilization

In Standalone mode the Real Client's global_segment was allocated via aligned_alloc, landing all physical pages on a single NUMA node. The TransferEngine's selectDevice then only picked NICs local to that NUMA node, leaving the remaining NICs idle.

This patch introduces per-NUMA-region segment binding:
- New allocator (allocate_buffer_numa_segments): mmap contiguous VMA, mbind(MPOL_BIND) each region to a NIC-bearing NUMA node, prefault with madvise(MADV_POPULATE_WRITE); zero migration overhead.
- New location encoding ("segments:<page_size>:<n0>,<n1>,...") carried in buffer.name through metadata, parsed by both local and remote selectDevice to route each slice to the correct NUMA-local NIC.
- Activated via MC_SEGMENT_NUMA_NODES env var (e.g., "1,3,5,7"). Without it, the original allocation path is used (fully backward compatible).

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>

ibv_reg_mr() internally calls get_user_pages() which triggers page faults that respect the VMA-level mbind(MPOL_BIND) policy. The explicit madvise(MADV_POPULATE_WRITE) prefault was causing a redundant full-buffer traversal, doubling initialization time compared to the original path.

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>

Replace manual MC_SEGMENT_NUMA_NODES env var with automatic discovery from the TransferEngine's already-initialized Topology.

NUMA-segmented allocation now activates automatically when:
1. Real Client runs in standalone mode (ipc_socket_path is set)
2. RDMA NICs span more than one NUMA node

Add Client::GetNicNumaNodes() which extracts NIC-bearing NUMA nodes from the Topology matrix: zero sysfs access, zero extra discovery.

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>

NUMA-segmented mode only benefits physical RDMA NICs with real NUMA affinity. Virtual NICs (eRDMA) report numa_node=-1 and have no PCIe topology, making segmentation pointless. Gate the feature on protocol == "rdma" in addition to standalone mode check.

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
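Taken together, the activation conditions from the two commits above reduce to a small predicate. The sketch below is an assumption-laden illustration: `shouldUseNumaSegments` and `nicBearingNumaNodes` are hypothetical names, and the per-NIC NUMA list is passed in directly rather than extracted from the Topology matrix as Client::GetNicNumaNodes() is described to do.

```cpp
#include <set>
#include <string>
#include <vector>

// Distinct NUMA nodes that host at least one NIC. NICs reporting -1
// (no NUMA affinity) are ignored.
std::set<int> nicBearingNumaNodes(const std::vector<int>& nic_numa_nodes) {
    std::set<int> nodes;
    for (int node : nic_numa_nodes) {
        if (node >= 0) nodes.insert(node);
    }
    return nodes;
}

// True only when all three conditions from the commit messages hold:
// standalone mode, the "rdma" protocol, and NICs on more than one NUMA node.
bool shouldUseNumaSegments(const std::string& ipc_socket_path,
                           const std::string& protocol,
                           const std::vector<int>& nic_numa_nodes) {
    const bool standalone = !ipc_socket_path.empty();
    return standalone && protocol == "rdma" &&
           nicBearingNumaNodes(nic_numa_nodes).size() > 1;
}
```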

- Use numa_num_possible_nodes() instead of numa_num_configured_nodes() to avoid bitmask overflow on systems with sparse NUMA node IDs
- Fix mbind maxnode argument: mask->size (not mask->size + 1)
- Add try-catch in parseSegmentsLocation to handle malformed strings and skip empty tokens from trailing/double commas
- Merge duplicate hugepage_segment_ptrs_ branches for NUMA-segmented and hugepage allocations
- Fix docstring: remove incorrect prefault description

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
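The location encoding "segments:<page_size>:<n0>,<n1>,..." and the hardening listed above (try-catch for malformed input, skipping empty tokens) might look roughly like the following. This is a sketch, not the project's parseSegmentsLocation: the `SegmentsLocation` struct, the function name, and the choice to reject a zero page size or an empty node list are assumptions.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parse result; field names are illustrative.
struct SegmentsLocation {
    std::size_t page_size;
    std::vector<int> numa_nodes;
};

// True when the string is non-empty and consists of decimal digits only.
bool allDigits(const std::string& s) {
    return !s.empty() && std::all_of(s.begin(), s.end(), [](unsigned char c) {
        return std::isdigit(c) != 0;
    });
}

// Parses "segments:<page_size>:<n0>,<n1>,...". Returns false on malformed
// input and leaves `out` untouched in that case.
bool parseSegmentsLocationSketch(const std::string& location, SegmentsLocation& out) {
    const std::string prefix = "segments:";
    if (location.compare(0, prefix.size(), prefix) != 0) return false;
    const std::size_t colon = location.find(':', prefix.size());
    if (colon == std::string::npos) return false;

    SegmentsLocation parsed;
    parsed.page_size = 0;
    try {
        const std::string page_field =
            location.substr(prefix.size(), colon - prefix.size());
        if (!allDigits(page_field)) return false;
        parsed.page_size = static_cast<std::size_t>(std::stoull(page_field));
        if (parsed.page_size == 0) return false;

        std::istringstream nodes(location.substr(colon + 1));
        std::string token;
        while (std::getline(nodes, token, ',')) {
            if (token.empty()) continue;  // tolerate trailing or double commas
            if (!allDigits(token)) return false;
            parsed.numa_nodes.push_back(std::stoi(token));
        }
    } catch (const std::exception&) {
        return false;  // e.g. std::out_of_range from an oversized number
    }
    if (parsed.numa_nodes.empty()) return false;
    out = parsed;
    return true;
}
```

For example, "segments:4096:1,,3," parses to page size 4096 with nodes 1 and 3, while "segments:abc:1" is rejected.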


duhaode520 and others added 30 commits March 7, 2026 00:05

[Store] Add health check API for Client with HTTP /health endpoint (#1606)

ea8f1b9

[PG] Share P2PProxy/ConnectionPoller threads across backends. (#1607)

2456b0c

Co-authored-by: yuechen-sys <yuechen.bupt@gmail.com> Co-authored-by: liam <yzwliam@126.com>

[EP] In-place Member Update (#1630)

59ae11f

[PG] Remove CPU-only backend tests from CI (#1628)

6462780

[EP] Fix EP buffer allocation for MNNVL clusters (#1629)

7a96121

remove target segment desc cache when disconnect (#1624)

834c416

Co-authored-by: youxiao <youxiao@huawei.com>

change fabric mem alloc (#1623)

4c4f1e3

Co-authored-by: youxiao <youxiao@huawei.com>

[DOC] update news and badges readme (#1636)

a8fde4c

* [DOC] update readme * fix * fix

[EP] Enable Fabric Mem only if MC_USE_NVLINK_IPC is explicitly set to zero

ea59594

Update mooncake-ep/src/mooncake_ep_buffer.cpp

9349356

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Format?

5a47ce0

[CI] skip integration test for non-core file changes (#1609)

1bc3c65

--------- Signed-off-by: shicanwei <shicanwei.scw@alibaba-inc.com>

Merge pull request #1637 from kvcache-ai/sunxun/fix

9968d05

[EP] Enable Fabric Mem only if MC_USE_NVLINK_IPC is explicitly set to zero

[Store] bugfix: fix signal ignore (#1632)

3194a7b

Co-authored-by: wanyue.wy <wanyue.wy@oceanbase.com>

add endpoint set false in endpoint_store.cpp (#1643)

ff6274f

Co-authored-by: litiantian.118 <litiantian.118@jd.com>

[Store] Add allocation strategy benchmark (#1587)

97987ec

Co-authored-by: dongb0 <dongbozw@gmail.com>

[CI] Add CI workflow on ASCEND platform (#1640)

8887c35

Co-authored-by: chenkunjie0506 <chenkunjie1@huawei.com> Co-authored-by: ZhaoBaiwei <zhaobaiwei@huawei.com>

[Store][HA] bugfix: disable client_pool alive_detect to stop stale reconnection logs after HA master failover (#1642)

e43dd7d

* set host_alive_detect_duration to 0
* resolve comment.
* fix format

---------

Co-authored-by: haodedu <haodedu@tencent.com>

[Store]Optimize uring file support for SSD offloading (#1562)

e7f6bad

Co-authored-by: zhuxinjie-nz <240190801+zhuxinjie-nz@users.noreply.github.com>

[EP] Move EP/PG wheel-building logic from build_wheel.sh into CMake (#1616)

9bd773e

[TE] bugfix: add retry logic for ascend direct: in case remote restart with same port (#1641)

911fd2d

Co-authored-by: youxiao <youxiao@huawei.com>

[Store] Fix Ctrl-C hang in both Python and C++ client processes (#1620)

8e40563

--------- Co-authored-by: fatSheep <tzh2005t@gmail.com>

XingSong-Sun merged commit a30dfc7 into XingSong-Sun:main Mar 13, 2026

XingSong-Sun merged 31 commits into XingSong-Sun:main from kvcache-ai:main


18 participants
