Enhance Store and TransferEngine with health check, metrics, and NUMA support#1

Merged
XingSong-Sun merged 31 commits into XingSong-Sun:main from kvcache-ai:main on Mar 13, 2026
Conversation

XingSong-Sun (Owner) commented on Mar 13, 2026

Description

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

duhaode520 and others added 30 commits March 7, 2026 00:05
Co-authored-by: yuechen-sys <yuechen.bupt@gmail.com>
Co-authored-by: liam <yzwliam@126.com>
Co-authored-by: youxiao <youxiao@huawei.com>
#1597)

* [TransferEngine] Fix RDMA GID auto-discovery for IPv6 and reduce spurious errors

Fixes #1593 and #1592

Changes:
1. Accept all RoCE v2 GIDs (both IPv4-mapped and pure IPv6) instead of
   only IPv4-mapped GIDs. This allows Transfer Engine to work in
   IPv6-only environments.

2. Stop querying GID indices on first failure instead of iterating
   through all 256 possible indices. This eliminates spurious
   'Failed to query GID' error logs when devices have fewer GIDs.

3. Allow user-specified GIDs without network devices. When MC_GID_INDEX
   is explicitly set, log a warning but continue initialization instead
   of failing. Users setting this variable know their configuration.
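
The scan behaviour in changes 1 and 2 can be sketched as follows. This is an illustration only, not the Transfer Engine code: `GidQuery` is a hypothetical stand-in for the verbs GID query, `GidEntry` and `collectGids` are made-up names, and the RoCE v2 type check is assumed to happen inside the query callback.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// A GID is a 128-bit value laid out like an IPv6 address.
using Gid = std::array<std::uint8_t, 16>;

// Hypothetical stand-in for the device query: std::nullopt means the query failed.
using GidQuery = std::function<std::optional<Gid>(int)>;

struct GidEntry {
    int index;
    bool ipv4_mapped;  // true for ::ffff:a.b.c.d, false for a pure IPv6 GID
};

// IPv4-mapped addresses have ten zero bytes followed by 0xff 0xff.
inline bool isIpv4Mapped(const Gid& gid) {
    for (int i = 0; i < 10; ++i) {
        if (gid[i] != 0) return false;
    }
    return gid[10] == 0xff && gid[11] == 0xff;
}

// Walk the GID table from index 0 and stop at the first failed query
// instead of probing all max_index slots. Both address families are kept.
inline std::vector<GidEntry> collectGids(const GidQuery& query,
                                         int max_index = 256) {
    std::vector<GidEntry> entries;
    for (int index = 0; index < max_index; ++index) {
        const std::optional<Gid> gid = query(index);
        if (!gid) break;  // the device has no further entries
        entries.push_back({index, isIpv4Mapped(*gid)});
    }
    return entries;
}
```

With a device that exposes two GIDs, the callback is invoked three times (two hits and the terminating failure) rather than 256 times.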

* update

* update
Co-authored-by: youxiao <youxiao@huawei.com>
* [DOC] update readme

* fix

* fix
#1634)

* add /metrics and /metrics/summary HTTP endpoints to RealClient

* add integration tests for /metrics and /metrics/summary endpoints

* apply clang-format to metrics endpoint handlers

* add metrics data correctness test with put/get verification

Merge the metrics endpoint test and a new transfer stats verification
test into a single test case to avoid RealClient setup/teardown
resource contention that caused segfaults with 8 sequential tests.

The combined test verifies:
- /metrics and /metrics/summary return 200 before any transfers
- After put/get, Prometheus output contains write_bytes, read_bytes,
  put_latency_count, and get_latency_count
- Summary output shows Put and Get sections
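
A minimal sketch of the post-transfer check listed above, assuming the test simply searches the `/metrics` response body for the four metric name fragments. `missingMetrics` and `transferMetricNames` are hypothetical helpers, not part of the test suite, and the HTTP fetch itself is left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Metric name fragments the combined test expects after one put and one get.
inline const std::vector<std::string>& transferMetricNames() {
    static const std::vector<std::string> names = {
        "write_bytes", "read_bytes", "put_latency_count", "get_latency_count"};
    return names;
}

// Return every expected fragment that does not occur in the response body.
// An empty result means the Prometheus output passed the check.
inline std::vector<std::string> missingMetrics(
    const std::string& body, const std::vector<std::string>& expected) {
    std::vector<std::string> missing;
    for (const std::string& name : expected) {
        if (body.find(name) == std::string::npos) missing.push_back(name);
    }
    return missing;
}
```

Reporting the missing names, rather than a single boolean, makes a failing assertion easier to diagnose.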

---------

Co-authored-by: haodedu <haodedu@tencent.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------

Signed-off-by: shicanwei <shicanwei.scw@alibaba-inc.com>
[EP] Enable Fabric Mem only if MC_USE_NVLINK_IPC is explicitly set to zero
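
"Explicitly set to zero" can be read as the sketch below, which assumes an exact string comparison against "0" and treats an unset or empty variable as not opting in. The function names are hypothetical, and the real parsing in mooncake-ep may differ.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Fabric Mem is enabled only when the variable exists and its value is "0".
// Unset, empty, or any other value keeps the default path.
inline bool fabricMemEnabled(const char* value) {
    return value != nullptr && std::strcmp(value, "0") == 0;
}

// Convenience wrapper that reads the process environment.
inline bool fabricMemEnabledFromEnv() {
    return fabricMemEnabled(std::getenv("MC_USE_NVLINK_IPC"));
}
```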
…as (#1626)

DiscardedReplicas iterates all replicas and calls get_memory_buffer_size(),
which logs "Invalid replica type: DISK" for non-memory replicas. In L3/DFS
scenarios where DISK replicas are routinely discarded, this produces
continuous error log spam.

Guard the call with is_memory_replica() so only memory replicas contribute
to the size counter. Functionally equivalent (non-memory types already
return 0) but eliminates the misleading error logs.

Also add LOCAL_DISK to the ReplicaType operator<< string map — it was
missing, causing LOCAL_DISK to print as "UNKNOWN" in diagnostics.

Fixes #1618
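
The guard and the extended string map described above might look like this. It is a simplified sketch: the `Replica` struct, its `buffer_size` field, and `discardedMemoryBytes` are hypothetical, and only the names `ReplicaType`, `is_memory_replica`, and the enumerator strings come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

enum class ReplicaType { MEMORY, DISK, LOCAL_DISK };

// Every enumerator has its own label, so LOCAL_DISK no longer prints as UNKNOWN.
inline std::ostream& operator<<(std::ostream& os, ReplicaType type) {
    switch (type) {
        case ReplicaType::MEMORY: return os << "MEMORY";
        case ReplicaType::DISK: return os << "DISK";
        case ReplicaType::LOCAL_DISK: return os << "LOCAL_DISK";
    }
    return os << "UNKNOWN";
}

inline std::string toString(ReplicaType type) {
    std::ostringstream os;
    os << type;
    return os.str();
}

// Hypothetical, simplified replica record.
struct Replica {
    ReplicaType type;
    std::size_t buffer_size;
    bool is_memory_replica() const { return type == ReplicaType::MEMORY; }
};

// Sum sizes of memory replicas only. Non-memory replicas are skipped before
// any size accessor is called, so nothing is logged for them.
inline std::size_t discardedMemoryBytes(const std::vector<Replica>& replicas) {
    std::size_t total = 0;
    for (const Replica& replica : replicas) {
        if (!replica.is_memory_replica()) continue;
        total += replica.buffer_size;
    }
    return total;
}
```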
Co-authored-by: wanyue.wy <wanyue.wy@oceanbase.com>
Co-authored-by: litiantian.118 <litiantian.118@jd.com>
Co-authored-by: dongb0 <dongbozw@gmail.com>
Co-authored-by: chenkunjie0506 <chenkunjie1@huawei.com>
Co-authored-by: ZhaoBaiwei <zhaobaiwei@huawei.com>
…connection logs after HA master failover (#1642)

* set host_alive_detect_duration to 0

* resolve comment.

* fix format

---------

Co-authored-by: haodedu <haodedu@tencent.com>
Co-authored-by: zhuxinjie-nz <240190801+zhuxinjie-nz@users.noreply.github.com>
* Fix IPC offset corruption for sub-allocated GPU tensors in NVLink transport

When framework caching allocators (PyTorch, etc.) sub-allocate tensors
within larger cudaMalloc segments, cudaIpcGetMemHandle returns a handle
for the entire segment, not the sub-allocation. The existing code stored
the sub-allocation address as the buffer base, causing
relocateSharedMemoryAddress to compute an incorrect offset on the target
side. Small tensors (<1MB) sharing a segment would read from the wrong
location.

Fix: use cuMemGetAddressRange() to resolve the true cudaMalloc base
address before registration. Register at segment granularity and skip
duplicate registrations when multiple tensors share the same segment.

* Address review: fix unregister and insert-before-confirm

- unregisterLocalMemory: resolve base address via cuMemGetAddressRange
  before erasing from tracking set and metadata (matches register path)
- registerLocalMemory: insert into registered_base_addrs_ only after
  addLocalMemoryBuffer succeeds to avoid inconsistent state on failure

* Update mooncake-transfer-engine/src/transport/intranode_nvlink_transport/intranode_nvlink_transport.cpp
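
The bookkeeping from the two commits above can be sketched without CUDA. In this illustration `RangeResolver` is a hypothetical stand-in for `cuMemGetAddressRange()`, `AddBuffer` stands in for `addLocalMemoryBuffer`, and `SegmentRegistry` is a made-up class. The real transport also handles IPC handles and metadata, which are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <set>
#include <utility>

// Base address and length of the allocation that contains a pointer.
struct Range {
    std::uintptr_t base;
    std::size_t size;
};

// Hypothetical stand-ins for the driver lookup and the metadata update.
using RangeResolver = std::function<std::optional<Range>(std::uintptr_t)>;
using AddBuffer = std::function<bool(const Range&)>;

// Offset of a sub-allocation relative to the true segment base.
inline std::size_t offsetInSegment(const Range& range, std::uintptr_t addr) {
    return static_cast<std::size_t>(addr - range.base);
}

class SegmentRegistry {
public:
    SegmentRegistry(RangeResolver resolve, AddBuffer add_buffer)
        : resolve_(std::move(resolve)), add_buffer_(std::move(add_buffer)) {}

    // Register at segment granularity; tensors sharing a segment register once.
    bool registerMemory(std::uintptr_t addr) {
        const std::optional<Range> range = resolve_(addr);
        if (!range) return false;
        if (registered_bases_.count(range->base) != 0) return true;
        // Record the base only after the buffer was added successfully,
        // so a failure leaves no stale entry behind.
        if (!add_buffer_(*range)) return false;
        registered_bases_.insert(range->base);
        return true;
    }

    // Resolve the base first so unregistering matches the register path.
    bool unregisterMemory(std::uintptr_t addr) {
        const std::optional<Range> range = resolve_(addr);
        if (!range) return false;
        return registered_bases_.erase(range->base) > 0;
    }

    std::size_t registeredCount() const { return registered_bases_.size(); }

private:
    RangeResolver resolve_;
    AddBuffer add_buffer_;
    std::set<std::uintptr_t> registered_bases_;
};
```

Keying the set on the resolved base, not the caller's pointer, is what lets any tensor inside a segment both find and remove that segment's registration.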

---------

Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>
Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>
…NIC utilization

In Standalone mode the Real Client's global_segment was allocated via
aligned_alloc, landing all physical pages on a single NUMA node. The
TransferEngine's selectDevice then only picked NICs local to that NUMA node,
leaving the remaining NICs idle.

This patch introduces per-NUMA-region segment binding:
- New allocator (allocate_buffer_numa_segments): mmap contiguous VMA,
  mbind(MPOL_BIND) each region to a NIC-bearing NUMA node, prefault
  with madvise(MADV_POPULATE_WRITE) — zero migration overhead.
- New location encoding ("segments:<page_size>:<n0>,<n1>,...") carried
  in buffer.name through metadata, parsed by both local and remote
  selectDevice to route each slice to the correct NUMA-local NIC.
- Activated via MC_SEGMENT_NUMA_NODES env var (e.g., "1,3,5,7").
  Without it, the original allocation path is used (fully backward
  compatible).

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
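
A minimal parser for the "segments:<page_size>:<n0>,<n1>,..." location string, written as a sketch and not taken from the patch. The `SegmentsLocation` struct and the `std::optional` return type are assumptions. The function borrows the name `parseSegmentsLocation` from a later commit in this PR, whose actual signature is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Decoded form of "segments:<page_size>:<n0>,<n1>,...".
struct SegmentsLocation {
    std::size_t page_size;
    std::vector<int> numa_nodes;  // one NUMA node ID per region, in order
};

// Returns std::nullopt for anything that is not a well-formed segments string.
inline std::optional<SegmentsLocation> parseSegmentsLocation(
    const std::string& location) {
    const std::string prefix = "segments:";
    if (location.compare(0, prefix.size(), prefix) != 0) return std::nullopt;
    const std::size_t colon = location.find(':', prefix.size());
    if (colon == std::string::npos) return std::nullopt;
    try {
        SegmentsLocation out{};
        out.page_size = std::stoull(
            location.substr(prefix.size(), colon - prefix.size()));
        std::stringstream nodes(location.substr(colon + 1));
        std::string token;
        while (std::getline(nodes, token, ',')) {
            if (token.empty()) continue;  // tolerate trailing or double commas
            out.numa_nodes.push_back(std::stoi(token));
        }
        if (out.page_size == 0 || out.numa_nodes.empty()) return std::nullopt;
        return out;
    } catch (const std::exception&) {
        // std::stoull and std::stoi throw on non-numeric or out-of-range input.
        return std::nullopt;
    }
}
```

For example, "segments:4096:1,3,5,7" decodes to a page size of 4096 and the node list 1, 3, 5, 7, while a plain location such as "cpu:0" is rejected and can fall through to the original routing.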
ibv_reg_mr() internally calls get_user_pages() which triggers page
faults that respect the VMA-level mbind(MPOL_BIND) policy. The explicit
madvise(MADV_POPULATE_WRITE) prefault was causing a redundant full-buffer
traversal, doubling initialization time compared to the original path.

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
Replace manual MC_SEGMENT_NUMA_NODES env var with automatic discovery
from the TransferEngine's already-initialized Topology. NUMA-segmented
allocation now activates automatically when:
  1. Real Client runs in standalone mode (ipc_socket_path is set)
  2. RDMA NICs span more than one NUMA node

Add Client::GetNicNumaNodes() which extracts NIC-bearing NUMA nodes
from the Topology matrix — zero sysfs access, zero extra discovery.

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
NUMA-segmented mode only benefits physical RDMA NICs with real NUMA
affinity. Virtual NICs (eRDMA) report numa_node=-1 and have no
PCIe topology, making segmentation pointless. Gate the feature on
protocol == "rdma" in addition to the standalone-mode check.

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
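
Taken together, the activation conditions from the two commits above reduce to a small predicate. The sketch below assumes the per-NIC NUMA IDs have already been read from the Topology. `nicBearingNumaNodes` and `shouldUseNumaSegments` are hypothetical helpers and do not reflect the signature of `Client::GetNicNumaNodes()`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Distinct NUMA nodes that host at least one NIC. Negative IDs (for example
// numa_node=-1 reported by virtual NICs) carry no affinity and are dropped.
inline std::vector<int> nicBearingNumaNodes(const std::vector<int>& nic_numa_ids) {
    std::vector<int> nodes;
    for (int id : nic_numa_ids) {
        if (id >= 0) nodes.push_back(id);
    }
    std::sort(nodes.begin(), nodes.end());
    nodes.erase(std::unique(nodes.begin(), nodes.end()), nodes.end());
    return nodes;
}

// NUMA-segmented allocation applies only when all three conditions hold:
// the protocol is "rdma", the client is standalone (ipc_socket_path is set),
// and the NICs span more than one NUMA node.
inline bool shouldUseNumaSegments(const std::string& protocol,
                                  const std::string& ipc_socket_path,
                                  const std::vector<int>& nic_numa_ids) {
    if (protocol != "rdma") return false;
    if (ipc_socket_path.empty()) return false;
    return nicBearingNumaNodes(nic_numa_ids).size() > 1;
}
```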
- Use numa_num_possible_nodes() instead of numa_num_configured_nodes()
  to avoid bitmask overflow on systems with sparse NUMA node IDs
- Fix mbind maxnode argument: mask->size (not mask->size + 1)
- Add try-catch in parseSegmentsLocation to handle malformed strings
  and skip empty tokens from trailing/double commas
- Merge duplicate hugepage_segment_ptrs_ branches for NUMA-segmented
  and hugepage allocations
- Fix docstring: remove incorrect prefault description

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
…t with same port (#1641)

Co-authored-by: youxiao <youxiao@huawei.com>
---------

Co-authored-by: fatSheep <tzh2005t@gmail.com>
@XingSong-Sun XingSong-Sun merged commit a30dfc7 into XingSong-Sun:main Mar 13, 2026