Enhance Store and TransferEngine with health check, metrics, and NUMA support #1
Merged
Co-authored-by: yuechen-sys <yuechen.bupt@gmail.com>
Co-authored-by: liam <yzwliam@126.com>
Co-authored-by: youxiao <youxiao@huawei.com>
(#1597)
* [TransferEngine] Fix RDMA GID auto-discovery for IPv6 and reduce spurious errors

Fixes #1593 and #1592

Changes:
1. Accept all RoCE v2 GIDs (both IPv4-mapped and pure IPv6) instead of only IPv4-mapped GIDs. This allows Transfer Engine to work in IPv6-only environments.
2. Stop querying GID indices on the first failure instead of iterating through all 256 possible indices. This eliminates spurious "Failed to query GID" error logs when devices have fewer GIDs.
3. Allow user-specified GIDs without network devices. When MC_GID_INDEX is explicitly set, log a warning but continue initialization instead of failing; users setting this variable know their configuration.
* update
* update
Co-authored-by: youxiao <youxiao@huawei.com>
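The GID-acceptance change above can be sketched as a pair of predicates. This is a minimal illustration, not the Transfer Engine's actual code: the helper names are hypothetical, and the 16-byte array stands in for `ibv_gid`. The old behavior required the IPv4-mapped pattern; the new behavior accepts any populated GID, which covers pure IPv6.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

using Gid = std::array<uint8_t, 16>;  // stand-in for ibv_gid's raw bytes

// IPv4-mapped IPv6 (::ffff:a.b.c.d): first 10 bytes zero, bytes 10-11 = 0xff.
// Before the fix, only GIDs matching this pattern were accepted.
inline bool isIpv4MappedGid(const Gid& gid) {
    return std::all_of(gid.begin(), gid.begin() + 10,
                       [](uint8_t b) { return b == 0; }) &&
           gid[10] == 0xff && gid[11] == 0xff;
}

// After the fix: any non-zero GID is usable, so pure-IPv6 RoCE v2 GIDs
// work too. An all-zero GID marks an unpopulated index.
inline bool isUsableGid(const Gid& gid) {
    return std::any_of(gid.begin(), gid.end(),
                       [](uint8_t b) { return b != 0; });
}
```

With this relaxation, an IPv6-only deployment no longer fails GID discovery merely because no IPv4-mapped entry exists.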
* [DOC] update readme
* fix
* fix
(#1634)
* add /metrics and /metrics/summary HTTP endpoints to RealClient
* add integration tests for /metrics and /metrics/summary endpoints
* apply clang-format to metrics endpoint handlers
* add metrics data correctness test with put/get verification

Merge the metrics endpoint test and a new transfer-stats verification test into a single test case to avoid RealClient setup/teardown resource contention that caused segfaults with 8 sequential tests. The combined test verifies:
- /metrics and /metrics/summary return 200 before any transfers
- After put/get, the Prometheus output contains write_bytes, read_bytes, put_latency_count, and get_latency_count
- The summary output shows Put and Get sections

Co-authored-by: haodedu <haodedu@tencent.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
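A /metrics endpoint like the one described above serves counters in the Prometheus text exposition format. The following is a minimal sketch of that rendering step, with a hypothetical helper name and a plain counter map standing in for RealClient's real metrics store:

```cpp
#include <map>
#include <sstream>
#include <string>

// Render a set of monotonically increasing counters in the Prometheus
// text exposition format: a "# TYPE" comment line followed by
// "<name> <value>" for each metric.
inline std::string renderPrometheus(const std::map<std::string, double>& counters) {
    std::ostringstream out;
    for (const auto& [name, value] : counters) {
        out << "# TYPE " << name << " counter\n";
        out << name << ' ' << value << '\n';
    }
    return out.str();
}
```

The integration test's assertions (e.g., that the output contains `write_bytes` after a put) amount to substring checks on exactly this kind of payload.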
--------- Signed-off-by: shicanwei <shicanwei.scw@alibaba-inc.com>
[EP] Enable Fabric Mem only if MC_USE_NVLINK_IPC is explicitly set to zero
…as (#1626)

DiscardedReplicas iterates all replicas and calls get_memory_buffer_size(), which logs "Invalid replica type: DISK" for non-memory replicas. In L3/DFS scenarios where DISK replicas are routinely discarded, this produces continuous error-log spam.

Guard the call with is_memory_replica() so only memory replicas contribute to the size counter. This is functionally equivalent (non-memory types already return 0) but eliminates the misleading error logs.

Also add LOCAL_DISK to the ReplicaType operator<< string map; it was missing, causing LOCAL_DISK to print as "UNKNOWN" in diagnostics.

Fixes #1618
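Both parts of the fix can be sketched in a few lines. The types below are simplified stand-ins, not the real mooncake-store definitions: non-memory replicas are filtered out before the size accessor runs, and LOCAL_DISK gets its own string so it no longer prints as "UNKNOWN".

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <vector>

enum class ReplicaType { MEMORY, DISK, LOCAL_DISK };

inline bool is_memory_replica(ReplicaType t) { return t == ReplicaType::MEMORY; }

// LOCAL_DISK is now present in the mapping, matching the diagnostics fix.
inline std::ostream& operator<<(std::ostream& os, ReplicaType t) {
    switch (t) {
        case ReplicaType::MEMORY:     return os << "MEMORY";
        case ReplicaType::DISK:       return os << "DISK";
        case ReplicaType::LOCAL_DISK: return os << "LOCAL_DISK";
    }
    return os << "UNKNOWN";
}

struct Replica { ReplicaType type; uint64_t buffer_size; };

// Guarded accumulation: non-memory replicas never reach the size accessor,
// so they can no longer trigger the "Invalid replica type" error log.
inline uint64_t discardedMemoryBytes(const std::vector<Replica>& replicas) {
    uint64_t total = 0;
    for (const auto& r : replicas)
        if (is_memory_replica(r.type)) total += r.buffer_size;
    return total;
}
```

The total is unchanged by the guard (non-memory types contributed 0 anyway); only the logging side effect disappears.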
Co-authored-by: wanyue.wy <wanyue.wy@oceanbase.com>
Co-authored-by: litiantian.118 <litiantian.118@jd.com>
Co-authored-by: dongb0 <dongbozw@gmail.com>
Co-authored-by: chenkunjie0506 <chenkunjie1@huawei.com>
Co-authored-by: ZhaoBaiwei <zhaobaiwei@huawei.com>
…connection logs after HA master failover (#1642)
* set host_alive_detect_duration to 0
* resolve comment
* fix format

Co-authored-by: haodedu <haodedu@tencent.com>
Co-authored-by: zhuxinjie-nz <240190801+zhuxinjie-nz@users.noreply.github.com>
* Fix IPC offset corruption for sub-allocated GPU tensors in NVLink transport

When framework caching allocators (PyTorch, etc.) sub-allocate tensors within larger cudaMalloc segments, cudaIpcGetMemHandle returns a handle for the entire segment, not the sub-allocation. The existing code stored the sub-allocation address as the buffer base, causing relocateSharedMemoryAddress to compute an incorrect offset on the target side. Small tensors (<1MB) sharing a segment would read from the wrong location.

Fix: use cuMemGetAddressRange() to resolve the true cudaMalloc base address before registration. Register at segment granularity and skip duplicate registrations when multiple tensors share the same segment.

* Address review: fix unregister and insert-before-confirm
- unregisterLocalMemory: resolve the base address via cuMemGetAddressRange before erasing from the tracking set and metadata (matches the register path)
- registerLocalMemory: insert into registered_base_addrs_ only after addLocalMemoryBuffer succeeds, to avoid inconsistent state on failure

* Update mooncake-transfer-engine/src/transport/intranode_nvlink_transport/intranode_nvlink_transport.cpp

Co-authored-by: Ishan Dhanani <ishan@dhanani.dev>
Co-authored-by: Teng Ma <teng-ma@linux.alibaba.com>
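The corruption comes down to offset arithmetic. A sketch of the relocation math (the function name is illustrative; in the real transport the segment base is what cuMemGetAddressRange() returns for the allocation containing the tensor):

```cpp
#include <cstdint>

// A tensor lives at `addr`, sub-allocated inside a cudaMalloc segment
// starting at `segment_base`. The IPC handle maps the *whole* segment on
// the peer at `remote_base`, so the peer must add the tensor's offset
// within the segment. The bug was equivalent to using `addr` itself as
// the base, which makes the offset 0 and points every small tensor at
// the start of the shared segment.
inline uintptr_t relocate(uintptr_t addr, uintptr_t segment_base,
                          uintptr_t remote_base) {
    return remote_base + (addr - segment_base);
}
```

With the true segment base resolved before registration, two tensors sharing one segment relocate to distinct, correct peer addresses.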
…NIC utilization
In Standalone mode the Real Client's global_segment was allocated via
aligned_alloc, landing all physical pages on a single NUMA node. The
TransferEngine's selectDevice then only picked NICs local to that NUMA,
leaving the remaining NICs idle.
This patch introduces per-NUMA-region segment binding:
- New allocator (allocate_buffer_numa_segments): mmap contiguous VMA,
mbind(MPOL_BIND) each region to a NIC-bearing NUMA node, prefault
with madvise(MADV_POPULATE_WRITE) — zero migration overhead.
- New location encoding ("segments:<page_size>:<n0>,<n1>,...") carried
in buffer.name through metadata, parsed by both local and remote
selectDevice to route each slice to the correct NUMA-local NIC.
- Activated via MC_SEGMENT_NUMA_NODES env var (e.g., "1,3,5,7").
Without it, the original allocation path is used (fully backward
compatible).
Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
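The routing half of the scheme above reduces to a region lookup. This sketch makes layout assumptions of its own (equal-sized regions, region i bound to the i-th node in the parsed list); the function name is hypothetical, but it shows how selectDevice can map a slice's offset within the segment to the NUMA node that owns those pages, and from there pick a NUMA-local NIC:

```cpp
#include <cstddef>
#include <vector>

// Given a slice's byte offset inside the segmented VMA, the per-region
// size from the location string, and the ordered NUMA-node list, return
// the node whose pages back that offset (-1 if out of range).
inline int nodeForOffset(std::size_t offset, std::size_t region_size,
                         const std::vector<int>& nodes) {
    if (nodes.empty() || region_size == 0) return -1;
    std::size_t region = offset / region_size;
    return region < nodes.size() ? nodes[region] : -1;
}
```

Because both the local and the remote side parse the same encoding out of buffer.name, they agree on this offset-to-node mapping without any extra metadata exchange.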
ibv_reg_mr() internally calls get_user_pages() which triggers page faults that respect the VMA-level mbind(MPOL_BIND) policy. The explicit madvise(MADV_POPULATE_WRITE) prefault was causing a redundant full-buffer traversal, doubling initialization time compared to the original path. Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
Replace the manual MC_SEGMENT_NUMA_NODES env var with automatic discovery from the TransferEngine's already-initialized Topology. NUMA-segmented allocation now activates automatically when:
1. The Real Client runs in standalone mode (ipc_socket_path is set)
2. RDMA NICs span more than one NUMA node

Add Client::GetNicNumaNodes(), which extracts NIC-bearing NUMA nodes from the Topology matrix: zero sysfs access, zero extra discovery.

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
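The extraction step can be sketched as a simple dedup over per-NIC NUMA affinities. This is not the real GetNicNumaNodes() signature; it assumes the Topology has already yielded one numa_node value per NIC, with -1 meaning "no affinity" (as virtual NICs like eRDMA report):

```cpp
#include <set>
#include <vector>

// Collect the distinct NUMA nodes that host at least one RDMA NIC.
// Nodes reported as -1 (no PCIe/NUMA affinity) are dropped, so virtual
// NICs never contribute. Result is sorted and duplicate-free.
inline std::vector<int> nicNumaNodes(const std::vector<int>& nic_numa) {
    std::set<int> nodes;
    for (int n : nic_numa)
        if (n >= 0) nodes.insert(n);
    return {nodes.begin(), nodes.end()};
}
```

The activation rule then falls out naturally: segmented allocation is worthwhile only when this set has more than one element.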
NUMA-segmented mode only benefits physical RDMA NICs with real NUMA affinity. Virtual NICs (eRDMA) report numa_node=-1 and have no PCIe topology, making segmentation pointless. Gate the feature on protocol == "rdma" in addition to standalone mode check. Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
- Use numa_num_possible_nodes() instead of numa_num_configured_nodes() to avoid bitmask overflow on systems with sparse NUMA node IDs
- Fix the mbind maxnode argument: mask->size (not mask->size + 1)
- Add a try-catch in parseSegmentsLocation to handle malformed strings and skip empty tokens from trailing/double commas
- Merge duplicate hugepage_segment_ptrs_ branches for NUMA-segmented and hugepage allocations
- Fix docstring: remove the incorrect prefault description

Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
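A hardened parser for the "segments:&lt;page_size&gt;:&lt;n0&gt;,&lt;n1&gt;,..." encoding might look like the following. This is a hypothetical re-implementation, not the actual parseSegmentsLocation: it mirrors the fixes listed above by catching conversion exceptions from malformed strings and skipping empty tokens left by trailing or doubled commas.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

struct SegmentsLocation {
    std::size_t page_size = 0;
    std::vector<int> nodes;
};

// Parse "segments:<page_size>:<n0>,<n1>,...". Returns std::nullopt on
// any malformed input instead of throwing out of the parser.
inline std::optional<SegmentsLocation>
parseSegmentsLocation(const std::string& s) {
    const std::string prefix = "segments:";
    if (s.rfind(prefix, 0) != 0) return std::nullopt;
    auto colon = s.find(':', prefix.size());
    if (colon == std::string::npos) return std::nullopt;
    SegmentsLocation loc;
    try {
        loc.page_size =
            std::stoull(s.substr(prefix.size(), colon - prefix.size()));
        std::stringstream list(s.substr(colon + 1));
        std::string tok;
        while (std::getline(list, tok, ',')) {
            if (tok.empty()) continue;  // tolerate ",," and trailing ","
            loc.nodes.push_back(std::stoi(tok));
        }
    } catch (const std::exception&) {
        return std::nullopt;  // stoull/stoi reject non-numeric tokens
    }
    if (loc.page_size == 0 || loc.nodes.empty()) return std::nullopt;
    return loc;
}
```

Returning an empty optional lets the caller fall back to the original (non-segmented) allocation path on any bad location string, preserving the backward compatibility noted earlier.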
…t with same port (#1641) Co-authored-by: youxiao <youxiao@huawei.com>
--------- Co-authored-by: fatSheep <tzh2005t@gmail.com>
Description

Module
- mooncake-transfer-engine
- mooncake-store
- mooncake-ep
- mooncake-integration
- mooncake-p2p-store
- mooncake-wheel
- mooncake-pg
- mooncake-rl

Type of Change

How Has This Been Tested?

Checklist
- Run ./scripts/code_format.sh before submitting.