[TransferEngine] Fix RDMA GID auto-discovery for IPv6 and reduce spur…#1597
Conversation
…ious errors Fixes #1593 and #1592 Changes: 1. Accept all RoCE v2 GIDs (both IPv4-mapped and pure IPv6) instead of only IPv4-mapped GIDs. This allows Transfer Engine to work in IPv6-only environments. 2. Stop querying GID indices on first failure instead of iterating through all 256 possible indices. This eliminates spurious 'Failed to query GID' error logs when devices have fewer GIDs. 3. Allow user-specified GIDs without network devices. When MC_GID_INDEX is explicitly set, log a warning but continue initialization instead of failing. Users setting this variable know their configuration.
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the robustness and compatibility of the Transfer Engine's RDMA GID discovery mechanism. It broadens support for IPv6 network configurations, streamlines the GID querying process to minimize error logging, and improves flexibility for users who manually configure GIDs, ensuring the system can operate effectively in a wider range of network setups. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces three important fixes for RDMA GID handling in the Transfer Engine. First, it extends RoCE v2 GID support to include pure IPv6 addresses, enabling functionality in IPv6-only environments. Second, it optimizes GID discovery by stopping the search upon the first failure, which reduces spurious error logs on devices with fewer GIDs. Third, it allows user-specified GIDs to be used even without an associated network device, providing more flexibility for advanced user configurations while issuing a warning. The changes are well-implemented, targeted, and improve the robustness and usability of the RDMA transport. The code modifications are correct and align with the stated goals.
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
|
||
| if ((ipv6_addr_v4mapped((struct in6_addr *)gid_entry.gid.raw) && | ||
| gid_entry.gid_type == IBV_GID_TYPE_ROCE_V2) || | ||
| if (gid_entry.gid_type == IBV_GID_TYPE_ROCE_V2 || |
There was a problem hiding this comment.
We originally add this restriction to avoid finding loopback GID if the network only supports IPv4. So it's better to find V4 GID first, the fallback to V6.
|
@stmatengss I will merge it after fixing the code format issue. |
|
@copilot fix code format issue |
|
@stmatengss I've opened a new pull request, #1625, to work on those changes. Once the pull request is ready, I'll request review from you. |
|
After changing code format, I think it will be LGTM. |
|
Fixed. Fall-back path: ipv4 => ipv4 w/o network => ipv6 => ipv6 w/o network. PTAL. @alogfans |
kvcache-ai#1597) * [TransferEngine] Fix RDMA GID auto-discovery for IPv6 and reduce spurious errors Fixes kvcache-ai#1593 and kvcache-ai#1592 Changes: 1. Accept all RoCE v2 GIDs (both IPv4-mapped and pure IPv6) instead of only IPv4-mapped GIDs. This allows Transfer Engine to work in IPv6-only environments. 2. Stop querying GID indices on first failure instead of iterating through all 256 possible indices. This eliminates spurious 'Failed to query GID' error logs when devices have fewer GIDs. 3. Allow user-specified GIDs without network devices. When MC_GID_INDEX is explicitly set, log a warning but continue initialization instead of failing. Users setting this variable know their configuration. * update * update
…ious errors
Fixes #1593 and #1592
Changes:
Accept all RoCE v2 GIDs (both IPv4-mapped and pure IPv6) instead of only IPv4-mapped GIDs. This allows Transfer Engine to work in IPv6-only environments.
Stop querying GID indices on first failure instead of iterating through all 256 possible indices. This eliminates spurious 'Failed to query GID' error logs when devices have fewer GIDs.
Allow user-specified GIDs without network devices. When MC_GID_INDEX is explicitly set, log a warning but continue initialization instead of failing. Users setting this variable know their configuration.
Description
Module
mooncake-transfer-engine)mooncake-store)mooncake-ep)mooncake-integration)mooncake-p2p-store)mooncake-wheel)mooncake-pg)mooncake-rl)Type of Change
How Has This Been Tested?
Checklist
./scripts/code_format.shbefore submitting.