-
Notifications
You must be signed in to change notification settings - Fork 734
fix: fix heterogeneous rdma transport error #1657
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 5 commits
a30dfc7
c79a4de
4b9f292
7041ac9
7a68c81
7e3f8e5
eb740f8
63d2021
1c2c945
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -37,12 +37,13 @@ class RdmaContext; | |
| class RdmaEndPoint; | ||
| class TransferMetadata; | ||
| class WorkerPool; | ||
| class HeterogeneousRdmaTransport; | ||
|
|
||
| class RdmaTransport : public Transport { | ||
| friend class RdmaContext; | ||
| friend class RdmaEndPoint; | ||
| friend class WorkerPool; | ||
|
|
||
| friend class HeterogeneousRdmaTransport; | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could you clarify which private member(s) this is needed for? If none currently, this should be removed —
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. HeterogeneousRdmaTransport needs to access the private members submitTransfer, submitTransferTask, and getTransferStatus in RdmaTransport. If friend is not added, it will result in a compilation error.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation — that makes sense for the original code. However, PR #1663 has since been merged, which moved submitTransfer and related methods from private to public. So the friend class declaration is no longer needed. Could you rebase onto the latest main and drop that change?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The corresponding modifications have been submitted. |
||
| public: | ||
| using BufferDesc = TransferMetadata::BufferDesc; | ||
| using SegmentDesc = TransferMetadata::SegmentDesc; | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,13 +4,21 @@ | |
| namespace mooncake { | ||
|
|
||
| namespace { | ||
| // ACL memory location types for aclrtPtrAttributes::location.type: | ||
| // 0 = ACL host memory, 1 = device memory, 2 = regular CPU memory (malloc) | ||
| static constexpr int kDeviceMemoryLocationType = 1; | ||
|
|
||
| bool isCpuMemory(void *addr) { | ||
| aclrtPtrAttributes attributes{}; | ||
| if (int ret = aclrtPointerGetAttributes(addr, &attributes)) { | ||
| LOG(ERROR) << "aclrtPointrtGetAttributes error, ret: " << ret; | ||
| return false; | ||
| // If ACL cannot identify the pointer, it is not ACL-managed device | ||
| // memory, so treat it as regular CPU memory. | ||
| LOG(WARNING) << "aclrtPointerGetAttributes failed for addr " << addr | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. reformat the example file string concatenation
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done. |
||
| << ", ret: " << ret | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The logic for determining if a memory address belongs to the CPU has been updated. The previous implementation only checked for |
||
| << ". Treating as CPU memory."; | ||
| return true; | ||
| } | ||
| return (attributes.location.type == 0); | ||
| return (attributes.location.type != kDeviceMemoryLocationType); | ||
| } | ||
| } // namespace | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string concatenation for adding the
npu:0entry has inconsistent formatting, which makes the code harder to read. Please reformat this section for better clarity and consistency with the surrounding code.