Merged
Changes from 5 commits
@@ -159,8 +159,9 @@ std::string loadNicPriorityMatrix() {
"], []], "
" \"cpu:1\": [[" +
device_names +
- "], []], "
- " \"cuda:0\": [[" +
+ "], []], " " \"npu:0\": [["
+ + device_names +
+ "], []], " " \"cuda:0\": [[" +
Contributor

medium

The string concatenation for adding the npu:0 entry has inconsistent formatting, which makes the code harder to read. Please reformat this section for better clarity and consistency with the surrounding code.
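One way to sidestep this kind of formatting drift is to generate the repeated entries in a loop rather than by hand-written concatenation. The following is a minimal sketch under stated assumptions, not code from this PR: `buildPriorityMatrix` is a hypothetical helper, the surrounding braces are assumed, and the per-entry shape `"<location>": [[<device_names>], []]` is inferred from the fragments visible in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: emits one `"<location>": [[<device_names>], []]`
// entry per location and joins the entries with ", ".
std::string buildPriorityMatrix(const std::vector<std::string>& locations,
                                const std::string& device_names) {
    assert(!locations.empty() && "at least one location is expected");
    std::string json = "{";
    for (std::size_t i = 0; i < locations.size(); ++i) {
        if (i > 0) {
            json += ", ";
        }
        // Every location gets the same preferred device list and an
        // empty fallback list, mirroring the entries in the diff.
        json += "\"" + locations[i] + "\": [[" + device_names + "], []]";
    }
    json += "}";
    return json;
}
```

With a helper like this, adding a location such as `npu:0` becomes a one-element change to the `locations` list, so the formatting of the individual entries cannot diverge.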

Suggested change
- "], []], " " \"npu:0\": [["
- + device_names +
- "], []], " " \"cuda:0\": [[" +
+ "], []], "
+ " \"npu:0\": [[" +
+ device_names +
+ "], []], "
+ " \"cuda:0\": [[" +

device_names +
"], []], "
" \"musa:0\": [[" +
@@ -37,12 +37,13 @@ class RdmaContext;
class RdmaEndPoint;
class TransferMetadata;
class WorkerPool;
+ class HeterogeneousRdmaTransport;

class RdmaTransport : public Transport {
friend class RdmaContext;
friend class RdmaEndPoint;
friend class WorkerPool;

+ friend class HeterogeneousRdmaTransport;
Collaborator

Could you clarify which private member(s) this is needed for? If none currently, this should be removed — friend breaks encapsulation and shouldn't be added speculatively.

Contributor Author

HeterogeneousRdmaTransport needs to access the private members submitTransfer, submitTransferTask, and getTransferStatus in RdmaTransport. If friend is not added, it will result in a compilation error.

Collaborator

> HeterogeneousRdmaTransport needs to access the private members submitTransfer, submitTransferTask, and getTransferStatus in RdmaTransport. If friend is not added, it will result in a compilation error.

Thanks for the explanation — that makes sense for the original code. However, PR #1663 has since been merged, which moved submitTransfer and related methods from private to public. So the friend class declaration is no longer needed. Could you rebase onto the latest main and drop that change?
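The point about `friend` becoming unnecessary can be illustrated with a reduced example. This is a sketch under stated assumptions, not the project's code: `BaseTransport` and `WrappingTransport` are hypothetical stand-ins for `RdmaTransport` and `HeterogeneousRdmaTransport`, and the `submitTransfer` signature here is invented for illustration.

```cpp
#include <cassert>

// Hypothetical, heavily reduced stand-in for the wrapped transport.
class BaseTransport {
   public:
    // Public entry point: any caller may use it, so no friend
    // declaration is required for a wrapper to reach it.
    int submitTransfer(int batch_id) {
        ++submitted_;
        return batch_id;
    }

    int submittedCount() const { return submitted_; }

   private:
    int submitted_ = 0;  // stays encapsulated
};

// Hypothetical wrapper that only delegates through the public interface.
class WrappingTransport {
   public:
    explicit WrappingTransport(BaseTransport* inner) : inner_(inner) {
        assert(inner_ != nullptr && "wrapper needs a transport to delegate to");
    }

    int submitTransfer(int batch_id) {
        return inner_->submitTransfer(batch_id);
    }

   private:
    BaseTransport* inner_;
};
```

Because the wrapper compiles against the public methods alone, the private state of the wrapped class remains inaccessible to it, which is the encapsulation benefit the reviewer is asking to preserve.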

Contributor Author

The corresponding modifications have been submitted.

public:
using BufferDesc = TransferMetadata::BufferDesc;
using SegmentDesc = TransferMetadata::SegmentDesc;
@@ -4,13 +4,21 @@
namespace mooncake {

namespace {
+ // ACL memory location types for aclrtPtrAttributes::location.type:
+ // 0 = ACL host memory, 1 = device memory, 2 = regular CPU memory (malloc)
+ static constexpr int kDeviceMemoryLocationType = 1;

bool isCpuMemory(void *addr) {
aclrtPtrAttributes attributes{};
if (int ret = aclrtPointerGetAttributes(addr, &attributes)) {
- LOG(ERROR) << "aclrtPointrtGetAttributes error, ret: " << ret;
- return false;
+ // If ACL cannot identify the pointer, it is not ACL-managed device
+ // memory, so treat it as regular CPU memory.
+ LOG(WARNING) << "aclrtPointerGetAttributes failed for addr " << addr
Collaborator

Please reformat the string concatenation in the example file.

Contributor Author

done.

+ << ", ret: " << ret
Contributor

high

The logic for determining if a memory address belongs to the CPU has been updated. The previous implementation only checked for ACL_HOST memory (type == 0) and incorrectly returned false on aclrtPointerGetAttributes failure. The new logic correctly treats any memory that is not explicitly device memory (type != 1) as CPU memory. This includes ACL_HOST memory (type == 0), regular CPU memory (type == 2), and cases where aclrtPointerGetAttributes fails. This change is crucial for correctness.
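The classification rule described in this comment can be exercised in isolation. The following is a minimal sketch, not the code from this PR: `FakePtrAttributes`, `AttributeQuery`, and `makeQuery` are hypothetical stand-ins so that the example compiles without the ACL headers, and the location-type values (0, 1, 2) are taken from the code comment in the diff.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-ins for the ACL attribute structs.
struct FakeLocation {
    int type;
};
struct FakePtrAttributes {
    FakeLocation location;
};

// Per the diff comment: 0 = ACL host, 1 = device, 2 = regular CPU memory.
constexpr int kDeviceMemoryLocationType = 1;

// Modeled on aclrtPointerGetAttributes: returns 0 on success.
using AttributeQuery = std::function<int(void*, FakePtrAttributes*)>;

// Anything that is not positively identified as device memory is
// treated as CPU memory, including the case where the query fails.
bool isCpuMemory(void* addr, const AttributeQuery& query) {
    assert(static_cast<bool>(query) && "query must be callable");
    FakePtrAttributes attributes{};
    if (query(addr, &attributes) != 0) {
        return true;  // unknown to the runtime, so not device memory
    }
    return attributes.location.type != kDeviceMemoryLocationType;
}

// Hypothetical helper: builds a query that reports a fixed return code
// and location type, for trying out each branch.
AttributeQuery makeQuery(int ret, int type) {
    return [ret, type](void*, FakePtrAttributes* out) {
        out->location.type = type;
        return ret;
    };
}
```

Under this rule, only a successful query that reports type 1 yields `false`; types 0 and 2 and any failed query all yield `true`.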

+ << ". Treating as CPU memory.";
+ return true;
}
- return (attributes.location.type == 0);
+ return (attributes.location.type != kDeviceMemoryLocationType);
}
} // namespace
