@MengqingCao MengqingCao commented Dec 19, 2024

part of #11162

This PR provides a base class CommunicatorBase for device-specific communicators (HpuCommunicator, TpuCommunicator and XpuCommunicator), avoiding the cumbersome per-device dispatch in each communication op of GroupCoordinator, e.g.,
https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L342-L353

In this PR, the communication-related classes are organized as shown in the figure below. This allows new backends to implement their own communicators and dispatch them dynamically in the platform layer.
[figure: organization of the communication-related classes]
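The dispatch pattern the PR describes might be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the method name all_reduce and the constructor signature are assumptions, and the real vLLM implementations operate on torch tensors and process groups.

```python
from abc import ABC, abstractmethod


class CommunicatorBase(ABC):
    """Sketch of a base class for device-specific communicators."""

    @abstractmethod
    def all_reduce(self, input_):
        """Perform an all-reduce on the device's tensors."""


class TpuCommunicator(CommunicatorBase):
    def all_reduce(self, input_):
        # A real implementation would call torch_xla collectives;
        # identity keeps this sketch self-contained.
        return input_


class GroupCoordinator:
    def __init__(self, communicator: CommunicatorBase):
        # The platform layer picks the concrete communicator once,
        # so each collective op needs no per-device if/elif dispatch.
        self.communicator = communicator

    def all_reduce(self, input_):
        return self.communicator.all_reduce(input_)


coord = GroupCoordinator(TpuCommunicator())
print(coord.all_reduce(42))  # → 42
```

With this shape, adding a new backend means subclassing CommunicatorBase and registering the subclass in the platform, rather than editing every collective op in GroupCoordinator.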

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small but essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge

🚀

@mergify

mergify bot commented Jan 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MengqingCao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 15, 2025
@mergify

mergify bot commented Feb 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MengqingCao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@MengqingCao
Contributor Author

CI failed due to network issues. This PR is ready for review now, thanks in advance! @youkaichao

Member

@Yikun Yikun left a comment


@youkaichao Would you mind taking another look?

Or, if you're worried that the code changes are too big and would like us to split the PR, for example:

  • a separate PR for CommunicatorBase and the interface change
  • adapt cuda/rocm, hpu, tpu and xpu separately, split into 3 follow-up PRs

Please let us know; we'd be glad to do so.

@youkaichao
Member

Sorry, I've been super busy recently. Will review this week.

@youkaichao youkaichao marked this pull request as draft February 13, 2025 06:09
            f"{current_platform.device_type}:{local_rank}")
    else:
        import torch_xla.core.xla_model as xm
        self.device = xm.xla_device(local_rank)
Contributor Author


Hi @youkaichao, I'm not sure if the initialization of self.device is correct for neuron, openvino and tpu devices. I'd appreciate your help!
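For reference, the non-XLA vs. XLA split in the hunk above could look like this standalone sketch. The helper name and the exact branch condition are assumptions for illustration, not the PR's actual code; only the torch_xla path applies to TPU-like devices.

```python
import torch


def init_device(device_type: str, local_rank: int):
    # Hypothetical helper: most backends accept a "type:index" string,
    # but XLA-backed devices (e.g. TPU) must be created via torch_xla.
    if device_type != "xla":
        return torch.device(f"{device_type}:{local_rank}")
    import torch_xla.core.xla_model as xm  # only present on XLA builds
    return xm.xla_device(local_rank)


print(init_device("cuda", 0))  # → cuda:0
```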
