@MengqingCao MengqingCao commented Dec 19, 2024

part of #11162

This PR provides a base class CommunicatorBase for device-specific communicators (HpuCommunicator, TpuCommunicator and XpuCommunicator), avoiding the cumbersome per-device dispatch in each communication op of GroupCoordinator, e.g.,
https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L342-L353

In this PR, the communication-related classes are organized as shown in the figure below. This allows new backends to implement their own communicators and dispatch them dynamically in the platform layer.
[figure: organization of the communication-related classes]
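The dispatch pattern the PR describes might be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the method name all_reduce and the constructor signature are assumptions, and the real vLLM implementations operate on torch tensors and process groups.

```python
from abc import ABC, abstractmethod


class CommunicatorBase(ABC):
    """Sketch of a base class for device-specific communicators."""

    @abstractmethod
    def all_reduce(self, input_):
        """Perform an all-reduce on the device's tensors."""


class TpuCommunicator(CommunicatorBase):
    def all_reduce(self, input_):
        # A real implementation would call torch_xla collectives;
        # identity keeps this sketch self-contained.
        return input_


class GroupCoordinator:
    def __init__(self, communicator: CommunicatorBase):
        # The platform layer picks the concrete communicator once,
        # so each collective op needs no per-device if/elif dispatch.
        self.communicator = communicator

    def all_reduce(self, input_):
        return self.communicator.all_reduce(input_)


coord = GroupCoordinator(TpuCommunicator())
print(coord.all_reduce(42))  # → 42
```

With this shape, adding a new backend means subclassing CommunicatorBase and registering the subclass in the platform, rather than editing every collective op in GroupCoordinator.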

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small but essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge

🚀

@mergify

mergify bot commented Jan 15, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MengqingCao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 15, 2025
@mergify

mergify bot commented Feb 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MengqingCao.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@MengqingCao
Contributor Author

CI failed due to network issues. This PR is ready for review now, thanks in advance! @youkaichao

Member

@Yikun Yikun left a comment


@youkaichao Would you mind taking another look?

Or, if you're worried that the code changes are too big and would like us to split the PR, for example:

  • a separate PR for CommunicatorBase and the interface change
  • adapt cuda/rocm, hpu, tpu and xpu separately, split into 3 follow-up PRs

Please let us know; we'd be glad to do so.

@youkaichao
Member

Sorry, I've been super busy recently. Will review this week.

@youkaichao youkaichao marked this pull request as draft February 13, 2025 06:09
            f"{current_platform.device_type}:{local_rank}")
    else:
        import torch_xla.core.xla_model as xm
        self.device = xm.xla_device(local_rank)
Contributor Author


Hi @youkaichao, I'm not sure if the initialization of self.device is correct for neuron, openvino and tpu devices. I'd appreciate your help!
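For reference, the non-XLA vs. XLA split in the hunk above could look like this standalone sketch. The helper name and the exact branch condition are assumptions for illustration, not the PR's actual code; only the torch_xla path applies to TPU-like devices.

```python
import torch


def init_device(device_type: str, local_rank: int):
    # Hypothetical helper: most backends accept a "type:index" string,
    # but XLA-backed devices (e.g. TPU) must be created via torch_xla.
    if device_type != "xla":
        return torch.device(f"{device_type}:{local_rank}")
    import torch_xla.core.xla_model as xm  # only present on XLA builds
    return xm.xla_device(local_rank)


print(init_device("cuda", 0))  # → cuda:0
```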
