
implement collective all_to_all op #9442

Open: bfolie wants to merge 1 commit into master

Conversation

@bfolie (Collaborator) commented Jul 2, 2025

@pgmoka (Collaborator) left a comment:

LGTM

@yaoshiang (Collaborator) left a comment:

It seems like the implementation has always been there but just not exposed? Any thoughts on why that might be?


return [t.cpu() for t in output_tensors]

def test_all_to_all(self):

A collaborator commented on the diff above:

Do you think a performance test might be necessary here to ensure there's no unforeseen bottleneck creating latency/throughput issues?

@bfolie (Collaborator, Author) replied Jul 8, 2025:

Yes. Stage 1 of this project is to improve op coverage. Stage 2 is to rigorously benchmark how the collective ops scale and identify any bottlenecks. I've been reassigned to work on the new repo, so I won't be doing Stage 2, at least in the near future.
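
For the record, a latency check along those lines could look roughly like the sketch below. This is a hypothetical illustration, not part of this PR: it assumes one process per XLA device launched via xmp.spawn, and that the process group is created with the xla backend.

```python
# Hypothetical all_to_all latency sketch (illustration only, not from this PR).
import time

import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # noqa: F401  registers the 'xla' backend
import torch_xla.distributed.xla_multiprocessing as xmp


def _bench(index):
  # One process per XLA device; collectives are routed through the XLA backend.
  dist.init_process_group("xla", init_method="xla://")
  device = xm.xla_device()
  world = dist.get_world_size()

  inputs = [torch.full((1 << 20,), float(index), device=device) for _ in range(world)]
  outputs = [torch.empty_like(t) for t in inputs]

  # Warm up so compilation time is excluded from the measurement.
  for _ in range(3):
    dist.all_to_all(outputs, inputs)
    xm.mark_step()
  xm.wait_device_ops()

  start = time.perf_counter()
  for _ in range(10):
    dist.all_to_all(outputs, inputs)
    xm.mark_step()
  xm.wait_device_ops()
  if index == 0:
    print(f"mean all_to_all latency: {(time.perf_counter() - start) / 10 * 1e3:.2f} ms")


if __name__ == "__main__":
  xmp.spawn(_bench)
```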

@@ -359,6 +359,36 @@ def test_all_to_all_single(self, use_dynamo):
expected.sort().values),
f"Got {val}, expected {expected}")

@staticmethod

A collaborator commented on the diff above:

Are tests in pjrt/ designed to run on some non-trivial distributed setup?

@bfolie (Collaborator, Author) replied Jul 8, 2025:

In some cases, yes. The tests in this file are run by tpu/run_tests.sh and expect multiple TPUs. Some of the other files in pjrt/ are part of the basic test suite (example).
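
For readers unfamiliar with that harness, the per-device pattern in these pjrt/ tests looks roughly like the sketch below. Treat it as an illustration: the helper name pjrt.run_multiprocess and its return format are assumptions based on the surrounding tests, not something introduced by this diff.

```python
# Illustrative sketch of the per-device test pattern under pjrt/ (assumptions noted above).
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
from torch_xla._internal import pjrt


def _device_string():
  # Runs once in each spawned process, i.e. once per local XLA device.
  return f"ordinal {xr.global_ordinal()}: {xm.xla_device()}"


if __name__ == "__main__":
  # run_multiprocess spawns one process per device and collects the per-ordinal results.
  results = pjrt.run_multiprocess(_device_string)
  print(results)
```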

@bfolie (Collaborator, Author) commented Jul 8, 2025:

> It seems like the implementation has always been there but just not exposed? Any thoughts on why that might be?

How do you mean? There are two torch.distributed functions: all_to_all_single and all_to_all. The former is already implemented and exposed. This PR implements the latter. Both use the same underlying xla function, xla_model.all_to_all.
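
To make the distinction concrete, here is a minimal sketch contrasting the two entry points. It is illustrative only: tensor shapes are arbitrary, and it assumes it runs inside a worker process where the process group has already been initialized with the xla backend (e.g., dist.init_process_group("xla", init_method="xla://")).

```python
# Illustrative contrast between the two torch.distributed entry points (not from this PR).
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm


def demo_all_to_all_variants():
  device = xm.xla_device()
  world = dist.get_world_size()

  # all_to_all_single: one contiguous tensor, implicitly split into world-size chunks.
  inp = torch.arange(4 * world, dtype=torch.float32, device=device)
  out = torch.empty_like(inp)
  dist.all_to_all_single(out, inp)

  # all_to_all (what this PR exposes for the XLA backend): explicit per-rank tensor lists.
  input_list = list(inp.chunk(world))
  output_list = [torch.empty_like(t) for t in input_list]
  dist.all_to_all(output_list, input_list)

  # Per the comment above, both calls lower to the same xla_model.all_to_all collective.
  return [t.cpu() for t in output_list]
```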
