Implement collective gather op #9435

bfolie · 2025-07-01T21:46:28Z

bfolie · 2025-07-01T23:21:24Z

test/pjrt/test_collective_ops_tpu.py

-  @staticmethod
-  def _scatter():
-    dist.init_process_group("xla", init_method='xla://')
-    device = torch_xla.device()
-    world_size = xr.world_size()
-    tensors = None
-    if xr.global_ordinal() == 0:
-      tensors = [
-          torch.tensor([i], device=device, dtype=torch.float)
-          for i in range(world_size)
-      ]
-
-    output_tensor = torch.tensor([-1], dtype=torch.float, device=device)
-    dist.scatter(output_tensor, tensors, src=0)
-    return output_tensor.cpu()
-
-  def test_scatter(self):
-    """self._scatter instantiates a list of tensors [[0], [1], ..., [n-1]]
-    on device 0, then scatters it. Device i should therefore receive [i]."""
-    results = pjrt.run_multiprocess(self._scatter)
-    for ordinal, value in results.items():
-      np.testing.assert_array_equal(value, [ordinal])
-


Just moving this test into the appropriate class

bfolie · 2025-07-01T23:21:33Z

test/pjrt/test_collective_ops_tpu.py

+  @staticmethod
+  def _scatter():
+    dist.init_process_group("xla", init_method='xla://')
+    device = torch_xla.device()
+    world_size = xr.world_size()
+    tensors = None
+    if xr.global_ordinal() == 0:
+      tensors = [
+          torch.tensor([i], device=device, dtype=torch.float)
+          for i in range(world_size)
+      ]
+
+    output_tensor = torch.tensor([-1], dtype=torch.float, device=device)
+    dist.scatter(output_tensor, tensors, src=0)
+    return output_tensor.cpu()
+
+  def test_scatter(self):
+    """self._scatter instantiates a list of tensors [[0], [1], ..., [n-1]]
+    on device 0, then scatters it. Device i should therefore receive [i]."""
+    results = pjrt.run_multiprocess(self._scatter)
+    for ordinal, value in results.items():
+      np.testing.assert_array_equal(value, [ordinal])


copied from above
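For readers skimming the moved test, the scatter semantics that its docstring describes can be modeled without any devices. This is only a rough sketch: `simulate_scatter` is a hypothetical helper, plain Python lists stand in for tensors, and nothing here calls `dist.scatter` or torch_xla.

```python
def simulate_scatter(tensors_on_src, world_size):
    """Return a dict mapping each rank to the chunk it would receive.

    Hypothetical stand-in for dist.scatter: the source rank holds one
    entry per rank, and rank i receives entry i.
    """
    if len(tensors_on_src) != world_size:
        raise ValueError("scatter needs exactly one input per rank")
    return {rank: tensors_on_src[rank] for rank in range(world_size)}


world_size = 4
# Mirrors the test: the source holds [[0], [1], ..., [n-1]].
results = simulate_scatter([[float(i)] for i in range(world_size)], world_size)
for ordinal, value in results.items():
    # Same check as test_scatter: rank i ends up with [i].
    assert value == [ordinal]
```

The dict keyed by ordinal mirrors what the test iterates over in `results.items()`.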

bfolie · 2025-07-02T15:33:37Z

Failing tests are expected until the TPU CI cluster is updated to use Python 3.12. See #9434.

pgmoka · 2025-07-08T21:37:36Z

torch_xla/distributed/xla_backend.py

+          input_for_all_gather, dim=0, groups=self._mesh, pin_layout=False)
+      # Syncing is required to keep the heterogeneous copying below at the


NIT: Add space between code line and comment.

pgmoka · 2025-07-08T21:40:23Z

torch_xla/distributed/xla_backend.py

+    rank = xr.global_ordinal()
+
+    for i, input_tensor in enumerate(input_tensor_list):
+      is_scalar = input_tensor.dim() == 0


This check runs on every iteration of the loop.

If input_tensor_list is not empty, could we compute it once up front, e.g. is_scalar = input_tensor_list[0].dim() == 0?

It's not guaranteed that every element of input_tensor_list has the same shape, so some may be scalars and others not. They're basically independent gather operations.
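A small torch-free sketch of why the check has to stay per element. Shapes are modeled as plain tuples, with `()` standing in for a 0-dim tensor (the case where `input_tensor.dim() == 0`); `scalar_flags` and the example shapes are hypothetical and not part of the PR.

```python
def scalar_flags(shapes):
    """For each tensor shape, report whether it is 0-dimensional."""
    return [len(shape) == 0 for shape in shapes]


# Hypothetical coalesced input: a scalar, a vector, and a matrix.
shapes = [(), (3,), (2, 2)]

# Per-element check, as the loop in the PR does.
flags = scalar_flags(shapes)

# Checking only the first element would label every entry a scalar.
first_only = [len(shapes[0]) == 0] * len(shapes)

assert flags != first_only
```

With mixed inputs like these, hoisting the check out of the loop would mislabel the second and third entries.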

pgmoka · 2025-07-08T21:48:15Z

torch_xla/distributed/xla_backend.py

+    if rank == opts.rootRank:
+      return _ret_work(output_tensors_list)
+    else:
+      return _ret_work([[]])


What is going on here? On a first reading, it looks like any rank other than opts.rootRank returns an empty result.

If that is the case, could we add something like:

if rank != opts.rootRank: return _ret_work([[]])

at the beginning of the function, and avoid this if/else split as well as the one above?

Good point -- that would make the code simpler

Actually no -- all ranks, including the non-dst ones, still need to call the all_gather with their input and then sync. Only the copy into the output is specific to the dst rank, which means we can't return at the beginning of the function.
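The control flow described in this reply can be sketched as a plain-Python model. Everything here is an assumption for illustration: `simulate_gather` is a hypothetical helper, one input per rank is assumed, a list copy stands in for the all_gather collective, and an empty list stands in for the `_ret_work([[]])` result on non-root ranks.

```python
def simulate_gather(inputs_by_rank, root_rank):
    """Model a gather built on top of all_gather, one input per rank.

    inputs_by_rank[r] is the value rank r contributes. Returns a dict
    mapping each rank to the output it ends up holding.
    """
    world_size = len(inputs_by_rank)

    # Step 1: the collective. Every rank contributes its input, root or
    # not; a rank that returned before this point would never take part.
    gathered = list(inputs_by_rank)

    # Step 2: only the root copies the gathered values into its output.
    outputs = {}
    for rank in range(world_size):
        outputs[rank] = list(gathered) if rank == root_rank else []
    return outputs


outputs = simulate_gather([10, 11, 12], root_rank=1)
assert outputs[1] == [10, 11, 12]  # root holds every rank's input
assert outputs[0] == [] and outputs[2] == []  # other ranks hold nothing
```

The branch on the root rank comes after the collective step, which is why it cannot be hoisted to the top of the function.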

bfolie added 4 commits June 30, 2025 16:54

first attempt at gather, hangs

5203f38

get gather working

e9fe828

make gather work for coalesced inputs

4562687

add some more comments to test

8173e71

bfolie mentioned this pull request Jul 1, 2025

[RFC] Improved coverage for native distributed collective operations #9315

Open

bfolie added 2 commits July 1, 2025 21:54

format

39f1908

move scatter and gather tests into more appropriate class

1a6c3b6

bfolie requested a review from pgmoka July 1, 2025 23:08

bfolie commented Jul 1, 2025

bfolie requested a review from benawilson July 2, 2025 19:31

pgmoka reviewed Jul 8, 2025

Add space before comment

619d443

bfolie enabled auto-merge (squash) July 9, 2025 20:06

bfolie disabled auto-merge July 9, 2025 20:06
