
Commit df346f1

fix float8 training TP+SP integration tests
Summary:

These tests do not run in CI, and they broke some time ago. The issue was that each tensor was created on "cuda:0" instead of on the GPU corresponding to the local rank. For now, the fix is to set the device manually from the rank. There is probably a better way to do this, since the rank is supposed to be set automatically, but that is left for a future PR, as is adding these tests to CI.

Test Plan:

```bash
./test/float8/test_dtensor.sh
```

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 39bd880
ghstack-comment-id: 2991778315
Pull Request resolved: #2414
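For context, here is a minimal sketch of the fixed `setup_distributed` helper, assuming a single-node run where `torch.distributed.get_rank()` matches the local GPU index (the same assumption the commit message flags as worth revisiting); reading `WORLD_SIZE` from the environment is an assumption about how the launcher (e.g. torchrun) provides it, not code from the PR:

```python
import os

import torch
from torch.distributed.device_mesh import init_device_mesh


def setup_distributed():
    # WORLD_SIZE is set by the launcher (e.g. torchrun); hypothetical here.
    world_size = int(os.environ["WORLD_SIZE"])
    device_mesh = init_device_mesh("cuda", (world_size,))
    # seed must be the same in all processes
    torch.manual_seed(1)
    # Without this, bare "cuda" resolves to "cuda:0" on every rank, so all
    # processes allocate tensors on the same GPU. Pinning the current device
    # to the (single-node) rank gives each process its own GPU.
    local_rank = torch.distributed.get_rank()
    torch.cuda.set_device(local_rank)
    return device_mesh
```

With `torch.cuda.set_device` in place, subsequent allocations on the bare `"cuda"` device land on the per-rank GPU, which is what the TP+SP DTensor tests rely on.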

2 files changed: +4 -0 lines changed

test/float8/test_dtensor.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -67,6 +67,8 @@ def setup_distributed():
     device_mesh = init_device_mesh("cuda", (world_size,))
     # seed must be the same in all processes
     torch.manual_seed(1)
+    local_rank = torch.distributed.get_rank()
+    torch.cuda.set_device(local_rank)
     return device_mesh
 
 
```
test/float8/test_fsdp2_tp.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -46,6 +46,8 @@ def setup_distributed():
     )
     # seed must be the same in all processes
     torch.manual_seed(1)
+    local_rank = torch.distributed.get_rank()
+    torch.cuda.set_device(local_rank)
     return device_mesh
 
 
```