Commit 8c7fa2f
Sync collectives refactoring (#2039)
Summary:
Pull Request resolved: #2039
Reland of D57564130
**What is changed after revert**:
Torch Library cannot be used inside Deploy.
All operator definitions and autograd registrations in comm_ops.py are now guarded with `if not torch._running_with_deploy():`
**Catching deploy compat on diff test/land**: D57773561
**Previous diff Summary:**
The diff refactors torchrec sync collectives and addresses issues with missing wait_tensor() for backward:
- Refactoring using latest Torchrec Library Custom Op API with PT2 compatibility
- Removing non-native functional collective calls (c10d_functional): only native ones exist in PyTorch now, and non-native calls are redispatched to native.
- Adding test cases for compiled-with-noncompiled ranks (in case of compilation failure on one of the ranks)
Issues fixed:
- Sync collectives eager backward did not produce gradient -> Fixed
- Support gradient_division in sync collectives and its compilation -> Done
- Test coverage of sync collectives comparing results with async collectives and compilation.
- Fixed missing wait_tensor(), which triggered this warning:
```
W0520 07:16:25.135696 2546100 Functional.cpp:51] Warning: At the time of process termination, there are still 1 unwaited c10d_functional collective calls. Please review your program to ensure c10d_functional.wait_tensor() is invoked on all tensors returned from c10d_functional collective ops before they are used. (function ~WorkRegistry)
ok
```
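To illustrate what the warning is counting, here is a minimal pure-Python sketch of the bookkeeping. It is a toy model only, not the c10d implementation: `ToyWorkRegistry`, `launch`, `wait`, and `unwaited` are hypothetical names introduced for this example. It assumes that every collective result is tracked until it is waited on, and that anything still tracked at shutdown is reported as unwaited.

```python
class ToyWorkRegistry:
    """Toy stand-in for a registry of in-flight collective results (hypothetical)."""

    def __init__(self):
        self._pending = {}  # handle id -> description of the collective
        self._next_id = 0

    def launch(self, op_name):
        # Record a new collective and hand back an opaque handle.
        handle = self._next_id
        self._next_id += 1
        self._pending[handle] = op_name
        return handle

    def wait(self, handle):
        # Waiting removes the entry; waiting twice is a harmless no-op.
        self._pending.pop(handle, None)

    def unwaited(self):
        # Number of collectives that were launched but never waited on.
        return len(self._pending)


registry = ToyWorkRegistry()

# Forward pass: the collective is launched and waited on.
fwd = registry.launch("all_reduce (forward)")
registry.wait(fwd)

# Backward pass without the fix: launched, but the wait is missing.
bwd = registry.launch("all_reduce (backward)")
unwaited_before_fix = registry.unwaited()

# With the fix, the backward result is waited on as well.
registry.wait(bwd)
unwaited_after_fix = registry.unwaited()
```

In this model the count at exit corresponds to the "still 1 unwaited" figure in the log above: one backward collective whose result was never waited on.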
Reviewed By: ezyang
Differential Revision: D57774293
fbshipit-source-id: 76da888f4b6e876aa1ad170857e7db76ac418122
1 parent f24c8dc commit 8c7fa2f
2 files changed (+431 -515 lines):
- torchrec/distributed
- tests