# [RFC] Generalization of PyTorch framework UTs for non-CUDA device execution

**Authors:**
* @ankurneog


## **Summary**
Modify the PyTorch framework UTs so that non-CUDA devices such as Intel Gaudi and Intel XPU can harness the content and help improve its quality.


## **Motivation**
The PyTorch framework UTs are a good indicator of device stack health; however, they are mostly written for CPU and CUDA devices, which restricts their use for non-CUDA devices.

We propose to modify the content wherever possible to make it available for non-CUDA device execution.

This will also encourage greater participation in content enhancement.

## **Proposed Implementation**
Since the content is huge, we propose a staggered approach to the implementation.
Steps:
* Remove the restriction imposed through `@onlyNativeDeviceTypes` in core content and replace it with hooks so that supported devices can enable their content selectively. These hooks should be flexible enough to support both in-tree and out-of-tree devices (see the registration sketch after this list).
* Dtypes for a device should be loaded dynamically per op from a common dictionary, instead of a separate variable per device such as `dtypesIfCUDA`.
* Miscellaneous decorators such as `@skipIfCuda` should be generalized, e.g. to a device-agnostic `@skipIfDevice`.
* Extend the use of `instantiate_device_type_tests` to all content, so that developers are forced to write generalized device code rather than hard-coding "cuda" or "cpu" (see the test sketch after this list).
* Generalize common distributed content so that it can be extended to non-NCCL backends such as Intel's `hccl` and `ccl` (see the distributed sketch after this list).
* Generalize the Dynamo content that targets specific compile backends, so that other devices can verify the existing content as well (see the compile-backend sketch after this list).
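
To make the first step concrete, the sketch below shows how an out-of-tree device could plug into the device-type test framework once the `@onlyNativeDeviceTypes` restriction is replaced by a hook. `DeviceTypeTestBase` is the existing base class behind the CPU and CUDA test bases; the `hpu` device string and the registration step are assumptions, since defining that hook is exactly what this step proposes.

```python
# Minimal sketch, assuming Intel Gaudi is exposed as the "hpu" device type.
from torch.testing._internal.common_device_type import DeviceTypeTestBase


class HPUTestBase(DeviceTypeTestBase):
    # Device string that generated tests receive as their `device` argument.
    device_type = "hpu"

    @classmethod
    def get_primary_device(cls):
        return "hpu:0"


# Hypothetical hook: the framework would pick up HPUTestBase (for example from
# an out-of-tree plug-in file) and instantiate_device_type_tests() would then
# generate tests for "hpu" without any per-device edits in core test files.
```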
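
The test sketch below shows what generalized content looks like with the existing `instantiate_device_type_tests` and `@dtypes` helpers; it also illustrates the proposed common dtype dictionary. The `OP_DTYPES` table is not an existing PyTorch structure, only a sketch of what could replace per-device variables such as `dtypesIfCUDA`.

```python
import torch
from torch.testing._internal.common_device_type import (
    dtypes,
    instantiate_device_type_tests,
)
from torch.testing._internal.common_utils import TestCase, run_tests

# Illustration only: a common per-op dtype table that a future dispatcher could
# consult per device, instead of per-device decorators like @dtypesIfCUDA.
OP_DTYPES = {
    "add": {
        "default": (torch.float32,),
        "cuda": (torch.float32, torch.half),
        "hpu": (torch.float32, torch.bfloat16),  # assumed Gaudi coverage
    },
}


class TestAddGeneric(TestCase):
    # The test body only sees a generic `device` string, so the same code runs
    # on cpu, cuda, or any registered out-of-tree device.
    @dtypes(*OP_DTYPES["add"]["default"])
    def test_add(self, device, dtype):
        a = torch.ones(8, device=device, dtype=dtype)
        b = torch.full((8,), 2.0, device=device, dtype=dtype)
        self.assertEqual(a + b, torch.full((8,), 3.0, device=device, dtype=dtype))


# Generates TestAddGenericCPU, TestAddGenericCUDA, ... for every registered
# device type; no "cuda" or "cpu" literals appear in the test itself.
instantiate_device_type_tests(TestAddGeneric, globals())

if __name__ == "__main__":
    run_tests()
```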
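
For the distributed step, resolving the process-group backend from the device under test is one way the common content could avoid hard-coding `nccl`. The mapping below is an assumption; the `hccl` and `ccl` backends must be registered by the respective device integrations before they can be used.

```python
import os

import torch.distributed as dist

# Assumed device-type-to-backend mapping, for illustration only.
DEVICE_TO_BACKEND = {
    "cpu": "gloo",
    "cuda": "nccl",
    "hpu": "hccl",
    "xpu": "ccl",
}


def init_process_group_for(device_type: str) -> None:
    """Initialize the default process group for the device under test.

    Expects the usual torch.distributed environment variables
    (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) to be set by the launcher.
    """
    backend = DEVICE_TO_BACKEND.get(device_type, "gloo")
    dist.init_process_group(
        backend=backend,
        rank=int(os.environ.get("RANK", "0")),
        world_size=int(os.environ.get("WORLD_SIZE", "1")),
    )
```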
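
Similarly, Dynamo tests that hard-code a compile backend could resolve it from the device under test. `inductor` is the in-tree default for CPU and CUDA; the out-of-tree backend name below is only a placeholder, since each device registers its own `torch.compile` backend.

```python
import torch


def compile_for_device(model, device_type: str):
    # "inductor" covers cpu and cuda in-tree; other devices plug in their own
    # registered backend (the name used here is only a placeholder).
    backend = "inductor" if device_type in ("cpu", "cuda") else "custom_device_backend"
    return torch.compile(model, backend=backend)


# Usage inside a generalized test that only receives a `device` string:
#   model = torch.nn.Linear(4, 4).to(device)
#   compiled = compile_for_device(model, torch.device(device).type)
```
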
#### Metrics
Other devices can track their pass percentage and become part of the CI once coverage and the pass percentage are good.

#### Additional Context
Towards adding support for Intel Gaudi devices, we have already made a couple of changes in this regard.
* Removing onlyNativeDeviceTypes: https://github.com/pytorch/pytorch/pull/128584

* Changing Dynamo Content: https://github.com/pytorch/pytorch/pull/130714

* Generalizing Distributed Content: https://github.com/pytorch/pytorch/pull/131758

* Generalizing FSDP Content: https://github.com/pytorch/pytorch/pull/133209

More changes to follow.


### Next Steps
As part of introducing support for Intel Gaudi, which is an out-of-tree device, we are already introducing changes in a manner that can be reused by other devices as well.