# [RFC] Generalization of PyTorch framework UTs for non-CUDA device execution

**Authors:**
* @ankurneog


## **Summary**
Modify the PyTorch framework UTs so that non-CUDA devices such as Intel Gaudi and Intel XPU can harness the content and help improve its quality.


## **Motivation**
The PyTorch framework UTs are a good indicator of device stack health; however, they are mostly written for CPU and CUDA devices, which restricts their use for non-CUDA devices.

We propose to modify the content wherever possible to make it available for non-CUDA device execution.

This will also encourage greater participation in content enhancement.

## **Proposed Implementation**
Since the content is huge, we propose a staggered approach to the implementation.
Steps:
* Remove the restriction imposed through `@onlyNativeDeviceTypes` in core content and replace it with hooks so that supported devices can enable their content selectively. These hooks should be flexible enough to support both in-tree and out-of-tree devices (see the registration sketch after this list).
* Dtypes for a device should be loaded dynamically per op from a common dictionary, instead of a separate variable per device such as `dtypesIfCUDA`.
* Miscellaneous decorators such as `@skipIfCuda` should be generalized, e.g. to a device-agnostic `@skipIfDevice`.
* Extend the use of `instantiate_device_type_tests` to all content, so that developers are forced to write generalized device code rather than hard-coding "cuda" or "cpu" (see the test sketch after this list).
* Generalize common distributed content so that it can be extended to non-NCCL backends such as Intel's `hccl` and `ccl` (see the distributed sketch after this list).
* Generalize the Dynamo content that targets specific compile backends, so that other devices can verify the existing content as well (see the compile-backend sketch after this list).
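
To make the first step concrete, the sketch below shows how an out-of-tree device could plug into the device-type test framework once the `@onlyNativeDeviceTypes` restriction is replaced by a hook. `DeviceTypeTestBase` is the existing base class behind the CPU and CUDA test bases; the `hpu` device string and the registration step are assumptions, since defining that hook is exactly what this step proposes.

```python
# Minimal sketch, assuming Intel Gaudi is exposed as the "hpu" device type.
from torch.testing._internal.common_device_type import DeviceTypeTestBase


class HPUTestBase(DeviceTypeTestBase):
    # Device string that generated tests receive as their `device` argument.
    device_type = "hpu"

    @classmethod
    def get_primary_device(cls):
        return "hpu:0"


# Hypothetical hook: the framework would pick up HPUTestBase (for example from
# an out-of-tree plug-in file) and instantiate_device_type_tests() would then
# generate tests for "hpu" without any per-device edits in core test files.
```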
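
The test sketch below shows what generalized content looks like with the existing `instantiate_device_type_tests` and `@dtypes` helpers; it also illustrates the proposed common dtype dictionary. The `OP_DTYPES` table is not an existing PyTorch structure, only a sketch of what could replace per-device variables such as `dtypesIfCUDA`.

```python
import torch
from torch.testing._internal.common_device_type import (
    dtypes,
    instantiate_device_type_tests,
)
from torch.testing._internal.common_utils import TestCase, run_tests

# Illustration only: a common per-op dtype table that a future dispatcher could
# consult per device, instead of per-device decorators like @dtypesIfCUDA.
OP_DTYPES = {
    "add": {
        "default": (torch.float32,),
        "cuda": (torch.float32, torch.half),
        "hpu": (torch.float32, torch.bfloat16),  # assumed Gaudi coverage
    },
}


class TestAddGeneric(TestCase):
    # The test body only sees a generic `device` string, so the same code runs
    # on cpu, cuda, or any registered out-of-tree device.
    @dtypes(*OP_DTYPES["add"]["default"])
    def test_add(self, device, dtype):
        a = torch.ones(8, device=device, dtype=dtype)
        b = torch.full((8,), 2.0, device=device, dtype=dtype)
        self.assertEqual(a + b, torch.full((8,), 3.0, device=device, dtype=dtype))


# Generates TestAddGenericCPU, TestAddGenericCUDA, ... for every registered
# device type; no "cuda" or "cpu" literals appear in the test itself.
instantiate_device_type_tests(TestAddGeneric, globals())

if __name__ == "__main__":
    run_tests()
```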
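
For the distributed step, resolving the process-group backend from the device under test is one way the common content could avoid hard-coding `nccl`. The mapping below is an assumption; the `hccl` and `ccl` backends must be registered by the respective device integrations before they can be used.

```python
import os

import torch.distributed as dist

# Assumed device-type-to-backend mapping, for illustration only.
DEVICE_TO_BACKEND = {
    "cpu": "gloo",
    "cuda": "nccl",
    "hpu": "hccl",
    "xpu": "ccl",
}


def init_process_group_for(device_type: str) -> None:
    """Initialize the default process group for the device under test.

    Expects the usual torch.distributed environment variables
    (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) to be set by the launcher.
    """
    backend = DEVICE_TO_BACKEND.get(device_type, "gloo")
    dist.init_process_group(
        backend=backend,
        rank=int(os.environ.get("RANK", "0")),
        world_size=int(os.environ.get("WORLD_SIZE", "1")),
    )
```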
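
Similarly, Dynamo tests that hard-code a compile backend could resolve it from the device under test. `inductor` is the in-tree default for CPU and CUDA; the out-of-tree backend name below is only a placeholder, since each device registers its own `torch.compile` backend.

```python
import torch


def compile_for_device(model, device_type: str):
    # "inductor" covers cpu and cuda in-tree; other devices plug in their own
    # registered backend (the name used here is only a placeholder).
    backend = "inductor" if device_type in ("cpu", "cuda") else "custom_device_backend"
    return torch.compile(model, backend=backend)


# Usage inside a generalized test that only receives a `device` string:
#   model = torch.nn.Linear(4, 4).to(device)
#   compiled = compile_for_device(model, torch.device(device).type)
```
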
#### Metrics
Other devices can track their pass percentage and become part of the CI once coverage and the pass percentage are good.

#### Additional Context
Towards adding support for Intel Gaudi devices, we have already made a couple of changes in this regard.
* Removing onlyNativeDeviceTypes: https://github.com/pytorch/pytorch/pull/128584

* Changing Dynamo Content: https://github.com/pytorch/pytorch/pull/130714

* Generalizing Distributed Content: https://github.com/pytorch/pytorch/pull/131758

* Generalizing FSDP Content: https://github.com/pytorch/pytorch/pull/133209

More changes to follow.


### Next Steps
As part of introducing support for Intel Gaudi, which is an out-of-tree device, we are already introducing changes in a manner that can be reused by other devices as well.