[stale] Test backward pass nn model with dynamic input #4289
Conversation
Succeeds when we use dynamic test data but static training data: f82efbc.
Once I made the test data dynamic (0048f3b), the test failed with an error, which I'm going to look into.
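For context, a minimal sketch of the kind of test being described, assuming a PyTorch/XLA build with bounded dynamic shapes where `nonzero()` produces a tensor with a data-dependent first dimension (the model, sizes, and masking here are illustrative, not the actual test):

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = nn.Linear(4, 2).to(device)

# Static training data: fixed batch size.
x_train = torch.rand(8, 4, device=device)

# Dynamic test data: on dynamic-shape builds, nonzero() yields a tensor
# whose first dimension is a bounded SymInt, so the batch size is dynamic.
mask = torch.rand(8, device=device) > 0.5
x_test = x_train[mask.nonzero().squeeze(1)]

loss = model(x_test).sum()
loss.backward()  # the backward pass under test
xm.mark_step()
```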
[SPMD] Performance debugging:
* Check & skip sharding annotation if the same sharding already exists.
* Create a tensor node or re-use the current IR in case of an empty IR node.
* Test trigger and schedule with Terraform
* Support multiple TF patches.
* Tag artifacts with date and platform
* Add requests requirement
* Fix docker tagging.
* Put trigger config in a module
* Document schedule string
* Update SA in config
Summary: This patch lets XLAGraphExecutor inherit from LazyGraphExecutor. A few things worth noting:
1. It replaces xla::util::ExceptionCleanup with torch::lazy::ExceptionCleanup.
2. It uses the upstream SyncTensorsConfig, SyncTensorCollection, and PostOrderData.
Test Plan: CI.
Summary: This patch tweaks the XLATensor class to use some identical methods from LazyTensor. A few things worth noting:
1. It renames CurrentXlaData() to CurrentDataHandle() so that we can reuse the latter.
2. It replaces data_ptr() with data(), which now returns a const shared_ptr& type.
3. It tweaks the XLATensor::Data class to inherit from LazyTensor::Data. To provide access to this core data member, we store two shared_ptrs: one in the LazyTensor class, typed as LazyTensor::Data, for base-class methods to access, and one in the XLATensor class, typed as the derived XLATensor::Data, to make it easier to access the extra XLA fields.
4. Methods removed from XLATensor: generation(), alias(), GetDevice(), GetUniqueId(), SetTensorData(), and GetNextTensorId().
Test Plan: CI.
* Update PJRT release notes with latest changes.
* Explain `broadcast_master_param` (a usage sketch follows).
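A hedged usage sketch of `broadcast_master_param`: under PJRT, each process initializes its own copy of the model, so the master replica's parameters are broadcast to the others before training. The entry-point name and call pattern below follow the experimental PJRT API of this era; treat the details as illustrative.

```python
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.experimental import pjrt

def _mp_fn():
    device = xm.xla_device()
    model = nn.Linear(4, 2).to(device)
    # Sync initial weights from the master replica so all processes
    # start training from identical parameters.
    pjrt.broadcast_master_param(model)

if __name__ == '__main__':
    pjrt.run_multiprocess(_mp_fn)
```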
* Add a GPU readme
* Add wheel instructions
* Add wheel link
* Update from 1.12 to 1.13.
* Update wheel from py 3.7 to py 3.8.
Looks like sigmoid_backward needs to support dynamism. Wdyt?
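For reference, a minimal sketch of what sigmoid_backward computes, written in plain PyTorch (this is the standard derivative, not the XLA lowering itself; supporting dynamism means this same formula must lower correctly when `out` has dynamic dimensions):

```python
import torch

def sigmoid_backward_reference(grad_output: torch.Tensor, out: torch.Tensor) -> torch.Tensor:
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); `out` is the saved
    # forward output, so grad_input = grad_output * out * (1 - out).
    return grad_output * out * (1.0 - out)
```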
Yeah. Also, for the error:
* Add ds model test to the CI.
* Fix linter.
* Add verbose flag.
* Fix PR comments.
* Fix linter.
* Fix the test.
* Disable the test on CPU.
* Fix linter.
* Revert "fix linter". This reverts commit 14e30de.
* Revert "disable the test on CPU." This reverts commit a49baf4.
* Improve tests.
* Should expect the test to fail.
* Make the test show which specific test fails via the verbose flag.
* Improve the test.
* Test on GPU was skipped; fixed it.
* Fix linter.
Summary: This patch reuses more upstream LazyGraphExecutor data structures, including:
1. TlsData
2. DeviceLocker
3. DeviceLockerArena
4. DataCacheArena
It also reuses the following methods:
1. DeviceBarrier()
2. GetDeviceData()
3. IncTrimCounter()
P.S. It needs pytorch/pytorch#90457 to work.
Test Plan: CI.
* Add note [Note: Re-using upstream TensorImpl]
* Run linter
* Remove test file
Summary: This patch reuses more upstream LazyGraphExecutor data structures, including:
1. TlsData
2. DeviceLocker
3. DeviceLockerArena
4. DataCacheArena
It also reuses the following methods:
1. DeviceBarrier()
2. GetDeviceData()
3. IncTrimCounter()
P.S. It needs pytorch/pytorch#90598 to work, and it tries to re-land #4303.
Test Plan: CI.
Summary: Adds sympy as one of the pytorch deps. Test Plan: CI.
* Disable all failing PyTorch tests
* Format
* Add one more symint expand test (a sketch follows this list).
* Fixed the test.
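A hedged sketch of what such a symint expand test might look like, assuming a dynamic-shape build where `nonzero()` yields a tensor whose first dimension is a bounded SymInt (the exact helpers in the real test may differ):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
t = torch.tensor([1, 0, 1, 1], device=device)

dyn = t.nonzero()      # first dimension is data-dependent
size = dyn.shape[0]    # a bounded SymInt on dynamic-shape builds

# expand() must accept the SymInt instead of a static Python int.
out = torch.ones(1, device=device).expand(size)
xm.mark_step()
```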
Summary: This patch inherits LazyGraphExecutor::DeviceContextArena and overrides the following methods:
1. GetLiveTensors, to return XLATensorPtrs.
2. GetRngSeed, to use our own + and * for torch::lazy::Value.
3. IrValueFromScalar, to use TensorToXlaData.
In addition, it has an extra method, GetBaseSeedData, that is used by dynamo. This patch needs pytorch/pytorch#90531 to function.
Test Plan: CI.
* Correctly configure PJRT execution when nprocs=1. Currently there is no easy way to run a model written with pjrt.run_multiprocess on a single core/device, which is useful when debugging a model failure. In the XRT world, we can pass `num_process` as 1 to `xmp.spawn`; this should be mirrored in PJRT (a sketch follows this list).
* Format python
* Incorporate review feedback
* Use executor until more review comments come in
* Update pjrt world size test.
* Remove subprocess executors for nprocs=1.
* Cleanup duplicate code and unneeded wrapper.
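For comparison, the existing XRT-side debugging workflow this mirrors, plus the intended PJRT call. `xmp.spawn(fn, nprocs=1)` is the real XRT knob; the `nprocs` handling on `run_multiprocess` is the behavior this change adds, so treat that line as illustrative.

```python
import torch_xla.distributed.xla_multiprocessing as xmp
from torch_xla.experimental import pjrt

def _mp_fn(index=0):
    ...  # model setup and training step go here

if __name__ == '__main__':
    # XRT world: single-process debugging via xmp.spawn.
    xmp.spawn(_mp_fn, nprocs=1)
    # PJRT equivalent after this change (illustrative):
    # pjrt.run_multiprocess(_mp_fn, nprocs=1)
```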
Summary: This patch tries to adopt even more LazyGraphExecutor virtual interfaces:
1. LazyGraphExecutor::Async.
2. TensorCollectionBarrier.
3. SyncLiveTensorsGraph.
It also adds comments on the methods we don't adopt. This depends on pytorch/pytorch#90650.
Test Plan: CI.
This error is not one I've seen before. At a guess, XLA's bounded symints don't correctly implement the equality/comparison operators, and a test the AD engine is doing is failing. If you log symint ops, it should become clear.
Summary: This patch overrides LazyGraphExecutor::RunPostOrder() and adds a few comments on why some methods differ from the upstream. This depends on pytorch/pytorch#90680.
Test Plan: CI.
I messed up this branch, so I created a new one.