[stale] Test backward pass nn model with dynamic input #4289

Closed
wants to merge 45 commits

Conversation

vanbasten23
Collaborator

No description provided.

@vanbasten23
Collaborator Author

The test succeeds when we use dynamic test data but static training data (f82efbc). It outputs:

Finished training. Got loss: 0.686253547668457
Finished testing, loss= 0.6257358193397522
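
For reference, here is a minimal sketch of that setup. It is hypothetical, not the actual test file; I'm assuming the dynamism comes from torch.nonzero, which on XLA returns a tensor whose first dimension is an upper-bounded dynamic size (shown as [<=80, 1]).

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

# Assumes dynamic shapes are enabled for nonzero, e.g. via the
# XLA_EXPERIMENTAL="nonzero:masked_select" env var used in this era.
device = xm.xla_device()
model = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid()).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Static training data: fixed [80, 1] shapes.
x_train = torch.rand(80, 1, device=device)
y_train = torch.ones(80, 1, device=device)

for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)
    loss.backward()
    optimizer.step()
    xm.mark_step()
print('Finished training. Got loss:', loss.item())

# Dynamic test data: torch.nonzero yields a bounded dynamic first
# dimension, e.g. [<=80, 1].
x_test = torch.nonzero(torch.ones(80, device=device)).float()
y_test = torch.ones_like(x_test)
with torch.no_grad():
    test_loss = criterion(model(x_test), y_test)
    xm.mark_step()
print('Finished testing, loss=', test_loss.item())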

@vanbasten23
Collaborator Author

Once I also made the training data dynamic (0048f3b), the test failed with this error:

root@t1v-n-2a2b95ef-w-0:/workspaces/work# python3 pytorch/xla/test/test_dynamic_shape_models.py
Traceback (most recent call last):
  File "pytorch/xla/test/test_dynamic_shape_models.py", line 78, in <module>
    train(model, loss_fn=criterion, optimizer=optimizer)
  File "pytorch/xla/test/test_dynamic_shape_models.py", line 65, in train
    loss.backward()
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/_tensor.py", line 484, in backward
    torch.autograd.backward(
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function SigmoidBackward0 returned an invalid gradient at index 0 - got [80, 1] but expected shape compatible with [<=80, 1]

which I'm going to look into.
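
Roughly, the failing change amounts to something like this (a hypothetical sketch, assuming torch.nonzero is again the source of dynamism):

# The training inputs now also have a bounded dynamic first dimension.
x_train = torch.nonzero(torch.ones(80, device=device)).float()  # [<=80, 1]
y_train = torch.ones_like(x_train)
# loss.backward() in train() then raises:
#   RuntimeError: Function SigmoidBackward0 returned an invalid gradient at
#   index 0 - got [80, 1] but expected shape compatible with [<=80, 1]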

vanbasten23 and others added 12 commits December 7, 2022 00:49
[SPMD] Performance debugging:
* Check & skip sharding annotation if the same sharding already exists.
* Create a tensor node or re-use the current IR in case of empty IR node.
* Test trigger and schedule with terraform

* Support multiple TF patches.

* Tag artifacts with date and platform

* Add requests requirement

* Fix docker tagging.

* Put trigger config in a module

* Document schedule string

* Update SA in config
Summary:
This patch lets XLAGraphExecutor inherit from LazyGraphExecutor. A few things worth noting:
1. It replaces xla::util::ExceptionCleanup with torch::lazy::ExceptionCleanup.
2. It uses upstream SyncTensorsConfig, SyncTensorCollection, and PostOrderData.

Test Plan:
CI.
Summary:
This patch tweaks the XLATensor class to reuse some identical methods from LazyTensor. Here are a few things worth noting:

1. It renames CurrentXlaData() to CurrentDataHandle() such that we can reuse the latter.
2. It replaces data_ptr() with data() which now returns a const shared_ptr& type.
3. It tweaks the XLATensor::Data class to inherit from LazyTensor::Data. To provide access to this core data member, we store two shared_ptrs: one in the LazyTensor class, typed as LazyTensor::Data, for base-class methods to access, and one in the XLATensor class, typed as the derived XLATensor::Data, so the XLA-specific extra fields are easy to reach.
4. Methods removed from XLATensor: generation(), alias(), GetDevice(), GetUniqueId(), SetTensorData(), and GetNextTensorId().

Test Plan:
CI.
* Update PJRT release notes with latest changes.

* Explain `broadcast_master_param`.
* Add a gpu readme

* add wheel instruction

* add wheel link
* update from 1.12 to 1.13.

* update wheel from py 3.7 to py 3.8
@miladm
Collaborator

miladm commented Dec 8, 2022

Looks like sigmoid_backward needs to support dynamism. Wdyt?

@vanbasten23
Collaborator Author

vanbasten23 commented Dec 8, 2022

Yeah. Also, for the error RuntimeError: Function SigmoidBackward0 returned an invalid gradient at index 0 - got [80, 1] but expected shape compatible with [<=80, 1], it fails at https://github.com/pytorch/pytorch/blob/912a1f7b2776c0e7ebf9038e4483a4aa709aa893/torch/csrc/autograd/engine.cpp#L812. Stacktrace: https://gist.github.com/vanbasten23/a68180922e9f4c554b92365c961c21a4
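
A rough Python paraphrase of what that check does (illustrative only, not the actual C++; I model the bounded dynamic size as a string just to show the failure mode):

# The gradient a backward node returns must be expandable to the shape
# recorded in the node's input metadata (trailing-aligned broadcast rule).
def is_expandable(grad_shape, expected_shape):
    if len(grad_shape) > len(expected_shape):
        return False
    for g, e in zip(reversed(grad_shape), reversed(expected_shape)):
        # Each gradient dim must match the expected dim or be 1. A concrete
        # 80 never compares equal to the bounded symbolic size "<=80".
        if g != e and g != 1:
            return False
    return True

print(is_expandable((80, 1), (80, 1)))      # True
print(is_expandable((80, 1), ('<=80', 1)))  # False -> the observed error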

* Add ds model test to the ci.

* fix linter

* add verbose flag.

* fix pr comments

* fix linter

* fix the test

* disable the test on CPU.

* fix linter

* Revert "fix linter"

This reverts commit 14e30de.

* Revert "disable the test on CPU."

This reverts commit a49baf4.

* improve tests.

* should expect the test to fail.

* make the test show which specific test fails via the verbose flag.

* improve the test

* The test on GPU was skipped; fixed it.

* fix linter
Summary:
This patch reuses more upstream LazyGraphExecutor data structures, including:
1. TlsData.
2. DeviceLocker.
3. DeviceLockerArena.
4. DataCacheArena.

And also reuses the following methods:
1. DeviceBarrier().
2. GetDeviceData().
3. IncTrimCounter().

P.S. It needs pytorch/pytorch#90457 to work.

Test Plan:
CI.
wonjoo-wj and others added 6 commits December 9, 2022 13:47
* Add note [Note: Re-using upstream TensorImpl]

* Run linter

* Remove test file
Summary:
This patch reuses more upstream LazyGraphExecutor data structures, including:
1. TlsData.
2. DeviceLocker.
3. DeviceLockerArena.
4. DataCacheArena.

And also reuses the following methods:
1. DeviceBarrier().
2. GetDeviceData().
3. IncTrimCounter().

P.S. It needs pytorch/pytorch#90598 to work. And it tries to re-land #4303.

Test Plan:
CI.
Summary:
Adds sympy as one of the pytorch deps.

Test Plan:
CI.
* disable all failing PyTorch tests

* format
@miladm
Collaborator

miladm commented Dec 13, 2022

We have a stack trace pointing to the error in pytorch/torch/csrc/autograd/engine.cpp; see the previous comment. @wconstab @ezyang, we are wondering if the autograd engine is missing dynamism support. Wdyt?

vanbasten23 and others added 4 commits December 12, 2022 17:38
* Add one more symint expand test.

* fixed the test.
Summary:
This patch inherits LazyGraphExecutor::DeviceContextArena and overrides the following methods:
1. GetLiveTensors to return XLATensorPtrs.
2. GetRngSeed to use our own + and * for torch::lazy::Value.
3. IrValueFromScalar to use TensorToXlaData.

In addition, it has an extra method: GetBaseSeedData that is used by dynamo.

This patch needs pytorch/pytorch#90531 to function.

Test Plan:
CI.
* Correctly configure PJRT execution when nprocs=1.

Currently there is no easy way to make a model written with
pjrt.run_multiprocess run on a single core/device. That is useful
when debugging a model failure.

In the XRT world, we can pass `num_process` as 1 to `xmp.spawn`, and
this should be mirrored in PJRT (see the usage sketch below).

* Format python

* Incorporate review feedback

* Use executor until more review comments come in

* Update pjrt world size test.

* Remove subprocess executors for nprocs=1.

* Cleanup duplicate code and unneeded wrapper.
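
A usage sketch of the resulting debugging flow (the model function here is a hypothetical placeholder):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    print(f'process {index} running on {device}')
    # ... build and debug the model on this single device ...

if __name__ == '__main__':
    # nprocs=1 runs on a single core/device instead of spawning the full
    # multiprocess group, mirroring the XRT behavior.
    xmp.spawn(_mp_fn, args=(), nprocs=1)
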
Summary:
This patch tries to adopt even more LazyGraphExecutor virtual interfaces:
1. LazyGraphExecutor::Async.
2. TensorCollectionBarrier.
3. SyncLiveTensorsGraph.

It also adds comments on the methods that we don't adopt.

This depends on pytorch/pytorch#90650.

Test Plan:
CI.
@ezyang
Collaborator

ezyang commented Dec 13, 2022

This error is not one I've seen before. At a guess, XLA's bounded SymInts don't correctly implement the equality/comparison operators, and a check the AD engine does is failing. If you log SymInt ops, it should become clear.
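
A toy illustration of that guess (a purely hypothetical BoundedSymInt class, not the real SymInt implementation):

class BoundedSymInt:
    """Models a dynamic size with an upper bound, e.g. <=80."""
    def __init__(self, upper_bound):
        self.upper_bound = upper_bound
    def __eq__(self, other):
        # Broken comparison: never equal to a plain int, even when the
        # runtime value matches, so autograd's shape check rejects [80, 1]
        # against [<=80, 1].
        return (isinstance(other, BoundedSymInt)
                and self.upper_bound == other.upper_bound)

print(BoundedSymInt(80) == 80)  # False -> "invalid gradient" error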

Summary:
This patch overrides LazyGraphExecutor::RunPostOrder() and adds a few comments on why some methods are different from the upstream.

This depends on pytorch/pytorch#90680.

Test Plan:
CI.
@vanbasten23 vanbasten23 changed the title Test backward pass model ds Test backward pass nn model with dynamic input Dec 13, 2022
@vanbasten23 vanbasten23 changed the title Test backward pass nn model with dynamic input [stale] Test backward pass nn model with dynamic input Dec 13, 2022
@vanbasten23
Collaborator Author

I messed up this branch, so I created a new one.
