[stale] Test backward pass nn model with dynamic input #4289

Closed
wants to merge 45 commits

Conversation

vanbasten23
Collaborator

No description provided.

@vanbasten23
Collaborator Author

The test succeeds when we use dynamic test data but static training data (f82efbc). It outputs:

Finished training. Got loss: 0.686253547668457
Finished testing, loss= 0.6257358193397522
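
For reference, here is a minimal sketch of that setup. It is hypothetical, not the actual test file; I'm assuming the dynamism comes from torch.nonzero, which on XLA returns a tensor whose first dimension is an upper-bounded dynamic size (shown as [<=80, 1]).

import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

# Assumes dynamic shapes are enabled for nonzero, e.g. via the
# XLA_EXPERIMENTAL="nonzero:masked_select" env var used in this era.
device = xm.xla_device()
model = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid()).to(device)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Static training data: fixed [80, 1] shapes.
x_train = torch.rand(80, 1, device=device)
y_train = torch.ones(80, 1, device=device)

for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)
    loss.backward()
    optimizer.step()
    xm.mark_step()
print('Finished training. Got loss:', loss.item())

# Dynamic test data: torch.nonzero yields a bounded dynamic first
# dimension, e.g. [<=80, 1].
x_test = torch.nonzero(torch.ones(80, device=device)).float()
y_test = torch.ones_like(x_test)
with torch.no_grad():
    test_loss = criterion(model(x_test), y_test)
    xm.mark_step()
print('Finished testing, loss=', test_loss.item())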

@vanbasten23
Collaborator Author

Once I also made the training data dynamic (0048f3b), the test failed with this error:

root@t1v-n-2a2b95ef-w-0:/workspaces/work# python3 pytorch/xla/test/test_dynamic_shape_models.py
Traceback (most recent call last):
  File "pytorch/xla/test/test_dynamic_shape_models.py", line 78, in <module>
    train(model, loss_fn=criterion, optimizer=optimizer)
  File "pytorch/xla/test/test_dynamic_shape_models.py", line 65, in train
    loss.backward()
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/_tensor.py", line 484, in backward
    torch.autograd.backward(
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function SigmoidBackward0 returned an invalid gradient at index 0 - got [80, 1] but expected shape compatible with [<=80, 1]

which I'm going to look into.
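
Roughly, the failing change amounts to something like this (a hypothetical sketch, assuming torch.nonzero is again the source of dynamism):

# The training inputs now also have a bounded dynamic first dimension.
x_train = torch.nonzero(torch.ones(80, device=device)).float()  # [<=80, 1]
y_train = torch.ones_like(x_train)
# loss.backward() in train() then raises:
#   RuntimeError: Function SigmoidBackward0 returned an invalid gradient at
#   index 0 - got [80, 1] but expected shape compatible with [<=80, 1]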

vanbasten23 and others added 12 commits December 7, 2022 00:49
[SPMD] Performance debugging:
* Check & skip sharding annotation if the same sharding already exists.
* Create a tensor node or re-use the current IR in case of empty IR node.
* Test trigger and schedule with terraform

* Support multiple TF patches.

* Tag artifacts with date and platform

* Add requests requirement

* Fix docker tagging.

* Put trigger config in a module

* Document schedule string

* Update SA in config
Summary:
This patch lets XLAGraphExecutor inherit from LazyGraphExecutor. A few things worth noting:
1. It replaces xla::util::ExceptionCleanup with torch::lazy::ExceptionCleanup.
2. It uses upstream SyncTensorsConfig, SyncTensorCollection, and PostOrderData.

Test Plan:
CI.
Summary:
This patch tweaks the XLATensor class to reuse some identical methods from LazyTensor. Here are a few things worth noting:

1. It renames CurrentXlaData() to CurrentDataHandle() such that we can reuse the latter.
2. It replaces data_ptr() with data() which now returns a const shared_ptr& type.
3. It tweaks the XLATensor::Data class to inherit from LazyTensor::Data. To provide access to this core data member, we store two shared_ptrs: one in the LazyTensor class, typed as LazyTensor::Data, for base-class methods to access, and one in the XLATensor class, typed as the derived XLATensor::Data, so the XLA-specific extra fields are easy to reach.
4. Methods removed from XLATensor: generation(), alias(), GetDevice(), GetUniqueId(), SetTensorData(), and GetNextTensorId().

Test Plan:
CI.
* Update PJRT release notes with latest changes.

* Explain `broadcast_master_param`.
* Add a gpu readme

* add wheel instruction

* add wheel link
* update from 1.12 to 1.13.

* update wheel from py 3.7 to py 3.8
@miladm
Collaborator

miladm commented Dec 8, 2022

Looks like sigmoid_backward needs to support dynamism. Wdyt?

@vanbasten23
Collaborator Author

vanbasten23 commented Dec 8, 2022

Yeah. Also, for the error RuntimeError: Function SigmoidBackward0 returned an invalid gradient at index 0 - got [80, 1] but expected shape compatible with [<=80, 1], it fails at https://github.com/pytorch/pytorch/blob/912a1f7b2776c0e7ebf9038e4483a4aa709aa893/torch/csrc/autograd/engine.cpp#L812. Stacktrace: https://gist.github.com/vanbasten23/a68180922e9f4c554b92365c961c21a4
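
A rough Python paraphrase of what that check does (illustrative only, not the actual C++; I model the bounded dynamic size as a string just to show the failure mode):

# The gradient a backward node returns must be expandable to the shape
# recorded in the node's input metadata (trailing-aligned broadcast rule).
def is_expandable(grad_shape, expected_shape):
    if len(grad_shape) > len(expected_shape):
        return False
    for g, e in zip(reversed(grad_shape), reversed(expected_shape)):
        # Each gradient dim must match the expected dim or be 1. A concrete
        # 80 never compares equal to the bounded symbolic size "<=80".
        if g != e and g != 1:
            return False
    return True

print(is_expandable((80, 1), (80, 1)))      # True
print(is_expandable((80, 1), ('<=80', 1)))  # False -> the observed error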

* Add ds model test to the ci.

* fix linter

* add verbose flag.

* fix pr comments

* fix linter

* fix the test

* disable the test on CPU.

* fix linter

* Revert "fix linter"

This reverts commit 14e30de.

* Revert "disable the test on CPU."

This reverts commit a49baf4.

* improve tests.

* should expect the test to fail.

* make the test show which specific test fails via the verbose flag.

* improve the test

* The test on GPU was skipped; fixed it.

* fix linter
Summary:
This patch reuses more upstream LazyGraphExecutor data structures, including:
1. TlsData.
2. DeviceLocker.
3. DeviceLockerArena.
4. DataCacheArena.

And also reuses the following methods:
1. DeviceBarrier().
2. GetDeviceData().
3. IncTrimCounter().

P.S. It needs pytorch/pytorch#90457 to work.

Test Plan:
CI.
wonjoo-wj and others added 6 commits December 9, 2022 13:47
* Add note [Note: Re-using upstream TensorImpl]

* Run linter

* Remove test file
Summary:
This patch reuses more upstream LazyGraphExecutor data structures, including:
1. TlsData.
2. DeviceLocker.
3. DeviceLockerArena.
4. DataCacheArena.

And also reuses the following methods:
1. DeviceBarrier().
2. GetDeviceData().
3. IncTrimCounter().

P.S. It needs pytorch/pytorch#90598 to work. And it tries to re-land #4303.

Test Plan:
CI.
Summary:
Adds sympy as one of the pytorch deps.

Test Plan:
CI.
* disable all failing PyTorch tests

* format
@miladm
Collaborator

miladm commented Dec 13, 2022

We have a stack trace pointing to the error in pytorch/torch/csrc/autograd/engine.cpp; see the previous comment. @wconstab @ezyang, we are wondering if the autograd engine is missing dynamism support. Wdyt?

vanbasten23 and others added 4 commits December 12, 2022 17:38
* Add one more symint expand test.

* fixed the test.
Summary:
This patch inherits LazyGraphExecutor::DeviceContextArena and overrides the following methods:
1. GetLiveTensors to return XLATensorPtrs.
2. GetRngSeed to use our own + and * for torch::lazy::Value.
3. IrValueFromScalar to use TensorToXlaData.

In addition, it has an extra method: GetBaseSeedData that is used by dynamo.

This patch needs pytorch/pytorch#90531 to function.

Test Plan:
CI.
* Correctly configure PJRT execution when nprocs=1.

Currently there is no easy way to make a model written with
pjrt.run_multiprocess run on a single core/device. That is useful
when debugging a model failure.

In the XRT world, we can pass `num_process` as 1 to `xmp.spawn`, and
this should be mirrored in PJRT (see the usage sketch below).

* Format python

* Incorporate review feedback

* Use executor until more review comments come in

* Update pjrt world size test.

* Remove subprocess executors for nprocs=1.

* Cleanup duplicate code and unneeded wrapper.
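
A usage sketch of the resulting debugging flow (the model function here is a hypothetical placeholder):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    print(f'process {index} running on {device}')
    # ... build and debug the model on this single device ...

if __name__ == '__main__':
    # nprocs=1 runs on a single core/device instead of spawning the full
    # multiprocess group, mirroring the XRT behavior.
    xmp.spawn(_mp_fn, args=(), nprocs=1)
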
Summary:
This patch tries to adopt even more LazyGraphExecutor virtual interfaces:
1. LazyGraphExecutor::Async.
2. TensorCollectionBarrier.
3. SyncLiveTensorsGraph.

It also adds comments on the methods that we don't adopt.

This depends on pytorch/pytorch#90650.

Test Plan:
CI.
@ezyang
Collaborator

ezyang commented Dec 13, 2022

This error is not one I've seen before. At a guess, XLA's bounded SymInts don't correctly implement the equality/comparison operators, and a check the AD engine does is failing. If you log SymInt ops, it should become clear.
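
A toy illustration of that guess (a purely hypothetical BoundedSymInt class, not the real SymInt implementation):

class BoundedSymInt:
    """Models a dynamic size with an upper bound, e.g. <=80."""
    def __init__(self, upper_bound):
        self.upper_bound = upper_bound
    def __eq__(self, other):
        # Broken comparison: never equal to a plain int, even when the
        # runtime value matches, so autograd's shape check rejects [80, 1]
        # against [<=80, 1].
        return (isinstance(other, BoundedSymInt)
                and self.upper_bound == other.upper_bound)

print(BoundedSymInt(80) == 80)  # False -> "invalid gradient" error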

Summary:
This patch overrides LazyGraphExecutor::RunPostOrder() and adds a few comments on why some methods are different from the upstream.

This depends on pytorch/pytorch#90680.

Test Plan:
CI.
@vanbasten23 vanbasten23 changed the title Test backward pass model ds Test backward pass nn model with dynamic input Dec 13, 2022
@vanbasten23 vanbasten23 changed the title Test backward pass nn model with dynamic input [stale] Test backward pass nn model with dynamic input Dec 13, 2022
@vanbasten23
Collaborator Author

I messed up this branch, so I created a new one.
