"CUDA error: invalid argument" on model test for vit_h_14 #7143

Open
YosuaMichael opened this issue Jan 27, 2023 · 10 comments
Comments

@YosuaMichael
Contributor

YosuaMichael commented Jan 27, 2023

We got the following error on the CI:

___________________ test_classification_model[cuda-vit_h_14] ___________________
Traceback (most recent call last):
  File "/work/test/test_models.py", line 732, in test_classification_model
    _check_input_backprop(model, x)
  File "/work/test/test_models.py", line 226, in _check_input_backprop
    out[0].sum().backward()
  File "/work/ci_env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/work/ci_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This error only occurs for the vit_h_14 model on the CUDA device (the CPU run is fine). I also cannot reproduce it on the AWS cluster machine. The error seems to be machine- or environment-dependent and is likely a pytorch-core issue.

Note: I have tried running the test with CUDA_LAUNCH_BLOCKING=1 but the error trace seems pretty similar (see here):

___________________ test_classification_model[cuda-vit_h_14] ___________________
Traceback (most recent call last):
  File "/work/test/test_models.py", line 725, in test_classification_model
    _check_input_backprop(model, x)
  File "/work/test/test_models.py", line 225, in _check_input_backprop
    out[0].sum().backward()
  File "/work/ci_env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/work/ci_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
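
For anyone who wants to repeat the run above, here is a minimal sketch of launching only the failing test with CUDA_LAUNCH_BLOCKING=1 set. The test ID is the one used later in this thread; the helper names `build_repro` and `run_repro` are made up for illustration and are not part of torchvision.

```python
import os
import subprocess
import sys

# Test ID as used in the pytest commands quoted in this thread.
TEST_ID = "test/test_models.py::test_classification_model[cuda-vit_h_14]"


def build_repro(blocking=True, base_env=None):
    """Return (cmd, env) for running the single failing test.

    Hypothetical helper: when `blocking` is True, CUDA_LAUNCH_BLOCKING=1 is
    added so kernel launches run synchronously and the error is reported
    closer to the call that triggered it.
    """
    env = dict(os.environ if base_env is None else base_env)
    if blocking:
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = [sys.executable, "-m", "pytest", TEST_ID, "-v"]
    return cmd, env


def run_repro(blocking=True):
    """Run the test from the repository root and return pytest's exit code."""
    cmd, env = build_repro(blocking)
    return subprocess.run(cmd, env=env).returncode
```

Calling `run_repro()` from the root of a vision checkout should return 0 when the test passes and a non-zero code when it fails.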

cc @pmeier @seemethere @atalman @osalpekar

@NicolasHug
Member

NicolasHug commented Feb 10, 2023

Thank you Yosua for keeping track of this

I opened #7218 to skip this test.

This is quite bad, but I don't know how to solve it right now. I can't reproduce it on the AWS cluster either. I suspected that it could be a CUDA version issue: it passes on AWS cluster with CUDA 11.6, while our CI is 11.7 and #7214 shows that it also fails on CI x 11.8. But I can't tell whether CI x 11.6 also fails, and I can't tell whether AWS x 11.7 fails either. #7208 suggests that it's not a memory issue (although it could still be).

This could be a CUDA issue, or an upstream core issue. At this point and with all of what we have to do for the release (#7217), our last resort is to skip the test as done in #7218.
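
To illustrate what skipping a single parametrization could look like, here is a rough sketch. The actual change in #7218 is not shown in this thread, so the table `KNOWN_FAILURES` and the helper `skip_reason` are assumptions made up for this example, not the real patch.

```python
# Hypothetical table of (model, device) combinations known to fail.
# Only the combination reported in this issue is listed.
KNOWN_FAILURES = {
    ("vit_h_14", "cuda"): "CUDA error: invalid argument in backward, see #7143",
}


def skip_reason(model_name, device):
    """Return a skip message for a known-bad combination, else None."""
    return KNOWN_FAILURES.get((model_name, str(device)))
```

Inside a parametrized test one could then write `reason = skip_reason(model_name, dev)` followed by `if reason: pytest.skip(reason)`, so that every other model and the CPU run of vit_h_14 keep running.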

CC @atalman @malfet , just letting you know that this might be a core or CI issue, I'm just unable to tell at this point.

@weiwangmeta

weiwangmeta commented Apr 17, 2023

I just tried today's cu118 nightly torch and the latest vision. The test appears to pass.

Command: pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
test/test_models.py::test_classification_model[cuda-vit_h_14] PASSED

Also, looking into what is blocking #7221

@weiwangmeta update: this was not reproducible on A100

@weiwangmeta

weiwangmeta commented Apr 18, 2023

Tried cu117 as well. I cannot reproduce it either.

pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
====================================================================================================== test session starts ======================================================================================================
platform linux -- Python 3.10.10, pytest-7.3.1, pluggy-1.0.0 -- /conda/envs/vision_7143_cu117/bin/python3
cachedir: .pytest_cache
configfile: pytest.ini
plugins: mock-3.10.0
collected 1 item

test/test_models.py::test_classification_model[cuda-vit_h_14] PASSED [100%]

================================================================================================ 1 passed, 4 warnings in 25.36s ============================================================================================

pip list | grep torch
pytorch-triton 2.1.0+46672772b4
torch 2.1.0.dev20230417+cu117
torchaudio 2.1.0.dev20230417+cu117
torchvision 0.16.0a0+b78d98b /weiwangmeta/repos/vision


@weiwangmeta this was not reproducible on an A100

@weiwangmeta

We should test landing #7221

@weiwangmeta

Blocked by #7515
This should no longer be an issue. See the test signals for #7221

@weiwangmeta

#7515 was resolved, but the test signal is still showing red for one case: https://github.com/pytorch/vision/actions/runs/4729419633/jobs/8405697987?pr=7221

Continuing the investigation...

@weiwangmeta

Reproduced on an A10G (sm86).

@weiwangmeta

weiwangmeta commented Apr 18, 2023

In summary, this issue can be reproduced on an SM86 GPU (such as the A10G used by CI), but not on an SM80 GPU (such as the A100).

@weiwangmeta

Also fails on sm_89 (L40 from GCP):

(test_vision_on_L40) weiwangmeta@gh-ci-gcp-l4-5:~/vision$ pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
====================================================================================================== test session starts ======================================================================================================
platform linux -- Python 3.10.10, pytest-7.3.1, pluggy-1.0.0 -- /home/weiwangmeta/miniconda3/envs/test_vision_on_L40/bin/python3
cachedir: .pytest_cache
rootdir: /home/weiwangmeta/vision
configfile: pytest.ini
plugins: mock-3.10.0
collected 1 item

test/test_models.py::test_classification_model[cuda-vit_h_14] FAILED [100%]

@weiwangmeta

weiwangmeta commented Apr 19, 2023

Re-confirming that the A100 (SM80) passes:

(test_vision_on_A100) weiwangmeta@a100:~/vision$ pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
====================================================================================================== test session starts ======================================================================================================
platform linux -- Python 3.10.10, pytest-7.3.1, pluggy-1.0.0 -- /home/weiwangmeta/miniconda3/envs/test_vision_on_A100/bin/python3
cachedir: .pytest_cache
rootdir: /home/weiwangmeta/vision
configfile: pytest.ini
plugins: mock-3.10.0
collected 1 item

test/test_models.py::test_classification_model[cuda-vit_h_14] PASSED
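
The results reported so far can be collected into a small lookup keyed by compute capability, which makes it easy to check where a given machine falls. This is only a sketch of the observations in this thread: the names `OBSERVED`, `observed_result` and `current_capability` are made up here, and the table says nothing about architectures nobody has tested.

```python
# Results reported in this thread, keyed by CUDA compute capability.
OBSERVED = {
    (8, 0): "passed",  # A100
    (8, 6): "failed",  # A10G, the GPU used by CI
    (8, 9): "failed",  # sm_89 machine on GCP
}


def observed_result(capability):
    """Return "passed", "failed", or "unknown" for a (major, minor) pair."""
    return OBSERVED.get(tuple(capability), "unknown")


def current_capability():
    """Return the (major, minor) capability of the current GPU, or None.

    Returns None when torch is not installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    cap = current_capability()
    if cap is None:
        print("No CUDA device visible")
    else:
        print(f"sm_{cap[0]}{cap[1]}: {observed_result(cap)} in issue reports")
```

Running this on a new machine before running the test shows whether that architecture has already been covered by the reports above.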

5 participants