"CUDA error: invalid argument" on model test for vit_h_14 #7143

YosuaMichael · 2023-01-27T10:23:21Z

We got the following error on the CI:

___________________ test_classification_model[cuda-vit_h_14] ___________________
Traceback (most recent call last):
  File "/work/test/test_models.py", line 732, in test_classification_model
    _check_input_backprop(model, x)
  File "/work/test/test_models.py", line 226, in _check_input_backprop
    out[0].sum().backward()
  File "/work/ci_env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/work/ci_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

This error only happens for the vit_h_14 model on the CUDA device (CPU is fine). I also cannot reproduce it on the AWS cluster machine. The error seems to be machine- or environment-dependent, and is likely a pytorch-core issue.

Note: I tried running the test with CUDA_LAUNCH_BLOCKING=1, but the error trace is nearly identical:

___________________ test_classification_model[cuda-vit_h_14] ___________________
Traceback (most recent call last):
  File "/work/test/test_models.py", line 725, in test_classification_model
    _check_input_backprop(model, x)
  File "/work/test/test_models.py", line 225, in _check_input_backprop
    out[0].sum().backward()
  File "/work/ci_env/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/work/ci_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
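
For anyone who wants to repeat the CUDA_LAUNCH_BLOCKING=1 run locally, here is a minimal sketch. It assumes the repository root is the working directory and that pytest is installed; the helper names (`blocking_env`, `pytest_command`, `run_blocking`) are made up for illustration and are not part of the torchvision test suite. The test is launched in a child process so the variable is already set when CUDA initializes.

```python
import os
import subprocess
import sys

# Test ID taken from the traceback header; the path prefix assumes the
# command is run from the repository root.
TEST_ID = "test/test_models.py::test_classification_model[cuda-vit_h_14]"


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def pytest_command(test_id=TEST_ID):
    """Build the pytest invocation for a single parametrized test."""
    return [sys.executable, "-m", "pytest", test_id, "-v"]


def run_blocking(test_id=TEST_ID):
    """Run the test in a child process and return pytest's exit code."""
    return subprocess.run(pytest_command(test_id), env=blocking_env()).returncode
```

Calling `run_blocking()` on a machine with a GPU runs only this one test; a non-zero return code means pytest reported a failure or error.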

cc @pmeier @seemethere @atalman @osalpekar

NicolasHug · 2023-02-10T15:29:00Z

Thank you Yosua for keeping track of this

I opened #7218 to skip this test.

This is quite bad, but I don't know how to solve it right now. I can't reproduce it on the AWS cluster either. I suspected a CUDA version issue: the test passes on the AWS cluster with CUDA 11.6, while our CI uses 11.7, and #7214 shows that it also fails on CI x 11.8. But I can't tell whether CI x 11.6 also fails, nor whether AWS x 11.7 fails. #7208 suggests that it's not a memory issue (although it could still be).
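
To make the gap in the evidence explicit, here is a small sketch that tabulates the machine x CUDA version combinations. The three observed results come from the comment above; treating {AWS, CI} x {11.6, 11.7, 11.8} as the full grid is an assumption made for illustration, and the function names are hypothetical.

```python
from itertools import product

# Assumed grid: the two environments and three CUDA versions named above.
MACHINES = ("AWS", "CI")
CUDA_VERSIONS = ("11.6", "11.7", "11.8")

# Only the results reported in the comment above.
OBSERVED = {
    ("AWS", "11.6"): "pass",
    ("CI", "11.7"): "fail",
    ("CI", "11.8"): "fail",
}


def untested(observed, machines=MACHINES, versions=CUDA_VERSIONS):
    """List the (machine, version) cells that have no reported result."""
    return [cell for cell in product(machines, versions) if cell not in observed]


def versions_tested_on_both(observed, machines=MACHINES):
    """List CUDA versions that have a result on every machine."""
    seen = {}
    for machine, version in observed:
        seen.setdefault(version, set()).add(machine)
    return sorted(v for v, ms in seen.items() if ms == set(machines))


print(untested(OBSERVED))
print(versions_tested_on_both(OBSERVED))
```

The second list is empty: no CUDA version has a result on both machines, so the observed results cannot separate a CUDA version effect from a machine effect.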

This could be a CUDA issue, or an upstream core issue. At this point and with all of what we have to do for the release (#7217), our last resort is to skip the test as done in #7218.

CC @atalman @malfet, just letting you know that this might be a core or CI issue; I'm unable to tell at this point.

weiwangmeta · 2023-04-17T19:59:38Z

I just tried today's cu118 nightly torch and the latest vision. The test appears to pass.

Command: pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
test/test_models.py::test_classification_model[cuda-vit_h_14] PASSED

Also, looking into what is blocking #7221

@weiwangmeta update: this was not reproducible on A100

weiwangmeta · 2023-04-18T06:47:47Z

Tried cu117 as well. I cannot reproduce it either.

pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
====================================================================================================== test session starts ======================================================================================================
platform linux -- Python 3.10.10, pytest-7.3.1, pluggy-1.0.0 -- /conda/envs/vision_7143_cu117/bin/python3
cachedir: .pytest_cache
configfile: pytest.ini
plugins: mock-3.10.0
collected 1 item

test/test_models.py::test_classification_model[cuda-vit_h_14] PASSED [100%]

================================================================================================ 1 passed, 4 warnings in 25.36s ============================================================================================

pip list |grep torch
pytorch-triton 2.1.0+46672772b4
torch 2.1.0.dev20230417+cu117
torchaudio 2.1.0.dev20230417+cu117
torchvision 0.16.0a0+b78d98b /weiwangmeta/repos/vision

@weiwangmeta this was not reproducible on an A100

weiwangmeta · 2023-04-18T06:48:27Z

We should test landing #7221

weiwangmeta · 2023-04-18T17:05:30Z

Blocked by #7515
This should no longer be an issue. See the test signals for #7221

weiwangmeta · 2023-04-18T18:21:47Z

#7515 was resolved, but the test signal is still red for one case: https://github.com/pytorch/vision/actions/runs/4729419633/jobs/8405697987?pr=7221

Continuing the investigation...

weiwangmeta · 2023-04-18T22:06:44Z

Reproduced on an A10G (sm86).

weiwangmeta · 2023-04-18T22:14:41Z

In summary, this issue can be reproduced on an SM86 GPU (such as the A10G used by CI), but not on an SM80 GPU (such as the A100).

weiwangmeta · 2023-04-18T23:18:39Z

Also fails on sm_89 (L40 from GCP):

(test_vision_on_L40) weiwangmeta@gh-ci-gcp-l4-5:~/vision$ pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
====================================================================================================== test session starts ======================================================================================================
platform linux -- Python 3.10.10, pytest-7.3.1, pluggy-1.0.0 -- /home/weiwangmeta/miniconda3/envs/test_vision_on_L40/bin/python3
cachedir: .pytest_cache
rootdir: /home/weiwangmeta/vision
configfile: pytest.ini
plugins: mock-3.10.0
collected 1 item

test/test_models.py::test_classification_model[cuda-vit_h_14] FAILED [100%]

weiwangmeta · 2023-04-19T01:38:31Z

Re-confirming that the A100 (SM80) passes:

(test_vision_on_A100) weiwangmeta@a100:~/vision$ pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
====================================================================================================== test session starts ======================================================================================================
platform linux -- Python 3.10.10, pytest-7.3.1, pluggy-1.0.0 -- /home/weiwangmeta/miniconda3/envs/test_vision_on_A100/bin/python3
cachedir: .pytest_cache
rootdir: /home/weiwangmeta/vision
configfile: pytest.ini
plugins: mock-3.10.0
collected 1 item

test/test_models.py::test_classification_model[cuda-vit_h_14] PASSED
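
The per-architecture results in this thread can be collected into a small lookup keyed by compute capability. This is only a sketch: the table contains just the three architectures reported here (SM80, SM86, sm_89), the function names are hypothetical, and nothing is implied about architectures that were not tested. `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple, which is the key format assumed below.

```python
from typing import Optional, Tuple

# Results reported in this thread only, keyed by (major, minor) capability.
OBSERVED_BY_CAPABILITY = {
    (8, 0): "pass",  # A100
    (8, 6): "fail",  # A10G, the GPU used by CI
    (8, 9): "fail",  # sm_89 machine on GCP
}


def observed_outcome(capability: Tuple[int, int]) -> Optional[str]:
    """Return the outcome reported in this thread, or None if untested."""
    return OBSERVED_BY_CAPABILITY.get(tuple(capability))


def current_capability() -> Optional[Tuple[int, int]]:
    """Return the local GPU's capability, or None without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()
```

For example, `observed_outcome(current_capability())` on a CUDA machine tells you whether your GPU matches one of the architectures already reported; `None` from `observed_outcome` means the thread has no data for that GPU.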

YosuaMichael added module: tests module: ci core issue labels Jan 27, 2023

YosuaMichael mentioned this issue Jan 27, 2023
Use real weight and image for classification model test and relaxing precision requirement for general model tests #7130 (Open)

This was referenced Feb 9, 2023
Put back previous tolerance for test_classification and test_video #7202 (Merged)
TODOs before 0.15 release #7217 (Closed)

NicolasHug changed the title ~~"CUDA error: invalid argument" on model test~~ "CUDA error: invalid argument" on model test for vit_h_14 Feb 10, 2023

NicolasHug mentioned this issue Feb 10, 2023
Skip model test for vit_h_14 #7218 (Merged)

weiwangmeta assigned weiwangmeta and osalpekar and unassigned weiwangmeta Feb 21, 2023

izaitsevfb assigned malfet Mar 6, 2023

jeanschmidt assigned weiwangmeta Apr 3, 2023

atalman mentioned this issue May 23, 2023
"CUDA error, AssertionError: Tensor-likes are not close!" on model test for cuda-resnet101 #7618 (Closed)
