"CUDA error: invalid argument" on model test for vit_h_14 #7143
Comments
Thank you Yosua for keeping track of this. I opened #7218 to skip this test. This is quite bad, but I don't know how to solve it right now, and I can't reproduce it on the AWS cluster either. I suspected it could be a CUDA version issue: the test passes on the AWS cluster with CUDA 11.6, while our CI uses 11.7, and #7214 shows that it also fails on CI with 11.8. But I can't tell whether CI with 11.6 also fails, nor whether AWS with 11.7 fails. #7208 suggests that it's not a memory issue (although it could still be). This could be a CUDA issue or an upstream core issue. At this point, and with everything we have to do for the release (#7217), our last resort is to skip the test as done in #7218. CC @atalman @malfet, just letting you know that this might be a core or CI issue; I'm unable to tell at this point.
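For context, a minimal sketch of what skipping only the failing case of a parametrized test could look like, assuming a plain pytest parametrization; the actual diff in #7218 may differ:

```python
# Minimal sketch (not the actual #7218 change) of skipping only the failing
# cuda / vit_h_14 combination of a parametrized classification test.
import pytest
import torch


@pytest.mark.parametrize("device", ["cpu", "cuda"])
@pytest.mark.parametrize("model_name", ["vit_b_16", "vit_h_14"])
def test_classification_model(model_name, device):
    if model_name == "vit_h_14" and device == "cuda":
        pytest.skip("CUDA error: invalid argument on CI, see gh issue #7143")
    if device == "cuda" and not torch.cuda.is_available():
        pytest.skip("CUDA is not available")
    # ... the real test builds the model and checks a forward pass here ...
```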
I just tried today's cu118 nightly torch with the latest vision, and the test seems to pass. Command:

pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v

Also looking into what is blocking #7221. @weiwangmeta update: this was not reproducible on an A100.
Tried cu117 as well; I cannot reproduce it either.

pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
test/test_models.py::test_classification_model[cuda-vit_h_14] PASSED [100%]
1 passed, 4 warnings in 25.36s

pip list | grep torch

@weiwangmeta this was not reproducible on an A100.
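Since these reports hinge on which torch/CUDA build is installed, here is a small hedged sketch of checking that from Python, similar in spirit to the pip list | grep torch above; it is a convenience snippet, not part of the torchvision test suite:

```python
# Report the installed torch/torchvision builds and the CUDA toolkit they
# were built against, plus the local GPU if one is visible.
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("built with CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```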
We should test landing #7221.
#7515 was resolved, but the test signal is still red for one case: https://github.com/pytorch/vision/actions/runs/4729419633/jobs/8405697987?pr=7221. Continuing the investigation...
Reproduced on an A10G (sm86).
In summary, this issue can be reproduced on an SM86 GPU (like the A10G used by CI), but not on an SM80 GPU (like the A100).
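For anyone trying to tell which group their machine falls into, a hedged sketch of printing the GPU's compute capability (the sm_XY number used above) via the public torch API:

```python
# Print the compute capability of the local GPU, e.g. sm_80 for an A100
# or sm_86 for an A10G. Convenience sketch, not part of the test suite.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")
else:
    print("CUDA is not available on this machine")
```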
Also fails on sm_89 (L40 from GCP):

(test_vision_on_L40) weiwangmeta@gh-ci-gcp-l4-5:~/vision$ pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
test/test_models.py::test_classification_model[cuda-vit_h_14] FAILED [100%]
Re-confirming that the A100 (SM80) passes:

(test_vision_on_A100) weiwangmeta@a100:~/vision$ pytest test/test_models.py::test_classification_model[cuda-vit_h_14] -v
test/test_models.py::test_classification_model[cuda-vit_h_14] PASSED
We got the following error on the CI:

This error only happens for the vit_h_14 model on a CUDA device (the CPU is fine). I also cannot reproduce the error on an AWS cluster machine, so it seems to be machine- or environment-dependent and likely a pytorch-core issue.

Note: I have tried running the test with CUDA_LAUNCH_BLOCKING=1, but the error trace looks pretty similar (see here).

cc @pmeier @seemethere @atalman @osalpekar
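For reference, a rough standalone reproduction sketch in Python, assuming the default vit_h_14 constructor and a 224x224 input; the real test has its own inputs and checks, and CUDA_LAUNCH_BLOCKING must be set before the first CUDA call for the synchronous traces to be useful:

```python
# Rough reproduction sketch (assumption: default vit_h_14 arguments), forcing
# synchronous kernel launches so a failure points at the offending kernel.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch
from torchvision.models import vit_h_14

model = vit_h_14().cuda().eval()  # no pretrained weights needed for a forward pass
with torch.inference_mode():
    out = model(torch.rand(1, 3, 224, 224, device="cuda"))
print(out.shape)  # expected: torch.Size([1, 1000])
```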