Intermittent multiprocessing error on google cloud TPU #3947
Error 1 (Full):
Error 2 (Full):
Found some additional information in the PyTorch Lightning docs section on TPUs, which mentions that you should not call
Error 3 (Snippet):
Error 4 (Snippet):
Error 3 (Full):
The above error is repeated for each TPU device.
Error 4 (Full):
The following error is repeated multiple times throughout training. The progress and training-completion print statements are also printed multiple times, so it looks like the model is being trained separately on each core. The training cell then gets stuck after the completion messages are printed.
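For reference, duplicated progress output in a multi-core run is usually just every spawned process printing on its own, and each core is expected to train on its own shard of the data. A minimal sketch of that pattern (an assumed illustration, not the reporter's code) looks like this:

```python
# Sketch: print from one core only and shard the data per core
# (assumed torch_xla pattern, not code taken from this issue).
import torch
import torch_xla.core.xla_model as xm

def log(msg):
    # Only ordinal 0 prints, so progress messages are not duplicated per core.
    xm.master_print(msg)

def make_sampler(dataset):
    # Each spawned process draws a distinct shard of the dataset.
    return torch.utils.data.distributed.DistributedSampler(
        dataset,
        num_replicas=xm.xrt_world_size(),
        rank=xm.get_ordinal(),
        shuffle=True,
    )
```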
Thanks for reporting. Before we dive too deep into the error, can I get some information first?
Sure! I think @markcoatsworth could provide some more information about our setup.
I am experiencing the
maybe just
Hi @JackCaoG, sorry for the slow reply! We're using TPU VMs. I'm not sure if it's important, but we created them from the gcloud CLI instead of the web console, using the following command:
We're using PyTorch and PyTorch/XLA 1.12, using the
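For reference, a quick way to confirm the installed versions and visible XLA devices on such a TPU VM is something like the following (an illustrative check, not a command from this thread):

```python
# Illustrative environment check (assumed, not taken from the thread).
import torch
import torch_xla
import torch_xla.core.xla_model as xm

print("torch:", torch.__version__)
print("torch_xla:", torch_xla.__version__)
print("devices:", xm.get_xla_supported_devices())
```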
suggested that one of the processes (cores) crashed and the others can't reach it. You encountered a couple of errors above; were you able to get it to run eventually? Also, are you using PyTorch Lightning? As a sanity test, can you run our resnet test after
and see if it will run? I am trying to figure out whether it is a model-code issue or an infra/config issue. The default config you should use is
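For reference, a minimal multi-core smoke test in the same spirit might look like the sketch below (an assumed stand-in, not the resnet test or config referenced above, which were not captured here):

```python
# Minimal 8-core smoke test (assumed sketch, not the pytorch/xla resnet test
# mentioned above). If this also hangs or crashes intermittently, the issue
# is more likely infra/config than the model code.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    t = torch.ones(2, 2, device=device)
    xm.rendezvous("sanity")  # every core must reach this barrier
    print(f"core {index} on {device}: sum = {t.sum().item()}")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)
```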
❓ Questions and Help
Hello, I've been trying to run a basic MNIST training example on 8 TPU cores on Google Cloud, but periodically run into the following errors when running the xmp.spawn function to begin training:
Error 1 (Snippet):
Error 2 (Snippet):
Strangely, these errors will disappear for a while and the code will run fine, and then suddenly pop back up again. The same code was previously running on 8 cores just over an hour ago. Also worth noting is that training seems to work fine on 1 core, with no errors. I'll include the full error logs below.
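For context, the launch being described follows the usual torch_xla multiprocessing pattern, roughly like the sketch below (an illustrative stand-in with a toy model and fake data, not the exact training code from this report):

```python
# Rough shape of the 8-core xmp.spawn launch (illustrative only; the model,
# optimizer, and data here are stand-ins, not taken from the issue).
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Fake MNIST-shaped data so the sketch is self-contained; a real run
    # would use torchvision's MNIST with a per-core DistributedSampler.
    dataset = torch.utils.data.TensorDataset(
        torch.randn(512, 1, 28, 28), torch.randint(0, 10, (512,)))
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
    device_loader = pl.MpDeviceLoader(loader, device)

    model.train()
    for x, y in device_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        xm.optimizer_step(optimizer)  # all-reduce gradients across cores
    xm.master_print("training loop finished")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, nprocs=8)
```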
Any help would be greatly appreciated!