DDP: why does every process allocate memory of GPU 0 and how to avoid it? #969
Run this example with 2 GPUs: process 2 will allocate some memory on GPU 0. I have carefully checked the sample code, and there seems to be no obvious error that would cause process 2 to transfer data to GPU 0. So: why does every process allocate memory on GPU 0, and how can it be avoided?

Thanks in advance to everyone in the PyTorch community for their hard work.

Comments
(Referring to line 310 at commit 0cb38eb.) When I remove this line, process 1 no longer allocates memory on GPU 0, so the allocation seems to happen during error backpropagation. Does anyone have any insights?
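To pin down where each process allocates, it can help to compare nvidia-smi (which shows each process's full footprint on every GPU, including the CUDA context itself) with the caching allocator's counters. A small illustrative helper, not from the thread:

```python
import torch

def report(rank: int) -> None:
    # Only counts tensors allocated by this process's caching allocator;
    # the CUDA context itself (often several hundred MB) will not show
    # up here, only in nvidia-smi.
    for i in range(torch.cuda.device_count()):
        mib = torch.cuda.memory_allocated(i) / 2**20
        print(f"[rank {rank}] cuda:{i}: {mib:.1f} MiB allocated")
```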
Maybe you are using torch.load() without map_location=lambda storage, loc: storage. If the checkpoint was saved with tensors on a different GPU, torch.load() will restore them onto that same GPU, allocating memory there.
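A minimal sketch of that map_location fix; the checkpoint path and the rank variable below are illustrative, not from the thread:

```python
import torch

rank = 1  # e.g. this process's local rank

# Without map_location, tensors saved from cuda:0 are restored onto
# cuda:0, which silently creates a CUDA context there in every process.
checkpoint = torch.load("checkpoint.pt", map_location=lambda storage, loc: storage)

# Equivalently, map the checkpoint straight onto this process's own GPU:
checkpoint = torch.load("checkpoint.pt", map_location=f"cuda:{rank}")
```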
See this thread: https://discuss.pytorch.org/t/extra-10gb-memory-on-gpu-0-in-ddp-tutorial/118113. The fix is to call torch.cuda.set_device(rank) in each process.
This still doesn't seem to be helping in my case :-(
Just had the same problem and debugged it. You need to put torch.cuda.set_device(rank) at the start of each worker, before any other CUDA call.
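A minimal DDP worker sketch with that fix, assuming 2 GPUs; the names (run_worker, the port, the toy model) are illustrative and not from the thread:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run_worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"

    # Must come before any CUDA work in this process, so that collective
    # ops and backward allocate on this process's own GPU instead of
    # defaulting to cuda:0.
    torch.cuda.set_device(rank)

    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    model = nn.Linear(10, 10).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    out = ddp_model(torch.randn(8, 10, device=rank))
    out.sum().backward()  # without set_device, this step could touch GPU 0

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run_worker, args=(world_size,), nprocs=world_size)
```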
Yes, it helped me.
This solves the "extra 10GB memory on GPU 0" issue (pytorch/examples#969) and is now the recommended way (https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html)
See pytorch#969. Setting CUDA_VISIBLE_DEVICES is the recommended way to handle DDP device ids now (https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html)
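A hedged sketch of the CUDA_VISIBLE_DEVICES approach, assuming one GPU per process; the launcher below is illustrative, not the thread's code. Pinning the variable before CUDA initializes means each worker can only ever see its own GPU, so no process can create a context on the physical GPU 0 by accident:

```python
import multiprocessing as mp
import os

def worker(rank):
    # Must run before the first CUDA call in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

    import torch  # imported after the env var, so CUDA sees one device

    device = torch.device("cuda:0")  # this is physical GPU `rank`
    x = torch.ones(1, device=device)
    print(f"rank {rank}: {torch.cuda.device_count()} visible GPU(s)")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # avoid forking a CUDA-initialized parent
    procs = [ctx.Process(target=worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```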