
DDP: why does every process allocate memory of GPU 0 and how to avoid it? #969


siaimes opened this issue Mar 8, 2022 · 7 comments

@siaimes
Contributor

siaimes commented Mar 8, 2022

Run this example with 2 GPUs; process 2 will allocate some memory on GPU 0.

python main.py --multiprocessing-distributed --world-size 1 --rank 0

[screenshot: GPU memory usage showing process 2 with memory allocated on GPU 0]

I have carefully checked the sample code and there seems to be no obvious error that would cause process 2 to transfer data to GPU 0.

So:

  1. Why does process 2 allocate memory of GPU 0?
  2. Is this part of the data involved in the computation? If it is, won't GPU 0 become seriously overloaded as the number of processes grows?
  3. Is there any way to avoid it?

Thanks in advance to everyone in the PyTorch community for their hard work.

@siaimes
Contributor Author

siaimes commented Mar 18, 2022

loss.backward()

When I remove this line, process 1 no longer allocates memory on GPU 0, so it all happens during backpropagation.

Does anyone have some insights?

@GongZhengLi

Maybe you are calling torch.load() without map_location=lambda storage, loc: storage. If the original checkpoint saved its tensors on a particular GPU, torch.load() will restore them onto that same GPU, allocating memory there.
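For illustration, a minimal sketch of that suggestion; "checkpoint.pth" and local_rank are placeholder names, not part of the original example:

import torch

# Without map_location, tensors saved from a particular GPU are restored onto
# that same GPU (often cuda:0), allocating memory there even in a process that
# should only use its own device.
ckpt = torch.load("checkpoint.pth", map_location=lambda storage, loc: storage)

# Alternatively, map everything directly onto this process's own device:
# ckpt = torch.load("checkpoint.pth", map_location=f"cuda:{local_rank}")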

@kensun0

kensun0 commented Oct 18, 2022

This thread solved the problem for me: https://discuss.pytorch.org/t/extra-10gb-memory-on-gpu-0-in-ddp-tutorial/118113

torch.cuda.set_device(rank)
torch.cuda.empty_cache()

@bhattg

bhattg commented Feb 14, 2023

This still doesn't seem to be helping in my case :-(

@hieuhoang

hieuhoang commented Dec 3, 2023

Just had the same problem and debugged it. You need to put torch.cuda.set_device(rank) before dist.init_process_group().
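A minimal sketch of that ordering, assuming a spawned per-process worker (the worker function and the Linear model are illustrative, not the original examples/main.py code, and MASTER_ADDR/MASTER_PORT are assumed to be set in the environment):

import torch
import torch.distributed as dist

def worker(rank: int, world_size: int):
    # Select cuda:rank *before* the process group is created, so the NCCL
    # context for this process is never initialized on GPU 0.
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = torch.nn.Linear(10, 10).cuda(rank)  # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])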

@d-01

d-01 commented Aug 22, 2024

Minimal reproducible example (torch version: 2.3.0+cu118):

# distributed.py
import os
import time
import torch
import torch.distributed as dist

rank, world_size = int(os.environ['RANK']), int(os.environ['WORLD_SIZE'])
dist.init_process_group("nccl", rank=rank, world_size=world_size)
dist.barrier()
time.sleep(36000)  # keep the processes alive so GPU memory usage can be inspected

# To run:
# $ torchrun --nproc-per-node=8 --nnodes=1 distributed.py

Same symptoms: each process allocates memory on its own GPU and, for some reason, on GPU 0 as well.
[screenshot: per-process GPU memory usage, showing extra allocations on GPU 0]

I found several solutions to this problem (a sketch applying them follows below):

  1. Set the device with a with torch.cuda.device(rank): context (or the discouraged global torch.cuda.set_device(rank)) before the first distributed operation (dist.barrier() in this case); it can go before or after init_process_group().
  2. Set the environment variable os.environ['CUDA_VISIBLE_DEVICES'] = os.environ['RANK'] before init_process_group().

PS: Presumably (not tested), in a multi-node environment LOCAL_RANK has to be used instead of RANK.
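A sketch of the reproducible example above with those fixes applied; they are alternatives, so the CUDA_VISIBLE_DEVICES variant is left commented out, and distributed_fixed.py is an illustrative filename:

# distributed_fixed.py
import os
import time
import torch
import torch.distributed as dist

rank, world_size = int(os.environ['RANK']), int(os.environ['WORLD_SIZE'])

# Fix 2 (alternative): expose only this rank's GPU to the process, so every
# CUDA call lands on it as cuda:0. Must run before any CUDA initialization.
# os.environ['CUDA_VISIBLE_DEVICES'] = os.environ['RANK']

dist.init_process_group("nccl", rank=rank, world_size=world_size)

# Fix 1: select this rank's GPU before the first collective, so the NCCL
# context is created on cuda:rank instead of cuda:0.
with torch.cuda.device(rank):
    dist.barrier()
    time.sleep(36000)

# To run:
# $ torchrun --nproc-per-node=8 --nnodes=1 distributed_fixed.py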

@gancx

gancx commented Sep 22, 2024

Just had the same problem and debugged it. You need to put torch.cuda.set_device(rank) before dist.init_process_group()

Yes, this helped me.

EIFY added a commit to EIFY/mup-vit that referenced this issue Nov 30, 2024
EIFY added a commit to EIFY/examples that referenced this issue Nov 30, 2024
See pytorch#969
Setting CUDA_VISIBLE_DEVICES is the recommended way to handle DDP
device ids now (https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html)