
DDP: why does every process allocate memory of GPU 0 and how to avoid it? #969


siaimes opened this issue Mar 8, 2022 · 7 comments

@siaimes
Contributor

siaimes commented Mar 8, 2022

Run this example with 2 GPUs; process 2 will allocate some memory on GPU 0.

python main.py --multiprocessing-distributed --world-size 1 --rank 0

[screenshot: GPU memory usage showing process 2 with memory allocated on GPU 0]

I have carefully checked the sample code and there seems to be no obvious error that would cause process 2 to transfer data to GPU 0.

So:

  1. Why does process 2 allocate memory of GPU 0?
  2. Is this part of the data involved in the computation? If it is, won't GPU 0 become seriously overloaded as the number of processes grows?
  3. Is there any way to avoid it?

Thanks in advance to everyone in the PyTorch community for their hard work.

@siaimes
Contributor Author

siaimes commented Mar 18, 2022

loss.backward()

When I remove this line, process 1 no longer allocates memory on GPU 0, so it all happens during backpropagation.

Does anyone have some insights?

@GongZhengLi

Maybe you are calling torch.load() without map_location=lambda storage, loc: storage. If the original checkpoint saved its tensors on a particular GPU, torch.load() will restore them onto that same GPU, allocating memory there.
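For illustration, a minimal sketch of that suggestion; "checkpoint.pth" and local_rank are placeholder names, not part of the original example:

import torch

# Without map_location, tensors saved from a particular GPU are restored onto
# that same GPU (often cuda:0), allocating memory there even in a process that
# should only use its own device.
ckpt = torch.load("checkpoint.pth", map_location=lambda storage, loc: storage)

# Alternatively, map everything directly onto this process's own device:
# ckpt = torch.load("checkpoint.pth", map_location=f"cuda:{local_rank}")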

@kensun0

kensun0 commented Oct 18, 2022

This thread solved the problem for me: https://discuss.pytorch.org/t/extra-10gb-memory-on-gpu-0-in-ddp-tutorial/118113

torch.cuda.set_device(rank)
torch.cuda.empty_cache()

@bhattg

bhattg commented Feb 14, 2023

This still doesn't seem to be helping in my case :-(

@hieuhoang

hieuhoang commented Dec 3, 2023

Just had the same problem and debugged it. You need to put torch.cuda.set_device(rank) before dist.init_process_group().
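A minimal sketch of that ordering, assuming a spawned per-process worker (the worker function and the Linear model are illustrative, not the original examples/main.py code, and MASTER_ADDR/MASTER_PORT are assumed to be set in the environment):

import torch
import torch.distributed as dist

def worker(rank: int, world_size: int):
    # Select cuda:rank *before* the process group is created, so the NCCL
    # context for this process is never initialized on GPU 0.
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = torch.nn.Linear(10, 10).cuda(rank)  # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])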

@d-01

d-01 commented Aug 22, 2024

Minimal reproducible example (torch version: 2.3.0+cu118):

# distributed.py
import os
import time
import torch
import torch.distributed as dist

rank, world_size = int(os.environ['RANK']), int(os.environ['WORLD_SIZE'])
dist.init_process_group("nccl", rank=rank, world_size=world_size)
dist.barrier()
time.sleep(36000)  # keep the processes alive so GPU memory usage can be inspected

# To run:
# $ torchrun --nproc-per-node=8 --nnodes=1 distributed.py

Same symptoms: each process allocates memory on its own GPU and, for some reason, on GPU 0 as well.
[screenshot: per-process GPU memory usage, showing extra allocations on GPU 0]

I found several solutions to this problem (a sketch applying them follows below):

  1. Set the device with a with torch.cuda.device(rank): context (or the discouraged global torch.cuda.set_device(rank)) before the first distributed operation (dist.barrier() in this case); it can go before or after init_process_group().
  2. Set the environment variable os.environ['CUDA_VISIBLE_DEVICES'] = os.environ['RANK'] before init_process_group().

PS: Presumably (not tested), in a multi-node environment LOCAL_RANK has to be used instead of RANK.
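A sketch of the reproducible example above with those fixes applied; they are alternatives, so the CUDA_VISIBLE_DEVICES variant is left commented out, and distributed_fixed.py is an illustrative filename:

# distributed_fixed.py
import os
import time
import torch
import torch.distributed as dist

rank, world_size = int(os.environ['RANK']), int(os.environ['WORLD_SIZE'])

# Fix 2 (alternative): expose only this rank's GPU to the process, so every
# CUDA call lands on it as cuda:0. Must run before any CUDA initialization.
# os.environ['CUDA_VISIBLE_DEVICES'] = os.environ['RANK']

dist.init_process_group("nccl", rank=rank, world_size=world_size)

# Fix 1: select this rank's GPU before the first collective, so the NCCL
# context is created on cuda:rank instead of cuda:0.
with torch.cuda.device(rank):
    dist.barrier()
    time.sleep(36000)

# To run:
# $ torchrun --nproc-per-node=8 --nnodes=1 distributed_fixed.py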

@gancx

gancx commented Sep 22, 2024

Just had the same problem and debugged it. You need to put torch.cuda.set_device(rank) before dist.init_process_group()

Yes, this helped me.

EIFY added a commit to EIFY/mup-vit that referenced this issue Nov 30, 2024
EIFY added a commit to EIFY/examples that referenced this issue Nov 30, 2024
See pytorch#969
Setting CUDA_VISIBLE_DEVICES is the recommended way to handle DDP
device ids now (https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html)