Add timeout to `init_process_group` in `entrypoint` #43

apoorvkh · 2024-07-16T18:42:36Z

Question: what's the longest a distributed operation should reasonably take?
How long would it take to "all-gather" a large amount of memory (like 80 GB)?
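
For a rough sense of scale, here is a back-of-envelope sketch. It assumes a ring all-gather and counts only the bandwidth term (no latency, no stragglers). The 8 ranks and 25 GB/s per-link bandwidth are made-up illustrative numbers, not measurements from this project.

```python
def ring_all_gather_seconds(total_bytes: float, world_size: int, link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate for a ring all-gather (ignores latency and slow ranks)."""
    if world_size < 1 or link_bytes_per_s <= 0:
        raise ValueError("world_size must be >= 1 and bandwidth must be positive")
    # Each rank already holds 1/world_size of the result and must receive the rest.
    fraction_received = (world_size - 1) / world_size
    return fraction_received * total_bytes / link_bytes_per_s


GB = 1e9
# Hypothetical setup: 80 GB gathered result, 8 ranks, 25 GB/s per link.
estimate = ring_all_gather_seconds(80 * GB, world_size=8, link_bytes_per_s=25 * GB)
print(f"{estimate:.1f} s")
```

Under these assumptions the transfer itself is a matter of seconds, so a long wait inside a collective is more likely caused by one rank arriving late than by the data volume alone.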

Let's set a smaller default timeout... maybe 180 seconds?
And then we can pass an argument to override this.

torchrunx/src/torchrunx/agent.py

Lines 83 to 85 in cd1a895

    
    dist.init_process_group(
        backend=backend, world_size=worker_args.world_size, rank=worker_args.rank, store=store
    )

apoorvkh · 2024-07-16T18:46:17Z

pytorch/pytorch#13056 (comment)

apoorvkh · 2024-07-16T18:52:13Z

I guess they added `monitored_barrier`, but it only works with the Gloo backend.

https://pytorch.org/docs/stable/distributed.html#torch.distributed.monitored_barrier

I still think we can reduce this default, but allow users to override if desired.

pmcurtin · 2024-07-17T15:59:47Z

> pytorch/pytorch#13056 (comment)

Oh yeah, that'd be great. I was thinking barriers might be an issue. I think this is a PyTorch issue really.

> I still think we can reduce this default, but allow users to override if desired.

Yeah that sounds reasonable.

apoorvkh · 2024-07-17T16:06:43Z

Let's actually keep the same default but allow users to override :)
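
A minimal sketch of what "same default, optional override" could look like: forward `timeout` to `init_process_group` only when the user supplies a value, so PyTorch's own default applies otherwise. The helper name and the `pg_timeout_seconds` parameter are hypothetical and are not taken from the actual change in #44.

```python
from datetime import timedelta
from typing import Any, Dict, Optional


def init_process_group_kwargs(
    backend: str,
    world_size: int,
    rank: int,
    store: Any = None,
    pg_timeout_seconds: Optional[float] = None,  # hypothetical user-facing override
) -> Dict[str, Any]:
    """Build keyword arguments for torch.distributed.init_process_group."""
    kwargs: Dict[str, Any] = {
        "backend": backend,
        "world_size": world_size,
        "rank": rank,
        "store": store,
    }
    # Only set `timeout` when overridden; omitting it keeps PyTorch's default.
    if pg_timeout_seconds is not None:
        if pg_timeout_seconds <= 0:
            raise ValueError("pg_timeout_seconds must be positive")
        kwargs["timeout"] = timedelta(seconds=pg_timeout_seconds)
    return kwargs


default_kwargs = init_process_group_kwargs("gloo", world_size=2, rank=0)
override_kwargs = init_process_group_kwargs("gloo", world_size=2, rank=0, pg_timeout_seconds=180)
```

In the agent, the result would then be passed along as `dist.init_process_group(**kwargs)`, with `dist` being `torch.distributed`.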

pmcurtin mentioned this issue Jul 17, 2024

add pg_timeout flag #44

Merged

apoorvkh linked a pull request Jul 22, 2024 that will close this issue

add pg_timeout flag #44

Merged

apoorvkh closed this as completed Jul 22, 2024
