Add timeout to init_process_group in entrypoint #43
Comments
I guess they added https://pytorch.org/docs/stable/distributed.html#torch.distributed.monitored_barrier. I still think we can reduce this default, but allow users to override if desired.
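For reference, a minimal sketch (not from this thread) of how `monitored_barrier` can stand in for a plain barrier, so a hang surfaces as an error naming the unresponsive ranks; note it is only supported on the GLOO backend:

```python
from datetime import timedelta

import torch.distributed as dist

# Assumes the usual MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE
# environment variables are set by the launcher.
dist.init_process_group(backend="gloo")

# Unlike dist.barrier(), this raises a RuntimeError identifying the
# ranks that failed to respond within the timeout, instead of hanging.
dist.monitored_barrier(timeout=timedelta(seconds=30))
```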
Oh yeah, that'd be great. I was thinking barriers might be an issue. I think this is really a PyTorch issue.
Yeah that sounds reasonable.
Let's actually keep the same default but allow users to override :)
Question: what's the longest a distributed operation should reasonably take?
How long would it take to "all-gather" a large amount of memory (like 80 GB)?
Let's set a smaller default timeout... maybe 180 seconds?
And then we can pass an argument to override this.
torchrunx/src/torchrunx/agent.py, lines 83 to 85 in cd1a895
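As a rough back-of-envelope for the 80 GB question: at an effective ~10 GB/s of interconnect bandwidth, moving 80 GB takes on the order of 8 seconds, so 180 s leaves headroom on fast interconnects but could be tight over commodity Ethernet. Below is a hypothetical sketch of the proposal, keeping a default timeout but letting callers override it; the `init_distributed` wrapper and its 180-second default are illustrative, not torchrunx's actual API:

```python
from datetime import timedelta

import torch.distributed as dist

def init_distributed(
    backend: str = "nccl",
    timeout: timedelta = timedelta(seconds=180),  # illustrative default
) -> None:
    # PyTorch's own default timeout is 30 minutes; passing a smaller
    # value makes stuck collectives fail fast instead of hanging.
    dist.init_process_group(backend=backend, timeout=timeout)

# Users with slow interconnects or very large all-gathers can override:
# init_distributed(timeout=timedelta(minutes=30))
```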