fix: extend nccl timeout #507
Conversation
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
@cdoern Why do we need to set the NCCL timeout to be so high? If NCCL is hanging, we would want it to fail sooner rather than later, except when debugging. It seems like maybe I'm missing some context here.
e2e workflow succeeded on this PR: View run, congrats!
@RobotSail, I was just testing based on some things I read online; I am going to lower the value until I find a suitable candidate.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
For context @RobotSail: it seems FSDP (particularly FSDP2, if I am understanding correctly) has flaky NCCL timeout issues when saving the optimizer state, and according to reports of similar issues, giving it a higher timeout fixes them. I am just trying to see what the timeout should be.
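For context on the mechanism being discussed: in `torch.distributed`, the NCCL timeout is attached to the process group at initialization, and collectives that exceed it (including those issued while saving FSDP optimizer state) abort with a timeout error rather than hanging forever. Below is a minimal sketch of how such a timeout is raised, using an illustrative value rather than the one this PR settles on:

```python
# Sketch only: the timeout value here is illustrative, not the one chosen in this PR.
from datetime import timedelta

import torch.distributed as dist

# Collectives on this process group that exceed the timeout are aborted
# with an NCCL timeout error instead of blocking indefinitely.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=30),  # the NCCL backend default in recent PyTorch is 10 minutes
)
```

Raising the value does not make a genuinely hung collective succeed; it only gives a slow one (such as a large optimizer-state save) more time before it is treated as a failure.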
E2E (NVIDIA L40S x4) workflow launched on this PR: View run

E2E (NVIDIA L40S x4) workflow launched on this PR: View run
model_conf = AutoConfig.from_pretrained(args.model_name_or_path)
args.model_type = model_conf.model_type

# solution discovered from torchtune https://github.com/pytorch/torchtune/issues/2093
This is good stuff. I think we'll also need a tracker bug in our repo that would document what happens, and link any workaround like this one. Eventually - hopefully - we fix whatever the underlying problem is and close the bug (plus revert any hacks that we make along the way).
Apparently we have instructlab/instructlab#3323 so maybe we can use it for tracking purposes for now. Please link to it.
It would be nice to include a link to the issue, but I won't hold on it.
E2E (NVIDIA L40S x4) workflow launched on this PR: View run
Signed-off-by: Charlie Doern <[email protected]>
Smoke failure is #505 and not related.
Below is a list of custom environment variables users can set in the training library.

1. `INSTRUCTLAB_NCCL_TIMEOUT_MS`: controls the NCCL timeout in milliseconds. Consider increasing it if you see FSDP-related NCCL errors.
Maybe we should have the default noted here.
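For illustration, here is a minimal sketch of how `INSTRUCTLAB_NCCL_TIMEOUT_MS` could be consumed at process-group setup; the helper name and the fallback default are hypothetical, not necessarily what the training library ships:

```python
# Hypothetical sketch: the function name and the fallback default below are
# illustrative, not the training library's actual implementation.
import os
from datetime import timedelta

import torch.distributed as dist


def init_distributed_with_env_timeout() -> None:
    # Read the timeout (milliseconds) from the environment; fall back to an
    # arbitrary illustrative default when the variable is unset.
    timeout_ms = int(os.environ.get("INSTRUCTLAB_NCCL_TIMEOUT_MS", "600000"))
    dist.init_process_group(
        backend="nccl",
        timeout=timedelta(milliseconds=timeout_ms),
    )
```

Under torchrun, each rank would export the variable (e.g. `INSTRUCTLAB_NCCL_TIMEOUT_MS=1200000`) and call this before issuing any collectives.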
e2e workflow succeeded on this PR: View run, congrats!

e2e workflow failed on this PR: View run, please investigate.
https://github.com/Mergifyio backport release-v0.10 |
✅ Backports have been created
For posterity: One of the CI jobs above failed, and another passed. @cdoern and @JamesKunstle found that https://github.com/pytorch/pytorch/blob/134179474539648ba7dee1317959529fbd0e7f89/torch/distributed/distributed_c10d.py#L1498 indicates that #508 sets this.
extend nccl timeout to combat CI timeouts w/ FSDP optimizer state saving