Skip to content

Conversation

@cdoern
Copy link
Contributor

@cdoern cdoern commented Apr 29, 2025

extend nccl timeout to combat CI timeouts w/ FSDP optimizer state saving

@github-actions
Copy link

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@mergify mergify bot added the ci-failure label Apr 29, 2025
@RobotSail
Copy link
Member

@cdoern Why do we need to set the NCCL timeout to be so high? If NCCL is hanging, we would want it to fail sooner than later with the exception of debugging. It seems like maybe I'm missing some context here.

@github-actions
Copy link

e2e workflow succeeded on this PR: View run, congrats!

@cdoern
Copy link
Contributor Author

cdoern commented Apr 29, 2025

@RobotSail , was just testing given some stuff I read online, I am going to be lowering the value until I find a suitable candidate

@mergify mergify bot added the documentation Improvements or additions to documentation label Apr 29, 2025
@mergify mergify bot removed the ci-failure label Apr 29, 2025
@github-actions
Copy link

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@cdoern
Copy link
Contributor Author

cdoern commented Apr 29, 2025

for context @RobotSail , it seems FSDP (particularly fsdp2 if I am understanding correctly) has flaky issues with NCCL timeouts when saving the optimizer state, so arbitrarily giving a higher timeout fixes the issues according to reports with similar issues. I am just trying to see what the timeout should be

@github-actions
Copy link

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@mergify mergify bot added the ci-failure label Apr 29, 2025
@mergify mergify bot removed the ci-failure label Apr 29, 2025
@mergify mergify bot added CI/CD Affects CI/CD configuration testing Relates to testing labels Apr 29, 2025
@github-actions
Copy link

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

model_conf = AutoConfig.from_pretrained(args.model_name_or_path)
args.model_type = model_conf.model_type

# solution discovered from torchtune https://github.com/pytorch/torchtune/issues/2093
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good stuff. I think we'll also need a tracker bug in our repo that would document what happens, and link any workaround like this one. Eventually - hopefully - we fix whatever the underlying problem is and close the bug (plus revert any hacks that we make along the way).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently we have instructlab/instructlab#3323 so maybe we can use it for tracking purposes for now. Please link to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to include a link to issue but I won't hold on it.

@github-actions
Copy link

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

Signed-off-by: Charlie Doern <[email protected]>
@mergify mergify bot added the one-approval label Apr 29, 2025
@booxter
Copy link
Contributor

booxter commented Apr 29, 2025

Smoke failure is #505 and not related.


Below is a list of custom environment variables users can set in the training library.

1. `INSTRUCTLAB_NCCL_TIMEOUT_MS`, this environment variable controls the NCCL timeout in milliseconds. Consider increasing if seeing FSDP related NCCL errors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe should have the default noted here.

@mergify mergify bot removed the one-approval label Apr 29, 2025
@mergify mergify bot merged commit 7017853 into instructlab:main Apr 29, 2025
15 of 16 checks passed
@github-actions
Copy link

e2e workflow succeeded on this PR: View run, congrats!

@github-actions
Copy link

e2e workflow failed on this PR: View run, please investigate.

@cdoern
Copy link
Contributor Author

cdoern commented Apr 29, 2025

https://github.com/Mergifyio backport release-v0.10

@mergify
Copy link
Contributor

mergify bot commented Apr 29, 2025

backport release-v0.10

✅ Backports have been created

@ktdreyer
Copy link
Contributor

For posterity:

One of the CI jobs above failed, and another passed.

@cdoern and @JamesKunstle found that https://github.com/pytorch/pytorch/blob/134179474539648ba7dee1317959529fbd0e7f89/torch/distributed/distributed_c10d.py#L1498 indicates that timeout= is ignored unless this env var is set to 1.

#508 sets this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI/CD Affects CI/CD configuration ci-failure documentation Improvements or additions to documentation testing Relates to testing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants