fix: extend nccl timeout #507

cdoern · 2025-04-29T01:11:38Z

Extend the NCCL timeout to combat CI timeouts when saving FSDP optimizer state.

github-actions · 2025-04-29T01:16:02Z

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

RobotSail · 2025-04-29T03:32:06Z

@cdoern Why do we need to set the NCCL timeout so high? If NCCL is hanging, we would want it to fail sooner rather than later, except when debugging. Maybe I'm missing some context here.

github-actions · 2025-04-29T04:53:01Z

e2e workflow succeeded on this PR: View run, congrats!

src/instructlab/training/main_ds.py

cdoern · 2025-04-29T12:29:22Z

@RobotSail, I was just testing based on some things I read online. I am going to lower the value until I find a suitable candidate.

github-actions · 2025-04-29T12:45:21Z

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

cdoern · 2025-04-29T12:50:03Z

For context, @RobotSail: it seems FSDP (particularly FSDP2, if I understand correctly) hits flaky NCCL timeouts when saving the optimizer state, and reports of similar issues say that setting an arbitrarily higher timeout fixes them. I am just trying to work out what the timeout should be.

github-actions · 2025-04-29T12:51:39Z

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

README.md

src/instructlab/training/main_ds.py

github-actions · 2025-04-29T14:47:47Z

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

booxter · 2025-04-29T15:27:56Z

src/instructlab/training/main_ds.py

    model_conf = AutoConfig.from_pretrained(args.model_name_or_path)
    args.model_type = model_conf.model_type

+    # solution discovered from torchtune https://github.com/pytorch/torchtune/issues/2093


This is good stuff. I think we'll also need a tracker bug in our repo that would document what happens, and link any workaround like this one. Eventually - hopefully - we fix whatever the underlying problem is and close the bug (plus revert any hacks that we make along the way).

Apparently we have instructlab/instructlab#3323 so maybe we can use it for tracking purposes for now. Please link to it.

It would be nice to include a link to the issue, but I won't hold the PR on it.

tox.ini

github-actions · 2025-04-29T15:38:36Z

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

src/instructlab/training/main_ds.py

booxter · 2025-04-29T17:17:30Z

Smoke failure is #505 and not related.

JamesKunstle · 2025-04-29T17:47:59Z

README.md

+
+Below is a list of custom environment variables users can set in the training library.
+
+1. `INSTRUCTLAB_NCCL_TIMEOUT_MS`, this environment variable controls the NCCL timeout in milliseconds. Consider increasing if seeing FSDP related NCCL errors.


Maybe the default should be noted here.
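For illustration, a minimal sketch of how a millisecond-valued variable such as `INSTRUCTLAB_NCCL_TIMEOUT_MS` might be read and converted into a `timedelta`. The helper name and the 10-minute fallback are assumptions made for this example; the thread does not state the library's actual default or how `main_ds.py` parses the value.

```python
import os
from datetime import timedelta

# Variable name taken from the README change quoted above.
ENV_VAR = "INSTRUCTLAB_NCCL_TIMEOUT_MS"

# Assumption for illustration only: the thread does not state the real default.
ASSUMED_DEFAULT_MS = 10 * 60 * 1000  # 10 minutes


def nccl_timeout_from_env(environ=None, default_ms=ASSUMED_DEFAULT_MS):
    """Return the NCCL timeout as a timedelta (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    raw = environ.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        # Variable unset or empty: fall back to the assumed default.
        return timedelta(milliseconds=default_ms)
    timeout_ms = int(raw)  # raises ValueError on non-numeric input
    if timeout_ms <= 0:
        raise ValueError(f"{ENV_VAR} must be a positive integer, got {raw!r}")
    return timedelta(milliseconds=timeout_ms)
```

Passing the environment as a mapping keeps the helper easy to test; a caller would normally rely on the `os.environ` default.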

github-actions · 2025-04-29T18:29:20Z

e2e workflow succeeded on this PR: View run, congrats!

github-actions · 2025-04-29T18:30:25Z

e2e workflow failed on this PR: View run, please investigate.

cdoern · 2025-04-29T19:44:36Z

@Mergifyio backport release-v0.10

mergify · 2025-04-29T19:44:42Z

backport release-v0.10

✅ Backports have been created

#509 fix: extend nccl timeout (backport #507) has been created for branch release-v0.10

ktdreyer · 2025-04-29T21:10:19Z

For posterity:

One of the CI jobs above failed, and another passed.

@cdoern and @JamesKunstle found that https://github.com/pytorch/pytorch/blob/134179474539648ba7dee1317959529fbd0e7f89/torch/distributed/distributed_c10d.py#L1498 indicates that timeout= is ignored unless this env var is set to 1.

#508 sets this.
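A rough sketch of the ordering this implies: the flag has to be set in the environment before the process group is created, otherwise the `timeout=` argument has no effect. The thread does not name the variable, so the hypothetical helper below takes it as a parameter rather than guessing; `prepare_nccl_init` is not part of the training library.

```python
import os
from datetime import timedelta


def prepare_nccl_init(flag_name, timeout, environ=None):
    """Set the given flag to "1" and build init kwargs (hypothetical helper).

    flag_name is whichever variable the linked PyTorch source requires;
    it is deliberately left to the caller because this thread does not name it.
    """
    if environ is None:
        environ = os.environ
    # Must happen before the process group is initialized.
    environ[flag_name] = "1"
    return {"backend": "nccl", "timeout": timeout}


# Usage (not executed here; needs torch and a distributed launch):
#   import torch.distributed as dist
#   kwargs = prepare_nccl_init(flag_name, timedelta(minutes=30))
#   dist.init_process_group(**kwargs)
```

`backend` and `timeout` are existing keyword arguments of `torch.distributed.init_process_group`; everything else here is illustrative.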

mergify bot added the ci-failure label Apr 29, 2025

courtneypacheco reviewed Apr 29, 2025

src/instructlab/training/main_ds.py (outdated)

cdoern force-pushed the nccl-timeout branch from afb6c24 to 8917e38 Compare April 29, 2025 12:40

mergify bot added the documentation (Improvements or additions to documentation) label Apr 29, 2025

cdoern force-pushed the nccl-timeout branch from 8917e38 to 75d7eb3 Compare April 29, 2025 12:41

mergify bot removed the ci-failure label Apr 29, 2025

mergify bot added the ci-failure label Apr 29, 2025

cdoern force-pushed the nccl-timeout branch from 75d7eb3 to f38a6b7 Compare April 29, 2025 12:46

mergify bot removed the ci-failure label Apr 29, 2025

courtneypacheco requested a review from RobotSail April 29, 2025 12:48

mergify bot added the ci-failure label Apr 29, 2025

booxter suggested changes Apr 29, 2025

README.md (outdated)

README.md

src/instructlab/training/main_ds.py

src/instructlab/training/main_ds.py (outdated)

src/instructlab/training/main_ds.py

cdoern force-pushed the nccl-timeout branch from f38a6b7 to 8616075 Compare April 29, 2025 14:40

mergify bot removed the ci-failure label Apr 29, 2025

cdoern force-pushed the nccl-timeout branch from 8616075 to 8417627 Compare April 29, 2025 14:46

mergify bot added the CI/CD (Affects CI/CD configuration) and testing (Relates to testing) labels Apr 29, 2025

booxter reviewed Apr 29, 2025

tox.ini (outdated)

cdoern force-pushed the nccl-timeout branch from 8417627 to 7b7fbc6 Compare April 29, 2025 15:39

fix: extend nccl timeout

98a8939

Signed-off-by: Charlie Doern <[email protected]>

cdoern force-pushed the nccl-timeout branch from 7b7fbc6 to 98a8939 Compare April 29, 2025 15:41

ktdreyer mentioned this pull request Apr 29, 2025

e2e tests fail with "NCCL operations have failed or timed out" instructlab/instructlab#3321

Closed

mergify bot added the ci-failure label Apr 29, 2025

booxter reviewed Apr 29, 2025

src/instructlab/training/main_ds.py

booxter approved these changes Apr 29, 2025

mergify bot added the one-approval label Apr 29, 2025

JamesKunstle reviewed Apr 29, 2025

JamesKunstle approved these changes Apr 29, 2025

mergify bot removed the one-approval label Apr 29, 2025

RobotSail approved these changes Apr 29, 2025

mergify bot merged commit 7017853 into instructlab:main Apr 29, 2025
15 of 16 checks passed

mergify bot mentioned this pull request Apr 29, 2025

fix: extend nccl timeout (backport #507) #509

Closed

cdoern mentioned this pull request Apr 30, 2025

Revert "fix: extend nccl timeout" #515

Closed

