
Conversation

@booxter (Contributor) commented Apr 30, 2025

  • chore: bump pytorch to 2.6.0+
  • feat: Rely on implicit detection of distributed backend

@mergify mergify bot added the dependencies and ci-failure labels Apr 30, 2025
@github-actions (bot) commented May 1, 2025

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@mergify mergify bot removed the ci-failure label May 1, 2025
@github-actions (bot) commented May 1, 2025

e2e workflow succeeded on this PR: View run, congrats!

@booxter (Contributor, Author) commented May 1, 2025

@tiran any particular concerns with bumping the minimal pytorch to 2.6.0+ for the training library? (It's already 2.6.0+ in ilab, so I wouldn't expect any, but better to double-check...)

@booxter (Contributor, Author) commented May 6, 2025

As confirmed by Doug H, this won't change the versions used downstream; ilab already pulls 2.6.0+ for all flavors.
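
For reference, a sketch of what the minimum-version bump might look like in the project's dependency metadata; the file layout and exact pin are assumptions, not copied from this PR's diff:

```toml
# pyproject.toml (hypothetical excerpt, not the actual diff)
[project]
dependencies = [
    "torch>=2.6.0",
]
```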

@booxter booxter marked this pull request as ready for review May 6, 2025 20:09
@booxter booxter requested review from JamesKunstle and RobotSail May 20, 2025 22:34
@mergify mergify bot added the one-approval label May 20, 2025
@mergify mergify bot commented May 20, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @booxter, please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 20, 2025
booxter added 2 commits May 21, 2025 08:58
chore: bump pytorch to 2.6.0+

This is in line with the ilab repo. Later pytorch releases include features we may want access to.

Signed-off-by: Ihar Hrachyshka <[email protected]>
feat: Rely on implicit detection of distributed backend

From the official docs:

```
Since 2.6, if backend is not provided, c10d will use a backend
registered for the device type indicated by the device_id kwarg (if
provided).
```

and:

```
If neither backend nor device_id is provided, c10d will detect the
accelerator on the run-time machine and use a backend registered for
that detected accelerator (or cpu).
```

While the library is still CUDA-centric, this is one tiny step towards a more backend-agnostic implementation.

Signed-off-by: Ihar Hrachyshka <[email protected]>
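
To make the quoted behavior concrete, here is a minimal single-process sketch; this is not code from the PR, and the rendezvous env vars and timeout value are assumptions for a local run:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Assumed single-process rendezvous settings (torchrun normally sets these).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# No backend argument: since 2.6, c10d detects the accelerator on the
# running machine and uses the backend registered for it (or cpu/gloo).
dist.init_process_group(timeout=timedelta(minutes=30))

assert dist.is_initialized()
dist.destroy_process_group()
```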
```diff
-init = functools.partial(torch.distributed.init_process_group, "nccl")
-if timeout is not None:
-    init(timeout=timeout)
+torch.distributed.init_process_group(timeout=timeout)
```
A reviewer (Contributor) commented on this diff:
ooohhh very nice. I wonder if this helps us get closer to removing our hard dep on CUDA-only training
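
On that note, a hedged sketch of the device-agnostic call path the quoted docs describe; the device-selection logic here is illustrative, not part of this PR, and it assumes torchrun-style env rendezvous:

```python
import torch
import torch.distributed as dist

# If a CUDA device is present, passing device_id makes c10d use the
# backend registered for that device type (NCCL). Otherwise, calling
# init_process_group() with no arguments falls back to runtime
# detection, which resolves to gloo on CPU-only machines.
if torch.cuda.is_available():
    dist.init_process_group(device_id=torch.device("cuda", 0))
else:
    dist.init_process_group()
```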

@mergify mergify bot removed the one-approval label Jun 4, 2025
@booxter booxter removed the request for review from JamesKunstle June 4, 2025 17:45
@mergify mergify bot merged commit 2a1e9b6 into instructlab:main Jun 4, 2025
15 checks passed