
Conversation

@booxter (Contributor) commented Apr 30, 2025

  • chore: bump pytorch to 2.6.0+
  • feat: Rely on implicit detection of distributed backend

@mergify mergify bot added the dependencies and ci-failure labels Apr 30, 2025
@github-actions (bot) commented May 1, 2025

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@mergify mergify bot removed the ci-failure label May 1, 2025
@github-actions (bot) commented May 1, 2025

e2e workflow succeeded on this PR: View run, congrats!

@booxter (Contributor, Author) commented May 1, 2025

@tiran any particular concerns with bumping the minimal pytorch to 2.6.0+ for the training library? (It's already 2.6.0+ in ilab, so I wouldn't expect any, but better to double-check...)

@booxter (Contributor, Author) commented May 6, 2025

As confirmed by Doug H, this won't change the versions used downstream; ilab already pulls 2.6.0+ for all flavors.
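
For reference, a sketch of what the minimum-version bump might look like in the project's dependency metadata; the file layout and exact pin are assumptions, not copied from this PR's diff:

```toml
# pyproject.toml (hypothetical excerpt, not the actual diff)
[project]
dependencies = [
    "torch>=2.6.0",
]
```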

@booxter booxter marked this pull request as ready for review May 6, 2025 20:09
@booxter booxter requested review from JamesKunstle and RobotSail May 20, 2025 22:34
@mergify mergify bot added the one-approval label May 20, 2025
@mergify mergify bot commented May 20, 2025

This pull request has merge conflicts that must be resolved before it can be merged. @booxter, please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 20, 2025
booxter added 2 commits May 21, 2025 08:58
chore: bump pytorch to 2.6.0+

This is in line with the ilab repo. Later pytorch releases include features we may want access to.

Signed-off-by: Ihar Hrachyshka <[email protected]>
feat: Rely on implicit detection of distributed backend

From the official docs:

```
Since 2.6, if backend is not provided, c10d will use a backend
registered for the device type indicated by the device_id kwarg (if
provided).
```

and:

```
If neither backend nor device_id is provided, c10d will detect the
accelerator on the run-time machine and use a backend registered for
that detected accelerator (or cpu).
```

While the library is still CUDA-centric, this is one tiny step towards a more backend-agnostic implementation.

Signed-off-by: Ihar Hrachyshka <[email protected]>
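
To make the quoted behavior concrete, here is a minimal single-process sketch; this is not code from the PR, and the rendezvous env vars and timeout value are assumptions for a local run:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Assumed single-process rendezvous settings (torchrun normally sets these).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

# No backend argument: since 2.6, c10d detects the accelerator on the
# running machine and uses the backend registered for it (or cpu/gloo).
dist.init_process_group(timeout=timedelta(minutes=30))

assert dist.is_initialized()
dist.destroy_process_group()
```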
```diff
-init = functools.partial(torch.distributed.init_process_group, "nccl")
-if timeout is not None:
-    init(timeout=timeout)
+torch.distributed.init_process_group(timeout=timeout)
```
A reviewer (Contributor) commented on this diff:
ooohhh very nice. I wonder if this helps us get closer to removing our hard dep on CUDA-only training
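
On that note, a hedged sketch of the device-agnostic call path the quoted docs describe; the device-selection logic here is illustrative, not part of this PR, and it assumes torchrun-style env rendezvous:

```python
import torch
import torch.distributed as dist

# If a CUDA device is present, passing device_id makes c10d use the
# backend registered for that device type (NCCL). Otherwise, calling
# init_process_group() with no arguments falls back to runtime
# detection, which resolves to gloo on CPU-only machines.
if torch.cuda.is_available():
    dist.init_process_group(device_id=torch.device("cuda", 0))
else:
    dist.init_process_group()
```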

@mergify mergify bot removed the one-approval label Jun 4, 2025
@booxter booxter removed the request for review from JamesKunstle June 4, 2025 17:45
@mergify mergify bot merged commit 2a1e9b6 into instructlab:main Jun 4, 2025
15 checks passed