@ktdreyer ktdreyer commented Apr 25, 2025

For instructlab, pip install . does not install vllm, but it does install an uncapped torch (2.7.0 currently).

When we install vllm later, we compile a binary flash_attn wheel against torch-2.7.0. vllm-0.8.4 requires torch==2.6.0, so torch is then downgraded to 2.6.0, and we end up running it with a flash_attn binary wheel that was built against 2.7.0 and is incompatible.

To resolve this, use instructlab's constraints-dev.txt in the first pip install operation. This restricts torch to 2.6.0 immediately when we first install instructlab, so that we will compile flash_attn against that torch version.
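
As a rough sketch of the fix described above, the first install can pass the constraints file through pip's `-c` (`--constraint`) option. The helper name `install_with_constraints` and the `PIP` override variable are hypothetical and are not taken from the workflow file; the sketch also assumes it is run from the root of an instructlab checkout where constraints-dev.txt lives.

```shell
# Hypothetical helper for illustration; not copied from the workflow.
# PIP may be overridden (for example PIP="echo pip") to do a dry run.
install_with_constraints() {
    constraints_file="$1"
    # -c / --constraint only caps versions of packages that get installed;
    # it does not add new requirements. With torch pinned in the file,
    # the first install already resolves to the capped torch version.
    ${PIP:-python3 -m pip} install -c "$constraints_file" .
}
```

Calling `install_with_constraints constraints-dev.txt` before the later vllm install should leave torch at the pinned version, so flash_attn gets compiled against it. Running `python3 -c 'import torch; print(torch.__version__)'` afterwards is one way to confirm which torch is in the environment.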

Fixes: #494

Signed-off-by: Ken Dreyer <[email protected]>
@ktdreyer ktdreyer force-pushed the e2e-use-constraints branch from e31adef to b771fc7 Compare April 25, 2025 21:19
@mergify mergify bot added CI/CD Affects CI/CD configuration ci-failure labels Apr 25, 2025
@github-actions

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@github-actions

e2e workflow failed on this PR: View run, please investigate.

@ktdreyer
Contributor Author

This workflow file was copied to this training repository from https://github.com/instructlab/instructlab/blob/main/.github/workflows/e2e-nvidia-l40s-x4.yml, so this problem exists there also.

I'm fixing that in instructlab/instructlab#3320

Contributor

@booxter booxter left a comment

Correct fix. The smoke tests are affected by something else (even if adjacent).

@mergify mergify bot added the one-approval label Apr 28, 2025
@JamesKunstle JamesKunstle merged commit 65056fa into main Apr 28, 2025
7 of 10 checks passed
@JamesKunstle JamesKunstle deleted the e2e-use-constraints branch April 28, 2025 18:03
@mergify mergify bot removed the one-approval label Apr 28, 2025
@ktdreyer
Contributor Author

The E2E (NVIDIA L40S x4) test still fails about three hours in, with NCCL timeout errors. I've filed instructlab/instructlab#3321 to track that problem.

Development

Successfully merging this pull request may close these issues.

e2e large ci failures
