@cdoern
Contributor

@cdoern cdoern commented May 13, 2025

This CI job runs the training "SDK" in conjunction with `ilab data generate`. The goal here is to test convergence on each PR in a timely manner.

@instructlab instructlab deleted a comment from mergify bot May 13, 2025
@mergify mergify bot added CI/CD Affects CI/CD configuration ci-failure labels May 13, 2025
@mergify mergify bot added ci-failure and removed ci-failure labels May 15, 2025
@mergify mergify bot added ci-failure and removed ci-failure labels May 15, 2025
@mergify mergify bot added ci-failure and removed ci-failure labels May 16, 2025
@mergify mergify bot removed the ci-failure label May 16, 2025
{"Key": "GitHubPR", "Value": "${{ github.event.number }}"}
]
e2e-large-test:
Contributor

these are saying large runner, but I think you're starting a medium runner.

Contributor Author

this is our large runner -- l40sx4

Contributor Author

it's a medium job because it does less, but the runner is large; let me adjust some of these labels

Contributor

@JamesKunstle JamesKunstle left a comment

I'm not against merging this as a useful test because it's much faster than the large e2e that we use in the upstream ilab repo. However, I think it'd be nicer if we called this an integration-e2e rather than an SDK test. We can isolate the SDK testing slightly differently, such that we're testing against useful upstream data rather than tightly coupling our testing to the SDG output.

- name: Update instructlab-training library
run: |
export CUDA_HOME="/usr/local/cuda"
Contributor

please use ./scripts/install-ilab-with-cuda.sh from ilab repo. It should give you the necessary environment.

Contributor Author

hmmm, I don't want to install ilab from main though. This is meant to be a step away from having dependencies on ilab main, and eventually we don't want any ilab dependency at all. So I'd instead like to maintain our own CUDA installation per repo, perhaps? Or maybe we should split the action into an install-cuda and an install-ilab?
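
A minimal sketch of what the CUDA-only half of such a split could look like, assuming it lives in a sourceable shell helper. The function name `setup_cuda_env` and the `PATH`/`LD_LIBRARY_PATH` handling are hypothetical and not taken from this PR; only the `CUDA_HOME="/usr/local/cuda"` default comes from the workflow step under review.

```shell
#!/usr/bin/env bash
# Hypothetical "install-cuda" helper: configures the CUDA environment only,
# with no dependency on ilab. Names are illustrative, not from the PR.

setup_cuda_env() {
    # Default mirrors the CUDA_HOME value exported in the workflow under review.
    local cuda_home="${1:-/usr/local/cuda}"

    export CUDA_HOME="${cuda_home}"
    # Put the CUDA binaries (e.g. nvcc) ahead of everything else on PATH.
    export PATH="${CUDA_HOME}/bin:${PATH}"
    # Prepend the CUDA runtime libraries; avoid a trailing ":" when the
    # variable was previously unset or empty.
    export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}
```

A workflow step could then source this file and call `setup_cuda_env` before installing packages that compile against CUDA, leaving any ilab installation to a separate step.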

. venv/bin/activate
ls scripts
ls ./
./scripts/test-sdk.sh
Contributor

Except for this line, is there any significant difference between the current e2e job and this sdk one? We may want to consolidate it under a reusable action, see #563 (I don't insist you adopt it since it's not merged yet, but please put a comment so that we revisit.)

@@ -0,0 +1,157 @@
chat:
Contributor

can we trim this file to what was actually modified from the defaults / needed for sdk tests?

Contributor Author

sure

Contributor Author

I removed chat, but generate, serve, eval, train are needed I believe

pull_request_target:
branches:
- "main"
schedule:
Contributor

do we need the schedule? (it was not mentioned in the PR description)

Contributor Author

I can remove schedule

This CI job runs the "SDK" of training in conjunction with `ilab data generate`. The goal here is to test convergence on each PR in a timely manner

Signed-off-by: Charlie Doern <[email protected]>
@cdoern
Contributor Author

cdoern commented Jun 5, 2025

@booxter looking at `run-e2e/action.yml`, that script seems to be tied pretty strongly to installing ilab from main, running the ilab e2e scripts, etc. It might make sense here to keep this workflow as separate as possible; I don't want to connect this e2e to ilab. Right now it uses `ilab data generate`, but I'd like to remove that ASAP to remove all deps on instructlab/instructlab.

@mergify mergify bot added the ci-failure label Jun 5, 2025
Contributor

@booxter booxter left a comment

OK let's iterate on it after it merges. I think the code to install flash-attn stuff may live separately in a script. FYI I think James was looking into splitting it here: #597

@mergify mergify bot added the one-approval label Jun 10, 2025
@mergify mergify bot merged commit aacd5b4 into main Jun 10, 2025
24 of 25 checks passed
@mergify mergify bot removed the one-approval label Jun 10, 2025
@mergify mergify bot deleted the new-ci-job branch June 10, 2025 18:37
@mergify mergify bot removed the ci-failure label Jun 10, 2025