-
Notifications
You must be signed in to change notification settings - Fork 74
feat: add medium e2e CI job for each PR #551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| {"Key": "GitHubPR", "Value": "${{ github.event.number }}"} | ||
| ] | ||
| e2e-large-test: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are saying large runner, but I think you're starting a medium runner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is our large runner -- l40sx4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its a medium job because it does less, but the runner is large, let me adjust some of these labels
JamesKunstle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not against merging this as a useful test because it's much faster than the large e2e that we use in the upstream ilab repo. However, I think it'd be nicer if we called this an integration-e2e rather than an sdk test. We can isolate the SDK testing slightly differently s.t. we're testing against useful upstream data rather than tightly coupling our testing vs. the SDG output.
| - name: Update instructlab-training library | ||
| run: | | ||
| export CUDA_HOME="/usr/local/cuda" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use ./scripts/install-ilab-with-cuda.sh from ilab repo. It should give you the necessary environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, I don't want to install ilab from main though, this is meant to be a step away from having dependencies on ilab main and eventually we don't want any ilab dependency at all. So I'd instead like to maintain our own CUDA installation per repo perhaps? or maybe we should split the action into a install-cuda and install-ilab?
| . venv/bin/activate | ||
| ls scripts | ||
| ls ./ | ||
| ./scripts/test-sdk.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except this line, is there any significant difference between the current e2e job and this sdk one? We may want to consolidate it under reusable action, see #563 (I don't insist you adopt it since it's not merged yet, but please put a comment so that we revisit.)
| @@ -0,0 +1,157 @@ | |||
| chat: | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we trim this file to what was actually modified from the defaults / needed for sdk tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed chat, but generate, serve, eval, train are needed I believe
| pull_request_target: | ||
| branches: | ||
| - "main" | ||
| schedule: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need the schedule? (it was not mentioned in the PR description)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can remove schedule
This CI job runs the "SDK" of training in conjunction with `ilab data generate`. The goal here is to test convergence on each PR in a timely manner Signed-off-by: Charlie Doern <[email protected]>
|
@booxter looking at |
booxter
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK let's iterate on it after it merges. I think the code to install flash-attn stuff may live separately in a script. FYI I think James was looking into splitting it here: #597
This CI job runs the "SDK" of training in conjunction with
ilab data generate. The goal here is to test convergence on each PR in a timely manner