feat: add medium e2e CI job for each PR #551

cdoern · 2025-05-13T19:38:24Z

This CI job runs the training "SDK" in conjunction with `ilab data generate`. The goal is to test convergence on each PR in a timely manner.

.github/workflows/e2e-nvidia-l40s-x4-sdk.yml

JamesKunstle · 2025-05-20T23:10:55Z

.github/workflows/e2e-nvidia-l40s-x4-sdk.yml

+              {"Key": "GitHubPR", "Value": "${{ github.event.number }}"}
+            ]
+
+  e2e-large-test:


These labels say large runner, but I think you're starting a medium runner.

This is our large runner: l40sx4.

It's a medium job because it does less, but the runner is large. Let me adjust some of these labels.

JamesKunstle

I'm not against merging this as a useful test, because it's much faster than the large e2e that we use in the upstream ilab repo. However, I think it'd be nicer if we called this an integration-e2e rather than an SDK test. We can isolate the SDK testing slightly differently, so that we test against useful upstream data rather than tightly coupling our testing to the SDG output.

booxter · 2025-05-28T13:25:17Z

.github/workflows/e2e-nvidia-l40s-x4-sdk.yml

+
+      - name: Update instructlab-training library
+        run: |
+          export CUDA_HOME="/usr/local/cuda"


Please use `./scripts/install-ilab-with-cuda.sh` from the ilab repo. It should give you the necessary environment.

Hmm, I don't want to install ilab from main, though. This job is meant to be a step away from depending on ilab main, and eventually we don't want any ilab dependency at all. So perhaps I'd rather maintain our own CUDA installation per repo? Or maybe we should split the action into an install-cuda and an install-ilab?
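
For illustration, the split floated in this thread might look roughly like the sketch below, where the CUDA environment is set up in its own step and does not touch ilab at all. The step names and the `bin` directory added to the path are assumptions, and the real install commands are not shown in the thread, so the second step only prints the inherited variable.

```yaml
# Hypothetical sketch: CUDA environment setup kept separate from any ilab install.
# Step names and the extra PATH entry are assumptions, not the PR's final workflow.
steps:
  - name: Set up CUDA environment
    run: |
      # Persist CUDA_HOME for all later steps instead of exporting it per script.
      echo "CUDA_HOME=/usr/local/cuda" >> "$GITHUB_ENV"
      # Assumed layout: the CUDA binaries live under /usr/local/cuda/bin.
      echo "/usr/local/cuda/bin" >> "$GITHUB_PATH"

  - name: Update instructlab-training library
    run: |
      . venv/bin/activate
      # CUDA_HOME is inherited from the previous step; install commands omitted here.
      echo "CUDA_HOME=${CUDA_HOME}"
```

Writing to `$GITHUB_ENV` and `$GITHUB_PATH` is the standard GitHub Actions way to pass environment settings between steps, which is what would let an ilab-specific install step be dropped later without touching the CUDA setup.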

booxter · 2025-05-28T13:26:40Z

.github/workflows/e2e-nvidia-l40s-x4-sdk.yml

+          . venv/bin/activate
+          ls scripts
+          ls ./
+          ./scripts/test-sdk.sh


Apart from this line, is there any significant difference between the current e2e job and this SDK one? We may want to consolidate it under a reusable action; see #563. (I don't insist you adopt it since it's not merged yet, but please add a comment so that we revisit.)
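
A revisit note of the kind requested here could be as small as a comment above the test step. The sketch below is only an assumption about how that might be worded; the step name is hypothetical, and only the two `run` commands are taken from the diff.

```yaml
steps:
  # TODO(#563): once the reusable e2e action is merged, revisit whether this
  # job can be consolidated with the existing e2e workflow.
  - name: Run SDK e2e test  # hypothetical step name
    run: |
      . venv/bin/activate
      ./scripts/test-sdk.sh
```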

booxter · 2025-05-28T13:30:00Z

scripts/test-data/profile-l40s-x4.yaml

@@ -0,0 +1,157 @@
+chat:


Can we trim this file to what was actually modified from the defaults or needed for the SDK tests?

I removed `chat`, but `generate`, `serve`, `eval`, and `train` are needed, I believe.

booxter · 2025-05-28T13:32:19Z

.github/workflows/e2e-nvidia-l40s-x4-sdk.yml

+  pull_request_target:
+    branches:
+      - "main"
+  schedule:


Do we need the schedule? (It was not mentioned in the PR description.)

I can remove the schedule.
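
With the schedule dropped, the trigger block from the hunk above would reduce to something like the following. This is a sketch based only on the lines visible in the diff; whether the real file declares other triggers is not shown here.

```yaml
on:
  pull_request_target:
    branches:
      - "main"
  # No schedule entry: the job is triggered per PR against main only.
```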

cdoern · 2025-06-05T15:59:53Z

@booxter looking at `run-e2e/action.yml`, that script seems to be tied pretty strongly to installing ilab from main, running the ilab e2e scripts, etc. It might make sense to keep this workflow as separate as possible; I don't want to connect this e2e to ilab. Right now it uses `ilab data generate`, but I'd like to remove that ASAP to remove all deps on instructlab/instructlab.

booxter

OK, let's iterate on it after it merges. I think the code to install the flash-attn stuff may live separately in a script. FYI, I think James was looking into splitting it here: #597

cdoern mentioned this pull request May 13, 2025

feat: add medium e2e CI job for each PR #550

Closed

instructlab deleted a comment from mergify bot May 13, 2025

mergify bot added the CI/CD and ci-failure labels May 13, 2025

cdoern force-pushed the new-ci-job branch from a9afcb5 to 40cab04 May 15, 2025 13:49

mergify bot added ci-failure and removed ci-failure labels May 15, 2025

cdoern force-pushed the new-ci-job branch from 40cab04 to 1228c32 May 15, 2025 22:25

mergify bot added ci-failure and removed ci-failure labels May 15, 2025

cdoern force-pushed the new-ci-job branch from 1228c32 to c8793a3 May 16, 2025 12:56

mergify bot added ci-failure and removed ci-failure labels May 16, 2025

cdoern force-pushed the new-ci-job branch from c8793a3 to 9604e28 May 16, 2025 15:02

mergify bot removed the ci-failure label May 16, 2025

JamesKunstle reviewed May 16, 2025

cdoern force-pushed the new-ci-job branch from 9604e28 to 131a065 May 19, 2025 14:08

JamesKunstle reviewed May 20, 2025

booxter reviewed May 28, 2025

cdoern force-pushed the new-ci-job branch from 131a065 to 931fc00 June 5, 2025 15:17

mergify bot added the ci-failure label Jun 5, 2025

cdoern force-pushed the new-ci-job branch from 931fc00 to fef130c June 5, 2025 15:21

mergify bot added ci-failure and removed ci-failure labels Jun 5, 2025

cdoern force-pushed the new-ci-job branch from fef130c to 7e8ac7e June 5, 2025 15:27

mergify bot added ci-failure and removed ci-failure labels Jun 5, 2025

cdoern force-pushed the new-ci-job branch from 7e8ac7e to 39160e0 June 5, 2025 15:31

mergify bot removed the ci-failure label Jun 5, 2025

mergify bot added the ci-failure label Jun 5, 2025

cdoern force-pushed the new-ci-job branch from 39160e0 to 3dc143f June 5, 2025 15:46

mergify bot removed the ci-failure label Jun 5, 2025

feat: add medium e2e CI job for each PR

1d6744c

This CI job runs the "SDK" of training in conjunction with `ilab data generate`. The goal here is to test convergence on each PR in a timely manner

Signed-off-by: Charlie Doern <[email protected]>

cdoern force-pushed the new-ci-job branch from 3dc143f to 1d6744c June 5, 2025 15:50

mergify bot added the ci-failure label Jun 5, 2025

booxter approved these changes Jun 10, 2025

mergify bot added the one-approval label Jun 10, 2025

JamesKunstle approved these changes Jun 10, 2025

mergify bot merged commit aacd5b4 into main Jun 10, 2025
24 of 25 checks passed

mergify bot removed the one-approval label Jun 10, 2025

mergify bot deleted the new-ci-job branch June 10, 2025 18:37

mergify bot removed the ci-failure label Jun 10, 2025
