feat: support for arm based trainer image using arm runner #3046
feat: support for arm based trainer image using arm runner #3046jaiakash wants to merge 4 commits into
Conversation
Pull Request Test Coverage Report for Build 20409798999Details
💛 - Coveralls |
|
Hi @tenzen-y @astefanutti Can you please take a look at this PR. |
|
@jaiakash thanks! /lgtm /assign @kubeflow/kubeflow-trainer-team |
|
Hi @tenzen-y @andreyvelich PTAL. |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
c9b59d6 to
1e36d16
Compare
|
New changes are detected. LGTM label has been removed. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR restructures the container image build pipeline to build linux/arm64 images on an ARM runner (instead of QEMU emulation), and updates the publishing flow to merge per-arch outputs into multi-arch manifests.
Changes:
- Replaced the composite action with a reusable workflow to build/push per-architecture images and upload digests as artifacts.
- Updated the main image workflow to run a component×arch matrix (x86 + arm64) and added a follow-up job to merge multi-arch manifests from the uploaded digests.
- Dropped
linux/ppc64lefromtrainer-controller-managerbuilds.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
.github/workflows/template-publish-image/publish-images.yaml |
New reusable workflow that builds per-arch images, optionally pushes, and uploads digests for manifest merging. |
.github/workflows/template-publish-image/action.yaml |
Removed the old composite action implementation. |
.github/workflows/build-and-push-images.yaml |
Switched to distributed builds across x86/arm runners and added a manifest-merge job. |
1e36d16 to
96f6e02
Compare
|
Hi I have updated the PR with latest changes and here are few data points.
Sample runon my fork can be seen here: https://github.com/jaiakash/trainer/actions/runs/25131769251/job/73659612821 Example Imagesdataset-initializer https://hub.docker.com/r/akashjaiswal03/dataset-initializer/tags model-initializer |
| secrets: | ||
| DOCKERHUB_USERNAME: | ||
| required: true | ||
| DOCKERHUB_TOKEN: | ||
| required: true | ||
|
|
There was a problem hiding this comment.
The reusable workflow marks Docker Hub secrets as required, which will cause pull_request runs (especially from forks) to fail even though the Docker Hub login steps are skipped; make these secrets optional and conditionally run Docker Hub login/push steps only when the secrets are present and publishing is intended.
| merge-manifests: | ||
| needs: build-and-publish | ||
| if: always() | ||
| runs-on: ubuntu-latest | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| component: | ||
| - name: trainer-controller-manager | ||
| - name: model-initializer | ||
| - name: dataset-initializer | ||
| - name: deepspeed-runtime | ||
| - name: xgboost-runtime | ||
| - name: mlx-runtime | ||
| - name: torchtune-trainer | ||
| - name: data-cache | ||
|
|
||
| steps: | ||
| - name: Checkout | ||
| uses: actions/checkout@v6 | ||
| - name: Download digests | ||
| uses: actions/download-artifact@v4 | ||
| with: | ||
| path: /tmp/digests | ||
| pattern: digests-${{ matrix.component.name }}-* | ||
| merge-multiple: true | ||
|
|
There was a problem hiding this comment.
The merge-manifests job is configured with if: always(), but on pull_request runs no digest artifacts are uploaded, so download-artifact and the subsequent imagetools command will fail; gate this job (and the registry login steps) to only run when publishing (e.g., github.event_name == 'push') and ideally only when the build matrix succeeded.
| exclude: | ||
| - component: | ||
| name: mlx-runtime | ||
| arch: | ||
| name: arm64 | ||
|
|
||
| - component: | ||
| name: torchtune-trainer | ||
| arch: | ||
| name: arm64 | ||
|
|
||
| uses: ./.github/workflows/publish-images.yaml | ||
| with: | ||
| component-name: ${{ matrix.component.name }} | ||
| dockerfile: ${{ matrix.component.dockerfile }} | ||
| platforms: ${{ matrix.component[matrix.arch.platforms_key] }} |
There was a problem hiding this comment.
The matrix exclude entries likely won’t match because component/arch are object values and the exclude only specifies name (not the full object), which can leave arm64 jobs for components without arm_platforms and cause ${{ matrix.component[matrix.arch.platforms_key] }} to evaluate to null/throw; consider restructuring the matrix to enumerate valid (component, arch) pairs via include, or ensure all required object fields are present in the exclude match.
|
/retest |
andreyvelich
left a comment
There was a problem hiding this comment.
@jaiakash Any thoughts why CI is failing?
| # Each invocation builds a single platform natively on the specified runner, | ||
| # pushes by digest (no tag), and uploads the digest as an artifact. | ||
| # The caller workflow assembles the multi-arch manifest in a separate merge job. | ||
| name: Build And Publish Container Images |
There was a problem hiding this comment.
Do we want to keep this in the template-publish-image/action.yaml?
| # TODO: (jaiakash) pytorch/pytorch CUDA base image does not support arm64 | ||
| # Check this https://github.com/orgs/pytorch/packages/container/pytorch/574847194?tag=2.9.1-cuda12.8-cudnn9-runtime |
There was a problem hiding this comment.
How we currently build arm image for torchtune: https://github.com/kubeflow/trainer/blob/master/.github/workflows/build-and-push-images.yaml#L46 ?
| x86_platforms: linux/amd64 | ||
| arm_platforms: linux/arm64 | ||
|
|
||
| # TODO (andreyvelich): mlx[cuda] doesn't support arm at the moment: https://github.com/ml-explore/mlx/issues/2469 |
| DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }} | ||
| DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }} | ||
|
|
||
| merge-manifests: |
There was a problem hiding this comment.
We do we need to merge manifests?
| runs-on: oracle-vm-16cpu-64gb-x86-64 | ||
|
|
||
| env: | ||
| SHOULD_PUBLISH: ${{ github.event_name == 'push' }} |
There was a problem hiding this comment.
We still need to keep this env to ensure we don't push images on pull_requests.
|
All arm based CI are failing, i will review this asap |
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
… dirs Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
96f6e02 to
529dfb6
Compare
|
/retest |
|
@jaiakash Did you get a chance to address comments? |
What this PR does / why we need it:
This PR adds
armbased trainer images using arm runner provided by CNCF. (oracle-16cpu-64gb-arm64)linux/ppc64leplatform image fortrainer-controller-manager(tentaive)Which issue(s) this PR fixes
Fixes #2422
Checklist: