feat: support for arm based trainer image using arm runner by jaiakash · Pull Request #3046 · kubeflow/trainer

jaiakash · 2025-12-21T12:26:19Z

What this PR does / why we need it:
This PR:

Adds support for building arm-based trainer images on the arm runner provided by CNCF (oracle-16cpu-64gb-arm64).
Removes the linux/ppc64le platform image for trainer-controller-manager (tentative).

Which issue(s) this PR fixes
Fixes #2422

Checklist:

Docs included if any changes are user facing

coveralls · 2025-12-21T12:31:01Z

Pull Request Test Coverage Report for Build 20409798999

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 51.435%

Totals
Change from base Build 20289967122:	0.0%
Covered Lines:	1237
Relevant Lines:	2405

💛 - Coveralls

jaiakash · 2026-01-15T20:53:19Z

Hi @tenzen-y @astefanutti, can you please take a look at this PR?

astefanutti · 2026-01-16T10:18:09Z

@jaiakash thanks!

/lgtm

/assign @kubeflow/kubeflow-trainer-team

jaiakash · 2026-01-28T13:57:30Z

Hi @tenzen-y @andreyvelich PTAL.

github-actions · 2026-04-28T15:53:15Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

google-oss-prow · 2026-04-29T17:41:18Z

New changes are detected. LGTM label has been removed.

google-oss-prow · 2026-04-29T17:41:24Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from astefanutti. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR restructures the container image build pipeline to build linux/arm64 images on an ARM runner (instead of QEMU emulation), and updates the publishing flow to merge per-arch outputs into multi-arch manifests.

Changes:

Replaced the composite action with a reusable workflow to build/push per-architecture images and upload digests as artifacts.
Updated the main image workflow to run a component×arch matrix (x86 + arm64) and added a follow-up job to merge multi-arch manifests from the uploaded digests.
Dropped linux/ppc64le from trainer-controller-manager builds.
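
The per-architecture half of this flow can be sketched roughly as follows. This is a minimal illustration modeled on the pattern Docker documents for distributing multi-platform builds across runners, not the workflow from this PR. The input names `runner`, `image`, `platform`, and `artifact-suffix` are assumptions made for the example, and registry login is omitted for brevity.

```yaml
# Hypothetical reusable workflow: builds ONE platform natively and records its digest.
name: Build single-platform image (sketch)

on:
  workflow_call:
    inputs:
      runner:            # assumed input: runner label, for example an arm64 runner
        required: true
        type: string
      image:             # assumed input: full image name without a tag
        required: true
        type: string
      platform:          # assumed input: a single platform such as linux/arm64
        required: true
        type: string
      dockerfile:
        required: true
        type: string
      artifact-suffix:   # assumed input: unique per component and arch
        required: true
        type: string

jobs:
  build:
    runs-on: ${{ inputs.runner }}
    steps:
      - uses: actions/checkout@v6
      - uses: docker/setup-buildx-action@v3
      - name: Build and push by digest (no tag)
        id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          file: ${{ inputs.dockerfile }}
          platforms: ${{ inputs.platform }}
          outputs: type=image,name=${{ inputs.image }},push-by-digest=true,name-canonical=true,push=true
      - name: Export digest as an empty file named after it
        env:
          DIGEST: ${{ steps.build.outputs.digest }}
        run: |
          mkdir -p "${RUNNER_TEMP}/digests"
          touch "${RUNNER_TEMP}/digests/${DIGEST#sha256:}"
      - name: Upload digest
        uses: actions/upload-artifact@v4
        with:
          name: digests-${{ inputs.artifact-suffix }}
          path: ${{ runner.temp }}/digests/*
          if-no-files-found: error
          retention-days: 1
```

Because each invocation pushes an untagged image, a later job has to combine the uploaded digests into one tagged multi-arch manifest.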

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `.github/workflows/template-publish-image/publish-images.yaml` | New reusable workflow that builds per-arch images, optionally pushes, and uploads digests for manifest merging. |
| `.github/workflows/template-publish-image/action.yaml` | Removed the old composite action implementation. |
| `.github/workflows/build-and-push-images.yaml` | Switched to distributed builds across x86/arm runners and added a manifest-merge job. |

jaiakash · 2026-04-29T20:23:06Z

Hi, I have updated the PR with the latest changes. Here are a few data points.

For trainer-controller-manager, we are currently set up to build for the ppc64le arch as well. Do we still need it?

trainer/.github/workflows/build-and-push-images.yaml

Line 27 in ffd22a0

platforms: linux/amd64,linux/arm64,linux/ppc64le

FYI, the pytorch/pytorch CUDA base image does not support arm64, so I am building only for the x86 arch.
See https://github.com/orgs/pytorch/packages/container/pytorch/574847194?tag=2.9.1-cuda12.8-cudnn9-runtime

Sample run

A sample run on my fork can be seen here: https://github.com/jaiakash/trainer/actions/runs/25131769251/job/73659612821
I used GitHub's native ubuntu and ubuntu arm runners for testing.

Example Images

dataset-initializer

https://hub.docker.com/r/akashjaiswal03/dataset-initializer/tags
https://github.com/jaiakash/trainer/pkgs/container/trainer%2Fdataset-initializer

model-initializer
https://hub.docker.com/r/akashjaiswal03/model-initializer/tags
https://github.com/jaiakash/trainer/pkgs/container/trainer%2Fmodel-initializer

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Copilot · 2026-04-29T20:28:44Z

+    secrets:
+      DOCKERHUB_USERNAME:
+        required: true
+      DOCKERHUB_TOKEN:
+        required: true
+


The reusable workflow marks Docker Hub secrets as required, which will cause pull_request runs (especially from forks) to fail even though the Docker Hub login steps are skipped; make these secrets optional and conditionally run Docker Hub login/push steps only when the secrets are present and publishing is intended.
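
One way this suggestion could look is sketched below, assuming the reusable workflow only needs Docker Hub credentials when publishing. The `HAS_DOCKERHUB_CREDS` variable is a hypothetical name introduced for the example; it exists because the `secrets` context cannot be read directly in a step-level `if:`.

```yaml
# Sketch only: optional secrets plus a guarded login step.
on:
  workflow_call:
    secrets:
      DOCKERHUB_USERNAME:
        required: false   # pull_request runs from forks may not have it
      DOCKERHUB_TOKEN:
        required: false

jobs:
  build:
    runs-on: ubuntu-latest
    env:
      # Hypothetical flag: evaluates to the string 'true' only when both secrets are set.
      HAS_DOCKERHUB_CREDS: ${{ secrets.DOCKERHUB_USERNAME != '' && secrets.DOCKERHUB_TOKEN != '' }}
    steps:
      - name: Login to Docker Hub
        # Skip on pull_request events and whenever credentials are missing.
        if: ${{ github.event_name == 'push' && env.HAS_DOCKERHUB_CREDS == 'true' }}
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
```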

Copilot · 2026-04-29T20:28:45Z

+  merge-manifests:
+    needs: build-and-publish
+    if: always()
+    runs-on: ubuntu-latest
+    strategy:
+      fail-fast: false
+      matrix:
+        component:
+          - name: trainer-controller-manager
+          - name: model-initializer
+          - name: dataset-initializer
+          - name: deepspeed-runtime
+          - name: xgboost-runtime
+          - name: mlx-runtime
+          - name: torchtune-trainer
+          - name: data-cache
+
    steps:
-      - name: Checkout
-        uses: actions/checkout@v6
+      - name: Download digests
+        uses: actions/download-artifact@v4
+        with:
+          path: /tmp/digests
+          pattern: digests-${{ matrix.component.name }}-*
+          merge-multiple: true



The merge-manifests job is configured with if: always(), but on pull_request runs no digest artifacts are uploaded, so download-artifact and the subsequent imagetools command will fail; gate this job (and the registry login steps) to only run when publishing (e.g., github.event_name == 'push') and ideally only when the build matrix succeeded.
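
A gated version of the merge job might look like the sketch below. The `IMAGE` and `TAG` values are placeholders, registry login is left out, and the `imagetools` invocation follows the commonly documented digest-merge pattern rather than the exact command in this PR.

```yaml
# Sketch only: merge job that runs when publishing and after all builds succeeded.
jobs:
  merge-manifests:
    needs: build-and-publish
    if: ${{ github.event_name == 'push' && needs.build-and-publish.result == 'success' }}
    runs-on: ubuntu-latest
    env:
      IMAGE: ghcr.io/example-org/example-image   # hypothetical image name
      TAG: latest                                # hypothetical tag
    steps:
      - uses: docker/setup-buildx-action@v3
      - name: Download digests
        uses: actions/download-artifact@v4
        with:
          path: /tmp/digests
          pattern: digests-*
          merge-multiple: true
      - name: Create and push the multi-arch manifest
        working-directory: /tmp/digests
        run: |
          # Every file name in this directory is a digest without the sha256: prefix.
          docker buildx imagetools create -t "${IMAGE}:${TAG}" \
            $(printf "${IMAGE}@sha256:%s " *)
```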

Copilot · 2026-04-29T20:28:45Z

+        exclude:
+          - component:
+              name: mlx-runtime
+            arch:
+              name: arm64
+
+          - component:
+              name: torchtune-trainer
+            arch:
+              name: arm64
+
+    uses: ./.github/workflows/publish-images.yaml
+    with:
+      component-name: ${{ matrix.component.name }}
+      dockerfile: ${{ matrix.component.dockerfile }}
+      platforms: ${{ matrix.component[matrix.arch.platforms_key] }}


The matrix exclude entries likely won’t match because component/arch are object values and the exclude only specifies name (not the full object), which can leave arm64 jobs for components without arm_platforms and cause ${{ matrix.component[matrix.arch.platforms_key] }} to evaluate to null/throw; consider restructuring the matrix to enumerate valid (component, arch) pairs via include, or ensure all required object fields are present in the exclude match.
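
The `include`-only restructuring suggested here could be sketched as follows. The Dockerfile paths are placeholders, the runner labels are GitHub-hosted examples rather than the CNCF runners, and the `runner` input is a hypothetical addition to the reusable workflow.

```yaml
# Sketch only: list valid (component, platform) pairs explicitly, so no exclude is needed.
jobs:
  build-and-publish:
    strategy:
      fail-fast: false
      matrix:
        include:
          - component: model-initializer
            dockerfile: path/to/model-initializer/Dockerfile   # placeholder path
            platform: linux/amd64
            runner: ubuntu-latest
          - component: model-initializer
            dockerfile: path/to/model-initializer/Dockerfile   # placeholder path
            platform: linux/arm64
            runner: ubuntu-24.04-arm
          # x86-only component: simply omit the arm64 entry.
          - component: torchtune-trainer
            dockerfile: path/to/torchtune-trainer/Dockerfile   # placeholder path
            platform: linux/amd64
            runner: ubuntu-latest
    uses: ./.github/workflows/publish-images.yaml
    with:
      component-name: ${{ matrix.component }}
      dockerfile: ${{ matrix.dockerfile }}
      platforms: ${{ matrix.platform }}
      runner: ${{ matrix.runner }}   # hypothetical input
```

With scalar matrix values, every job has a defined `platform`, so no expression depends on an optional object key.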

jaiakash · 2026-04-30T09:33:51Z

/retest

andreyvelich

@jaiakash Any thoughts why CI is failing?

andreyvelich · 2026-04-30T16:44:31Z

+# Each invocation builds a single platform natively on the specified runner,
+# pushes by digest (no tag), and uploads the digest as an artifact.
+# The caller workflow assembles the multi-arch manifest in a separate merge job.
+name: Build And Publish Container Images


Do we want to keep this in the template-publish-image/action.yaml?

andreyvelich · 2026-04-30T16:47:00Z

+          # TODO: (jaiakash) pytorch/pytorch CUDA base image does not support arm64
+          # Check this https://github.com/orgs/pytorch/packages/container/pytorch/574847194?tag=2.9.1-cuda12.8-cudnn9-runtime


How do we currently build the arm image for torchtune: https://github.com/kubeflow/trainer/blob/master/.github/workflows/build-and-push-images.yaml#L46 ?

andreyvelich · 2026-04-30T16:49:04Z

+            x86_platforms: linux/amd64
+            arm_platforms: linux/arm64
+
          # TODO (andreyvelich): mlx[cuda] doesn't support arm at the moment: https://github.com/ml-explore/mlx/issues/2469


@jaiakash As far as I can see, mlx[cuda13] supports arm, so we can add it here.

cc @zcbenz

andreyvelich · 2026-04-30T16:52:42Z

+      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
+      DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}
+
+  merge-manifests:


Why do we need to merge manifests?

andreyvelich · 2026-04-30T16:54:13Z

-    runs-on: oracle-vm-16cpu-64gb-x86-64
-
-    env:
-      SHOULD_PUBLISH: ${{ github.event_name == 'push' }}


We still need to keep this env to ensure we don't push images on pull_requests.
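
As a rough sketch of how that guard could carry over to a reusable workflow: workflow-level `env` is not inherited by called workflows, so the variable would be declared inside the called workflow itself, where `github.event_name` still reflects the caller's event. The job layout below is illustrative, not the PR's actual file.

```yaml
# Sketch only: declared inside the reusable workflow.
env:
  SHOULD_PUBLISH: ${{ github.event_name == 'push' }}

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: docker/setup-buildx-action@v3
      - name: Build, and push only on push events
        uses: docker/build-push-action@v6
        with:
          context: .
          # Resolves to the string 'true' or 'false'.
          push: ${{ env.SHOULD_PUBLISH }}
```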

jaiakash · 2026-04-30T19:04:01Z

All arm-based CI jobs are failing; I will review this ASAP.

jaiakash · 2026-05-05T17:59:49Z

/retest

andreyvelich · 2026-05-07T19:03:50Z

@jaiakash Did you get a chance to address comments?

google-oss-prow Bot added the size/L label Dec 21, 2025

google-oss-prow Bot requested review from jinchihe and kuizhiqing December 21, 2025 12:26

jaiakash changed the title ~~add: support for arm based trainer image using arm runner~~ feat: support for arm based trainer image using arm runner Dec 21, 2025

google-oss-prow Bot assigned astefanutti Jan 16, 2026

google-oss-prow Bot added the lgtm label Jan 16, 2026

jaiakash mentioned this pull request Feb 18, 2026

Feat(workflow) Add multi-arch docker buildx support to the image building workflows for building arm64 and amd64 container images kubeflow/pipelines#12804

Merged

2 tasks

github-actions Bot added the lifecycle/stale label Apr 28, 2026

Copilot AI review requested due to automatic review settings April 29, 2026 17:41

jaiakash force-pushed the support-arm-container branch from c9b59d6 to 1e36d16 Compare April 29, 2026 17:41

google-oss-prow Bot removed the lgtm label Apr 29, 2026

Copilot started reviewing on behalf of jaiakash April 29, 2026 17:41 View session

Copilot AI reviewed Apr 29, 2026

jaiakash force-pushed the support-arm-container branch from 1e36d16 to 96f6e02 Compare April 29, 2026 20:19

jaiakash requested a review from Copilot April 29, 2026 20:24

Copilot started reviewing on behalf of jaiakash April 29, 2026 20:24 View session

Copilot AI reviewed Apr 29, 2026

github-actions Bot removed the lifecycle/stale label Apr 29, 2026

andreyvelich reviewed Apr 30, 2026

feat: support for arm runner for build and publish

43a37b8

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

jaiakash added 3 commits May 5, 2026 22:21

fix: resuable workflow and rm action.yaml

d808893

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

fix: merge conflicts and rerun

3d7668e

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

fix: arm runners name and removed unneesaary rm of docker images/data…

529dfb6

… dirs Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>

jaiakash force-pushed the support-arm-container branch from 96f6e02 to 529dfb6 Compare May 5, 2026 17:23
