feat: support for arm based trainer image using arm runner #3046

Open
jaiakash wants to merge 4 commits into kubeflow:master from jaiakash:support-arm-container

@jaiakash
Member

What this PR does / why we need it:
This PR:

  • Adds support for building arm-based trainer images on the ARM runner provided by CNCF (oracle-16cpu-64gb-arm64).
  • Removes the linux/ppc64le platform image for trainer-controller-manager (tentative).

Which issue(s) this PR fixes
Fixes #2422

Checklist:

  • Docs included if any changes are user facing

@jaiakash jaiakash changed the title add: support for arm based trainer image using arm runner feat: support for arm based trainer image using arm runner Dec 21, 2025
@coveralls

Pull Request Test Coverage Report for Build 20409798999

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 51.435%

Totals Coverage Status
Change from base Build 20289967122: 0.0%
Covered Lines: 1237
Relevant Lines: 2405

💛 - Coveralls

@jaiakash
Member Author

Hi @tenzen-y @astefanutti Can you please take a look at this PR.

@astefanutti
Contributor

@jaiakash thanks!

/lgtm

/assign @kubeflow/kubeflow-trainer-team

@jaiakash
Member Author

Hi @tenzen-y @andreyvelich PTAL.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Copilot AI review requested due to automatic review settings April 29, 2026 17:41
@jaiakash jaiakash force-pushed the support-arm-container branch from c9b59d6 to 1e36d16 Compare April 29, 2026 17:41
@google-oss-prow google-oss-prow Bot removed the lgtm label Apr 29, 2026
@google-oss-prow

New changes are detected. LGTM label has been removed.

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from astefanutti. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment

Pull request overview

This PR restructures the container image build pipeline to build linux/arm64 images on an ARM runner (instead of QEMU emulation), and updates the publishing flow to merge per-arch outputs into multi-arch manifests.

Changes:

  • Replaced the composite action with a reusable workflow to build/push per-architecture images and upload digests as artifacts.
  • Updated the main image workflow to run a component×arch matrix (x86 + arm64) and added a follow-up job to merge multi-arch manifests from the uploaded digests.
  • Dropped linux/ppc64le from trainer-controller-manager builds.
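
As a rough illustration of the digest-based flow summarized in the list above, here is a minimal sketch. It is not the PR's actual workflow. It assumes the commonly documented Docker pattern in which each per-architecture job pushes its image by digest and records the digest as an empty file, and a follow-up job combines the digests with `docker buildx imagetools create`. The image name, the `latest` tag, the job and step ids, and the artifact names are placeholders, and registry login steps are omitted for brevity.

```yaml
# Hypothetical sketch: names, tags, and ids are placeholders, not taken from the PR.
on: push

env:
  IMAGE: ghcr.io/example-org/example-image  # placeholder image name

jobs:
  build:
    strategy:
      fail-fast: false
      matrix:
        include:
          - arch: amd64
            runner: ubuntu-latest
          - arch: arm64
            runner: ubuntu-24.04-arm
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v6
      - uses: docker/setup-buildx-action@v3
      - name: Build natively and push by digest (no tag)
        id: build
        uses: docker/build-push-action@v6
        with:
          context: .
          platforms: linux/${{ matrix.arch }}
          outputs: type=image,name=${{ env.IMAGE }},push-by-digest=true,name-canonical=true,push=true
      - name: Record the digest as an empty file named after it
        run: |
          mkdir -p /tmp/digests
          digest="${{ steps.build.outputs.digest }}"
          touch "/tmp/digests/${digest#sha256:}"
      - uses: actions/upload-artifact@v4
        with:
          name: digests-${{ matrix.arch }}
          path: /tmp/digests/*
          if-no-files-found: error

  merge:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          path: /tmp/digests
          pattern: digests-*
          merge-multiple: true
      - uses: docker/setup-buildx-action@v3
      - name: Create and push the multi-arch manifest
        working-directory: /tmp/digests
        run: |
          # Each file name in this directory is one per-arch digest.
          docker buildx imagetools create -t "${IMAGE}:latest" \
            $(printf "${IMAGE}@sha256:%s " *)
```

The merge job exists because images pushed by digest carry no tag; the tagged multi-arch manifest is only created once all per-arch digests are known.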

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| .github/workflows/template-publish-image/publish-images.yaml | New reusable workflow that builds per-arch images, optionally pushes, and uploads digests for manifest merging. |
| .github/workflows/template-publish-image/action.yaml | Removed the old composite action implementation. |
| .github/workflows/build-and-push-images.yaml | Switched to distributed builds across x86/arm runners and added a manifest-merge job. |

Comment thread .github/workflows/publish-images.yaml
Comment thread .github/workflows/publish-images.yaml
Comment thread .github/workflows/publish-images.yaml Outdated
Comment thread .github/workflows/build-and-push-images.yaml
@jaiakash jaiakash force-pushed the support-arm-container branch from 1e36d16 to 96f6e02 Compare April 29, 2026 20:19
@jaiakash
Member Author

jaiakash commented Apr 29, 2026

Hi, I have updated the PR with the latest changes. Here are a few data points.

Sample run

A sample run on my fork can be seen here: https://github.com/jaiakash/trainer/actions/runs/25131769251/job/73659612821
I used GitHub's native Ubuntu and Ubuntu ARM runners for testing.

Example Images

dataset-initializer

https://hub.docker.com/r/akashjaiswal03/dataset-initializer/tags
https://github.com/jaiakash/trainer/pkgs/container/trainer%2Fdataset-initializer

model-initializer
https://hub.docker.com/r/akashjaiswal03/model-initializer/tags
https://github.com/jaiakash/trainer/pkgs/container/trainer%2Fmodel-initializer

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines +31 to +36
secrets:
  DOCKERHUB_USERNAME:
    required: true
  DOCKERHUB_TOKEN:
    required: true

Copilot AI Apr 29, 2026

The reusable workflow marks the Docker Hub secrets as required, which will cause pull_request runs (especially from forks) to fail even though the Docker Hub login steps are skipped. Make these secrets optional, and run the Docker Hub login/push steps only when the secrets are present and publishing is intended.
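
One way the suggestion above could look is sketched below. This is an assumption-laden fragment, not the PR's code: the job name `build` and the login step are illustrative, and `SHOULD_PUBLISH` mirrors the env variable quoted elsewhere in this review. Because secrets cannot be referenced directly in an `if:` conditional, the sketch exposes the username as a job-level env variable and tests that instead.

```yaml
# Hypothetical reusable-workflow fragment; job and step names are placeholders.
on:
  workflow_call:
    secrets:
      DOCKERHUB_USERNAME:
        required: false
      DOCKERHUB_TOKEN:
        required: false

jobs:
  build:
    runs-on: ubuntu-latest
    env:
      # Expose the secret as an env var so it can be tested in `if:`.
      DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
      SHOULD_PUBLISH: ${{ github.event_name == 'push' }}
    steps:
      - name: Login to Docker Hub
        # Runs only when publishing is intended and the secret is present.
        if: env.SHOULD_PUBLISH == 'true' && env.DOCKERHUB_USERNAME != ''
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
```

With this shape, a pull_request run from a fork (where the secrets are empty) skips the login step instead of failing at workflow validation.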

Comment on lines +101 to 125
merge-manifests:
  needs: build-and-publish
  if: always()
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false
    matrix:
      component:
        - name: trainer-controller-manager
        - name: model-initializer
        - name: dataset-initializer
        - name: deepspeed-runtime
        - name: xgboost-runtime
        - name: mlx-runtime
        - name: torchtune-trainer
        - name: data-cache

  steps:
    - name: Checkout
      uses: actions/checkout@v6
    - name: Download digests
      uses: actions/download-artifact@v4
      with:
        path: /tmp/digests
        pattern: digests-${{ matrix.component.name }}-*
        merge-multiple: true

Copilot AI Apr 29, 2026

The merge-manifests job is configured with if: always(), but on pull_request runs no digest artifacts are uploaded, so download-artifact and the subsequent imagetools command will fail. Gate this job (and the registry login steps) so that it runs only when publishing (e.g., github.event_name == 'push'), and ideally only when the build matrix succeeded.
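
A minimal sketch of the gating described above, assuming the job names quoted in the diff (`build-and-publish`, `merge-manifests`). It relies on documented GitHub Actions behavior: when an `if:` condition contains no status check function such as `always()`, `success()` is applied implicitly, so the job is skipped if a job it needs has failed.

```yaml
# Hypothetical fragment; only the `if:` line differs from the quoted diff.
jobs:
  merge-manifests:
    needs: build-and-publish
    # No always(): the implicit success() check skips this job when the
    # build matrix fails, and the event check skips it on pull_request runs.
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - name: Download digests
        uses: actions/download-artifact@v4
        with:
          path: /tmp/digests
          pattern: digests-*
          merge-multiple: true
```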

Comment on lines +78 to +93
        exclude:
          - component:
              name: mlx-runtime
            arch:
              name: arm64

          - component:
              name: torchtune-trainer
            arch:
              name: arm64

    uses: ./.github/workflows/publish-images.yaml
    with:
      component-name: ${{ matrix.component.name }}
      dockerfile: ${{ matrix.component.dockerfile }}
      platforms: ${{ matrix.component[matrix.arch.platforms_key] }}

Copilot AI Apr 29, 2026

The matrix exclude entries likely won’t match, because component and arch are object values and the exclude specifies only name (not the full object). This can leave arm64 jobs for components without arm_platforms and cause ${{ matrix.component[matrix.arch.platforms_key] }} to evaluate to null or fail. Consider restructuring the matrix to enumerate valid (component, arch) pairs via include, or ensure all required object fields are present in the exclude match.
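
The include-based alternative mentioned above could be sketched roughly as follows. This is a hypothetical fragment: the Dockerfile paths are placeholders, the `runner` input is an assumed addition to the reusable workflow that the PR does not confirm, and the runner labels are copied from the PR description and the quoted diff. Only two components are shown.

```yaml
# Hypothetical sketch: every (component, arch) pair is listed explicitly,
# so no exclude rules are needed.
jobs:
  build-and-publish:
    strategy:
      fail-fast: false
      matrix:
        include:
          - component: model-initializer
            dockerfile: path/to/model-initializer/Dockerfile  # placeholder path
            platforms: linux/amd64
            runner: oracle-vm-16cpu-64gb-x86-64
          - component: model-initializer
            dockerfile: path/to/model-initializer/Dockerfile  # placeholder path
            platforms: linux/arm64
            runner: oracle-16cpu-64gb-arm64
          # amd64 only: no arm64 entry is listed, so no arm64 job is created.
          - component: torchtune-trainer
            dockerfile: path/to/torchtune-trainer/Dockerfile  # placeholder path
            platforms: linux/amd64
            runner: oracle-vm-16cpu-64gb-x86-64
    uses: ./.github/workflows/publish-images.yaml
    with:
      component-name: ${{ matrix.component }}
      dockerfile: ${{ matrix.dockerfile }}
      platforms: ${{ matrix.platforms }}
      runner: ${{ matrix.runner }}  # assumed input, not confirmed by the PR
```

The trade-off is verbosity: the list grows with every component, but each job's inputs are plain strings and no lookup such as `matrix.component[matrix.arch.platforms_key]` can come back empty.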

@jaiakash
Member Author

/retest

Member

@andreyvelich andreyvelich left a comment

@jaiakash Any thoughts on why CI is failing?

# Each invocation builds a single platform natively on the specified runner,
# pushes by digest (no tag), and uploads the digest as an artifact.
# The caller workflow assembles the multi-arch manifest in a separate merge job.
name: Build And Publish Container Images
Member

Do we want to keep this in the template-publish-image/action.yaml?

Comment on lines +58 to +59
# TODO: (jaiakash) pytorch/pytorch CUDA base image does not support arm64
# Check this https://github.com/orgs/pytorch/packages/container/pytorch/574847194?tag=2.9.1-cuda12.8-cudnn9-runtime
Member

x86_platforms: linux/amd64
arm_platforms: linux/arm64

# TODO (andreyvelich): mlx[cuda] doesn't support arm at the moment: https://github.com/ml-explore/mlx/issues/2469
Member

@jaiakash As far as I can see, mlx[cuda13] supports arm, so we can add it here.

cc @zcbenz

DOCKERHUB_USERNAME: ${{ secrets.DOCKERHUB_USERNAME }}
DOCKERHUB_TOKEN: ${{ secrets.DOCKERHUB_TOKEN }}

merge-manifests:
Member

Why do we need to merge manifests?

runs-on: oracle-vm-16cpu-64gb-x86-64

env:
  SHOULD_PUBLISH: ${{ github.event_name == 'push' }}
Member

We still need to keep this env to ensure we don't push images on pull_requests.

@jaiakash
Member Author

jaiakash commented Apr 30, 2026

All arm-based CI jobs are failing. I will review this ASAP.

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
jaiakash added 3 commits May 5, 2026 22:21
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
… dirs

Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
@jaiakash jaiakash force-pushed the support-arm-container branch from 96f6e02 to 529dfb6 Compare May 5, 2026 17:23
@jaiakash
Member Author

jaiakash commented May 5, 2026

/retest

@andreyvelich
Member

@jaiakash Did you get a chance to address comments?

Development

Successfully merging this pull request may close these issues.

Leverage GitHub action arm64 runner

5 participants