Test torchprime from PyTorch/XLA #9152


Open · tengyifei wants to merge 10 commits into master from yifeit/torchprime-ci-3
Conversation

@tengyifei (Collaborator) commented on May 13, 2025

This PR fixes AI-Hypercomputer/torchprime#161; see the issue description for motivation and details. While we don't yet have HBM usage checks or loss-curve checks, those could be added later.

This PR adds a new workflow that will be run during pre- and post-submits:

  • Builds a docker image with the wheels from the PR/commit
  • Pushes the docker image to a temporary registry URL
  • Triggers torchprime E2E tests with this docker image
  • Waits for the result of the E2E test

This will let us see whether PyTorch/XLA PRs break model training in torchprime. I intend to run this check as FYI-only for a while; once it proves stable, we could promote it to "required" to gate the landing of PRs.
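The four steps above could be sketched as a GitHub Actions workflow roughly like the following. This is an illustrative assumption, not the actual file in this PR: the job name, registry host (`gcr.io`), image name, and the torchprime workflow file name (`e2e_test.yml`) are all hypothetical, while `DOCKER_PROJECT`, `steps.random_tag.outputs.uuid`, and the `convictional/trigger-workflow-and-wait` action come from the snippets discussed below.

```yaml
# Hypothetical sketch of the pre/post-submit workflow described above.
name: Torchprime E2E
on:
  push:
    branches: [master]   # post-submit
  pull_request:          # pre-submit

jobs:
  torchprime-e2e:
    runs-on: ubuntu-latest
    env:
      DOCKER_PROJECT: tpu-pytorch
    steps:
      - uses: actions/checkout@v4
      - name: Generate a random image tag
        id: random_tag
        run: echo "uuid=$(uuidgen)" >> "$GITHUB_OUTPUT"
      # 1. Build a docker image with the wheels from the PR/commit
      - name: Build image
        run: docker build -t "gcr.io/$DOCKER_PROJECT/xla:${{ steps.random_tag.outputs.uuid }}" .
      # 2. Push the docker image to a temporary registry URL
      - name: Push image
        run: docker push "gcr.io/$DOCKER_PROJECT/xla:${{ steps.random_tag.outputs.uuid }}"
      # 3 & 4. Trigger the torchprime E2E tests and wait for the result
      - uses: convictional/[email protected]
        with:
          owner: AI-Hypercomputer
          repo: torchprime
          workflow_file_name: e2e_test.yml   # hypothetical file name
          wait_workflow: true
```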

@tengyifei tengyifei changed the title Yifeit/torchprime ci 3 wip: test May 13, 2025
@tengyifei tengyifei force-pushed the yifeit/torchprime-ci-3 branch 6 times, most recently from 3dad488 to 9be6f35 Compare May 13, 2025 17:49
@tengyifei tengyifei force-pushed the yifeit/torchprime-ci-3 branch from 8c05962 to 96c6541 Compare May 20, 2025 16:40
@tengyifei tengyifei changed the title wip: test Test torchprime from PyTorch/XLA May 22, 2025
@tengyifei tengyifei marked this pull request as ready for review May 22, 2025 00:19
@zhanyong-wan (Collaborator) left a comment


Awesome!

@@ -0,0 +1,23 @@
# syntax=docker/dockerfile:1.4

Document the purpose of this dockerfile (e.g. make it clear this is just for torchprime testing, not used by torchprime itself in its normal usage)?


Consider renaming this to torchprime_e2e_test.Dockerfile to make the purpose clear?

FROM python:${python_version}-${debian_version} AS release

WORKDIR /tmp/wheels
COPY ./*.whl ./
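A release stage like this would typically go on to install the copied wheels. A minimal sketch of what that could look like, assuming `pip` is the installer; this is not necessarily the exact content of the Dockerfile in this PR:

```dockerfile
# Install the PyTorch/XLA wheels copied into /tmp/wheels above,
# then remove them to keep the image small. (Illustrative sketch.)
RUN pip install /tmp/wheels/*.whl && rm -rf /tmp/wheels
```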

Document what these wheels are?

@@ -0,0 +1,63 @@
#!/bin/bash

Consider renaming to publish_torchprime_e2e_test_docker.sh for clarity?

sudo usermod -aG docker $USER
newgrp docker
shell: bash
# Googlers: if this fails, follow http://shortn/_61iSj31q1b to debug.

Nit: use a go/ link?

DOCKER_IMAGE_TAG: ${{ steps.random_tag.outputs.uuid }}
DOCKER_PROJECT: tpu-pytorch
# Trigger torchprime E2E test workflow
- uses: convictional/[email protected]

Can we document how to debug this workflow and get help if needed?


Who provisions for this workflow? Does it have enough capacity for our needs?
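For reference, the `with:` inputs of the `convictional/trigger-workflow-and-wait` step shown above might be filled in roughly like this. The owner/repo, the workflow file name, and the secret name are assumptions for illustration, not the PR's actual values; `DOCKER_PROJECT` and `DOCKER_IMAGE_TAG` come from the env block above:

```yaml
- uses: convictional/[email protected]
  with:
    owner: AI-Hypercomputer            # assumed target org
    repo: torchprime
    github_token: ${{ secrets.TORCH_XLA_BOT_TOKEN }}  # hypothetical secret name
    workflow_file_name: e2e_test.yml                  # hypothetical workflow file
    wait_workflow: true                # block until the E2E run completes
    client_payload: '{"docker_image": "${{ env.DOCKER_PROJECT }}:${{ env.DOCKER_IMAGE_TAG }}"}'
```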

Successfully merging this pull request may close these issues.

Test torchprime from PyTorch/XLA CI