Test torchprime from PyTorch/XLA #9152


Open · tengyifei wants to merge 10 commits into master from yifeit/torchprime-ci-3
Conversation

@tengyifei (Collaborator) commented on May 13, 2025

This PR fixes AI-Hypercomputer/torchprime#161; see the issue description for motivation and details. While we don't yet have HBM usage checks or loss-curve checks, those could be added later.

This PR adds a new workflow that will be run during pre- and post-submits:

  • Builds a docker image with the wheels from the PR/commit
  • Pushes the docker image to a temporary registry URL
  • Triggers torchprime E2E tests with this docker image
  • Waits for the result of the E2E test

This will let us see whether PyTorch/XLA PRs break model training in torchprime. I intend to run this check as FYI-only for a while; once it proves stable, we could promote it to "required" to gate the landing of PRs.
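The four steps above could be sketched as a GitHub Actions workflow roughly like the following. This is an illustrative assumption, not the actual file in this PR: the job name, registry host (`gcr.io`), image name, and the torchprime workflow file name (`e2e_test.yml`) are all hypothetical, while `DOCKER_PROJECT`, `steps.random_tag.outputs.uuid`, and the `convictional/trigger-workflow-and-wait` action come from the snippets discussed below.

```yaml
# Hypothetical sketch of the pre/post-submit workflow described above.
name: Torchprime E2E
on:
  push:
    branches: [master]   # post-submit
  pull_request:          # pre-submit

jobs:
  torchprime-e2e:
    runs-on: ubuntu-latest
    env:
      DOCKER_PROJECT: tpu-pytorch
    steps:
      - uses: actions/checkout@v4
      - name: Generate a random image tag
        id: random_tag
        run: echo "uuid=$(uuidgen)" >> "$GITHUB_OUTPUT"
      # 1. Build a docker image with the wheels from the PR/commit
      - name: Build image
        run: docker build -t "gcr.io/$DOCKER_PROJECT/xla:${{ steps.random_tag.outputs.uuid }}" .
      # 2. Push the docker image to a temporary registry URL
      - name: Push image
        run: docker push "gcr.io/$DOCKER_PROJECT/xla:${{ steps.random_tag.outputs.uuid }}"
      # 3 & 4. Trigger the torchprime E2E tests and wait for the result
      - uses: convictional/[email protected]
        with:
          owner: AI-Hypercomputer
          repo: torchprime
          workflow_file_name: e2e_test.yml   # hypothetical file name
          wait_workflow: true
```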

@tengyifei tengyifei changed the title Yifeit/torchprime ci 3 wip: test May 13, 2025
@tengyifei tengyifei force-pushed the yifeit/torchprime-ci-3 branch 6 times, most recently from 3dad488 to 9be6f35 Compare May 13, 2025 17:49
@tengyifei tengyifei force-pushed the yifeit/torchprime-ci-3 branch from 8c05962 to 96c6541 Compare May 20, 2025 16:40
@tengyifei tengyifei changed the title wip: test Test torchprime from PyTorch/XLA May 22, 2025
@tengyifei tengyifei marked this pull request as ready for review May 22, 2025 00:19
@zhanyong-wan (Collaborator) left a comment


Awesome!

@@ -0,0 +1,23 @@
# syntax=docker/dockerfile:1.4

Document the purpose of this dockerfile (e.g. make it clear this is just for torchprime testing, not used by torchprime itself in its normal usage)?


Consider renaming this to torchprime_e2e_test.Dockerfile to make the purpose clear?

FROM python:${python_version}-${debian_version} AS release

WORKDIR /tmp/wheels
COPY ./*.whl ./
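A release stage like this would typically go on to install the copied wheels. A minimal sketch of what that could look like, assuming `pip` is the installer; this is not necessarily the exact content of the Dockerfile in this PR:

```dockerfile
# Install the PyTorch/XLA wheels copied into /tmp/wheels above,
# then remove them to keep the image small. (Illustrative sketch.)
RUN pip install /tmp/wheels/*.whl && rm -rf /tmp/wheels
```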

Document what these wheels are?

@@ -0,0 +1,63 @@
#!/bin/bash

Consider renaming to publish_torchprime_e2e_test_docker.sh for clarity?

sudo usermod -aG docker $USER
newgrp docker
shell: bash
# Googlers: if this fails, follow http://shortn/_61iSj31q1b to debug.

Nit: use a go/ link?

DOCKER_IMAGE_TAG: ${{ steps.random_tag.outputs.uuid }}
DOCKER_PROJECT: tpu-pytorch
# Trigger torchprime E2E test workflow
- uses: convictional/[email protected]

Can we document how to debug this workflow and get help if needed?


Who provisions for this workflow? Does it have enough capacity for our needs?
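For reference, the `with:` inputs of the `convictional/trigger-workflow-and-wait` step shown above might be filled in roughly like this. The owner/repo, the workflow file name, and the secret name are assumptions for illustration, not the PR's actual values; `DOCKER_PROJECT` and `DOCKER_IMAGE_TAG` come from the env block above:

```yaml
- uses: convictional/[email protected]
  with:
    owner: AI-Hypercomputer            # assumed target org
    repo: torchprime
    github_token: ${{ secrets.TORCH_XLA_BOT_TOKEN }}  # hypothetical secret name
    workflow_file_name: e2e_test.yml                  # hypothetical workflow file
    wait_workflow: true                # block until the E2E run completes
    client_payload: '{"docker_image": "${{ env.DOCKER_PROJECT }}:${{ env.DOCKER_IMAGE_TAG }}"}'
```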

Successfully merging this pull request may close these issues.

Test torchprime from PyTorch/XLA CI