In the past, issues like #119 and #127 were silently introduced into PyTorch/XLA and ended up affecting model performance in torchprime. That's because the existing PyTorch/XLA Airflow CI runs in a post-submit fashion, and cannot be used to gate the landing of PRs that introduce regressions.
torchprime has E2E tests that train Llama and Mixtral on v6e-4, and already runs these tests on every change. We propose to run these E2E tests on every PyTorch/XLA PR as well, and check for the following:
- **Loss is not NaN:** ideally, we should also check that the loss roughly follows a known curve, to catch surprising numerics bugs.
- **TPU HBM usage does not regress:** this value can be obtained from the profile; we need to automate extracting it.
- **Step duration does not regress:** we should measure baseline performance for each model on v6e-4 and verify it does not regress.
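The checks above could be sketched as a small post-run gate in the CI job. This is a minimal illustration, not torchprime's actual implementation: the baseline numbers, tolerance, and the `check_run` helper are all hypothetical, and real baselines would be measured per model on v6e-4.

```python
import math

# Hypothetical baselines for illustration only; real values would be
# measured per model on v6e-4 and stored alongside the CI config.
BASELINES = {
    "llama": {"max_hbm_gib": 20.0, "max_step_seconds": 2.5},
    "mixtral": {"max_hbm_gib": 24.0, "max_step_seconds": 3.0},
}

TOLERANCE = 1.05  # allow ~5% noise before flagging a regression


def check_run(model: str, final_loss: float,
              hbm_gib: float, step_seconds: float) -> list[str]:
    """Return failure messages for one E2E training run (empty = pass)."""
    failures = []
    # Loss must be a finite number, catching NaN/Inf blowups.
    if math.isnan(final_loss) or math.isinf(final_loss):
        failures.append(f"{model}: loss is not finite ({final_loss})")
    baseline = BASELINES[model]
    # HBM usage and step time must stay within tolerance of the baseline.
    if hbm_gib > baseline["max_hbm_gib"] * TOLERANCE:
        failures.append(f"{model}: HBM usage regressed ({hbm_gib:.1f} GiB)")
    if step_seconds > baseline["max_step_seconds"] * TOLERANCE:
        failures.append(f"{model}: step duration regressed ({step_seconds:.2f} s)")
    return failures
```

The CI job would fail the PR if `check_run` returns any messages for either model.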
The E2E tests currently take ~30 minutes to complete, while the PyTorch/XLA TPU tests take >1 hour, so running them in parallel would not add to the tail CI latency.