In the past, issues like #119 and #127 were silently introduced into PyTorch/XLA and ended up affecting model performance in torchprime. That's because the existing PyTorch/XLA Airflow CI runs in a post-submit fashion, and cannot be used to gate the landing of PRs that introduce regressions.
torchprime has E2E tests that train Llama and Mixtral on v6e-4, and already runs these tests on every change. We propose to run these E2E tests on every PyTorch/XLA PR as well, and check for the following:
- **Loss is not NaN:** ideally, we should also check that the loss roughly follows a known curve, to catch surprising numerics bugs.
- **TPU HBM usage does not regress:** this value can be obtained from the profile; we need to automate extracting it.
- **Step duration does not regress:** we should measure baseline performance for each model on v6e-4 and verify it does not regress.
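The checks above could be sketched as a small post-run gate in the CI job. This is a minimal illustration, not torchprime's actual implementation: the baseline numbers, tolerance, and the `check_run` helper are all hypothetical, and real baselines would be measured per model on v6e-4.

```python
import math

# Hypothetical baselines for illustration only; real values would be
# measured per model on v6e-4 and stored alongside the CI config.
BASELINES = {
    "llama": {"max_hbm_gib": 20.0, "max_step_seconds": 2.5},
    "mixtral": {"max_hbm_gib": 24.0, "max_step_seconds": 3.0},
}

TOLERANCE = 1.05  # allow ~5% noise before flagging a regression


def check_run(model: str, final_loss: float,
              hbm_gib: float, step_seconds: float) -> list[str]:
    """Return failure messages for one E2E training run (empty = pass)."""
    failures = []
    # Loss must be a finite number, catching NaN/Inf blowups.
    if math.isnan(final_loss) or math.isinf(final_loss):
        failures.append(f"{model}: loss is not finite ({final_loss})")
    baseline = BASELINES[model]
    # HBM usage and step time must stay within tolerance of the baseline.
    if hbm_gib > baseline["max_hbm_gib"] * TOLERANCE:
        failures.append(f"{model}: HBM usage regressed ({hbm_gib:.1f} GiB)")
    if step_seconds > baseline["max_step_seconds"] * TOLERANCE:
        failures.append(f"{model}: step duration regressed ({step_seconds:.2f} s)")
    return failures
```

The CI job would fail the PR if `check_run` returns any messages for either model.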
The E2E tests currently take ~30 minutes to complete, while the PyTorch/XLA TPU tests take >1 hour, so running them in parallel would not add to the tail CI latency.