
[RFC] Add HuggingFace tests with pinned dependencies to CI #8542

Closed
GoogleCloudPlatform/ml-auto-solutions
#548
@tengyifei

Description


🚀 Feature

This RFC proposes adding a number of tests to the PyTorch/XLA CI that exercise
the combination of torch_xla and Hugging Face libraries.

Motivation

Testing against our customers' code ensures that we do not break common user
workflows.

Pitch

Historically, PyTorch/XLA CI had some HuggingFace tests that installed the latest
versions of transformers, diffusers, and accelerate from the main branch of
their respective git repositories. That caused test breakage whenever HuggingFace
introduced backwards-incompatible changes. To prevent those issues, we'll pin
HuggingFace libraries to a fixed version when running the tests.

In principle, we should pin every other package that may affect training,
such as numpy. However, torch_xla and torch themselves also depend on a
number of Python libraries, such as numpy and networkx. Therefore we'll keep
the list of pinned packages small to start with; we can always grow it later if
a particular package becomes problematic.

List of tests

We propose these tests, which are a slight variation of the existing tests removed
in 3.

| Name | Type | Test in nightly? | Test in RC? | Notes |
| --- | --- | --- | --- | --- |
| Llama 2 7B training | Example | Yes (already exists) | Yes (already exists) | Tests the `llama2-google-next-training` branch in the pytorch-tpu fork of HF transformers |
| SD2 training | Example | New addition | New addition | Tests the `main` branch in the pytorch-tpu fork of HF diffusers |
| accelerate test | Smoke test | Add back | Add back | See note #1. |
| bert | Example | Add back | Add back | Exercises our own test (`pytorch/xla/test/pjrt/test_train_hf_transformer.py`), so we should run it |
| diffusers | Example | Remove | Remove | Trains stable-diffusion-v1; replaced by the planned SD2 training test |

The SD2 training test will be added referencing the recipe in tpu-recipes 2.

Note #1: the accelerate test broke for a few weeks and we suspected upstream
Hugging Face changes. After I filed 4, it turned out that this was
actually a case of PyTorch/XLA changes 5 breaking Hugging Face. When we add back
this test, we should work around the breakage.

Note #2: during local testing, the bert test has a race condition at the end,
causing an OSError: handle is closed. That also looks like a legitimate error
stemming from incorrect multiprocessing usage.
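As a generic illustration (not the bert test itself), the pattern that avoids this class of shutdown race is to close and join worker pools explicitly, rather than letting garbage collection race interpreter shutdown; the worker function and pool size here are illustrative:

```python
# Minimal sketch of multiprocessing hygiene that avoids "OSError: handle is
# closed" at interpreter exit. This is a generic example, not the bert test.
import multiprocessing as mp


def square(x):
    return x * x


def main():
    # The context manager guarantees the pool is terminated and its worker
    # handles torn down in order, before interpreter shutdown finalizes the
    # underlying pipes out from under a still-running helper thread.
    with mp.Pool(processes=2) as pool:
        results = pool.map(square, range(8))
    return results


if __name__ == "__main__":
    print(main())  # prints [0, 1, 4, 9, 16, 25, 36, 49]
```

Leaving a `Pool` to be cleaned up implicitly at exit is exactly the kind of usage that can surface as a sporadic `handle is closed` error at the end of a test run.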

Initial pinned versions

Based on local testing, I've narrowed down to the following versions that work for
the above tests:

```
accelerate==1.2.1
datasets==3.2.0
evaluate==0.4.3
huggingface-hub==0.27.1
safetensors==0.5.0
tokenizers==0.19.1
```

We'll check this file in as a pip-constraints.txt (constraint file 1) in
https://github.com/GoogleCloudPlatform/ml-auto-solutions, so that whenever a
HuggingFace library is installed, it is constrained to one of the tested
versions. This file will be shared by all tests in the list above.

transformers will be installed from
https://github.com/pytorch-tpu/transformers/tree/llama2-google-next-training
and diffusers will be installed from
https://github.com/pytorch-tpu/diffusers/tree/main. If we don't touch these
branches, then they will also be effectively pinned.
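To make the pinning mechanism concrete, here is a sketch of how CI could consume the constraint file; the file contents are copied from the versions above, and the install commands (shown as comments) are illustrative:

```shell
# Illustrative sketch: recreate the pip-constraints.txt described in this RFC.
cat > pip-constraints.txt <<'EOF'
accelerate==1.2.1
datasets==3.2.0
evaluate==0.4.3
huggingface-hub==0.27.1
safetensors==0.5.0
tokenizers==0.19.1
EOF

# Every HuggingFace install in CI would then pass -c, so any resolved version
# of a listed package must match its pin (commands not executed here):
#   pip install -c pip-constraints.txt accelerate evaluate datasets
#   pip install -c pip-constraints.txt \
#     "git+https://github.com/pytorch-tpu/transformers@llama2-google-next-training"

# Sanity check: every line in the file is an exact pin.
grep -c '==' pip-constraints.txt  # prints 6
```

Unlike a requirements file, a constraint file does not install anything by itself; it only restricts versions when one of the listed packages happens to be installed, which is why a single shared file works for all the tests above.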

What to do if a test fails?

We should prioritize reverting the offending PR if a change in torch_xla
broke the HuggingFace tests.

Alternatives

It's also worth testing tip-of-tree versions of HuggingFace libraries against
stable versions of torch_xla. That ensures HuggingFace does not introduce
new breakages during their development cycle. We should work with the HuggingFace
team to help them set up those tests on their end. That can be done independently
of this proposal.

Additional context

We had some HuggingFace tests for a while, but they frequently broke due to the
lack of version pinning and were removed in 3.
