[Bugfix][CI] Retry cached HF tokenizer load after transport failures#44820

Open
AndreasKaratzas wants to merge 6 commits into vllm-project:main from ROCm:akaratza_retry_hf

@AndreasKaratzas

Member

Some CI builds have failed (on ROCm at least) during engine startup. For example, this quantization test fails:

tests/kernels/quantization/test_triton_scaled_mm.py::test_rocm_compressed_tensors_w8a8[10-32-neuralmagic/Llama-3.2-1B-quantized.w8a8]

This is not a kernel issue. Tokenizer construction made a live Hugging Face Hub metadata request and hit:

httpx.RemoteProtocolError: Server disconnected without sending a response.

So this PR makes startup tolerate a transient Hub transport failure when the tokenizer files are already complete in the local cache. If the cache is incomplete, startup still fails with a clear error.
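
The intended behavior can be sketched roughly as follows. This is an illustration only, not the code in this PR: `TransportError` is a hypothetical stand-in for a transient error such as `httpx.RemoteProtocolError`, `load_with_cache_fallback` is a made-up name, and the loader is assumed to accept `local_files_only` the way HF loaders do.

```python
from typing import Any, Callable


class TransportError(Exception):
    """Hypothetical stand-in for a transient error like httpx.RemoteProtocolError."""


def load_with_cache_fallback(
    load: Callable[..., Any], repo_id: str, **kwargs: Any
) -> Any:
    """Try a normal load; on a transport error, retry from the local cache only."""
    try:
        return load(repo_id, **kwargs)
    except TransportError as online_error:
        try:
            # Assumption: the loader understands local_files_only=True.
            return load(repo_id, **{**kwargs, "local_files_only": True})
        except Exception as cache_error:
            # Incomplete cache: fail with a message that names both causes.
            raise RuntimeError(
                f"Hub request for {repo_id!r} failed ({online_error}) and the "
                f"local cache could not satisfy the load ({cache_error})"
            ) from cache_error
```

Only the transport error triggers the fallback here; any other exception from the first call propagates unchanged.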

cc @kenroche

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review June 8, 2026 02:33

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the bug Something isn't working label Jun 8, 2026
@DarkLight1337 DarkLight1337 requested a review from hmellor June 8, 2026 04:05

@hmellor hmellor left a comment

Member

I like the idea, but I'd like to suggest a different implementation.

# vllm/transformers_utils/repo_utils.py

@contextmanager
def retry_with_local_files_only_in_ci(
    func: Callable[..., _R],
) -> Iterator[Callable[..., _R]]:
    """
    Wrap a function to retry with `local_files_only=True` if it fails in CI environment.
    """

    def wrapper(*args, **kwargs) -> _R:
        try:
            return func(*args, **kwargs)
        except Exception as e:
            if not os.environ.get("CI"):
                raise
            logger.warning(
                "Call to %s failed in CI; retrying with local_files_only=True: %s",
                getattr(func, "__qualname__", func),
                e,
            )
            kwargs["local_files_only"] = True
            return func(*args, **kwargs)

    yield wrapper

which would be used as follows:

            with retry_with_local_files_only_in_ci(AutoTokenizer.from_pretrained) as from_pretrained:
                tokenizer = from_pretrained(
                    path_or_repo_id,
                    *args,
                    trust_remote_code=trust_remote_code,
                    revision=revision,
                    cache_dir=download_dir,
                    **kwargs,
                )

this could then:

  • be reused in other places where these timeouts occur
  • only have an effect in CI

@hmellor

hmellor commented Jun 9, 2026

Member

Or, even more generally, a retry_with_kwargs_in_ci helper, so that we can use it for interfaces that use different kwargs for offline mode.

@mergify mergify Bot added the performance Performance-related issues label Jun 11, 2026
@AndreasKaratzas

Member Author

  • Added a reusable retry_with_kwargs_in_ci helper in repo_utils.py.
  • Used it in HF tokenizer loading so that CI retries once with local_files_only=True after a failed Hub call.
  • Made MT-Bench opt into checking the latest HF dataset revision and force a redownload when that revision is not present in the local dataset cache (an adjacent problem in CI).

cc @hmellor

Comment thread vllm/transformers_utils/repo_utils.py Outdated
def retry_with_kwargs_in_ci(
    func: Callable[..., _R],
    **retry_kwargs: Any,
) -> Iterator[Callable[..., _R]]:

@hmellor hmellor Jun 19, 2026

Member

Why iterator? Does this not just return Callable[..., _R]?

Member Author

Yep, you're right. I removed the context manager shape and now return the wrapped callable directly.
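
For readers following along, a helper of that shape might look like the sketch below. It is based on the context-manager version suggested earlier in this thread, not on the merged code; the logger setup and the way `retry_kwargs` is merged into the call are assumptions, and it leaves out the "kwargs already applied" check discussed in the next review thread.

```python
import logging
import os
from typing import Any, Callable, TypeVar

_R = TypeVar("_R")
logger = logging.getLogger(__name__)


def retry_with_kwargs_in_ci(
    func: Callable[..., _R],
    **retry_kwargs: Any,
) -> Callable[..., _R]:
    """Return a wrapper that, in CI only, retries once with retry_kwargs applied."""

    def wrapper(*args: Any, **kwargs: Any) -> _R:
        try:
            return func(*args, **kwargs)
        except Exception as e:
            # Outside CI the original exception propagates untouched.
            if not os.environ.get("CI"):
                raise
            logger.warning(
                "Call to %s failed in CI; retrying with %s: %s",
                getattr(func, "__qualname__", func),
                retry_kwargs,
                e,
            )
            # Retry exactly once, overriding any conflicting caller kwargs.
            return func(*args, **{**kwargs, **retry_kwargs})

    return wrapper
```

With this shape the call site no longer needs a `with` block: something like `from_pretrained = retry_with_kwargs_in_ci(AutoTokenizer.from_pretrained, local_files_only=True)` yields a callable that is used exactly like the original.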

Comment thread vllm/transformers_utils/repo_utils.py Outdated
        except Exception as e:
            if not os.environ.get("CI"):
                raise
            if all(kwargs.get(key) == value for key, value in retry_kwargs.items()):

Member

This check gives a false match if retry_kwargs ever sets a key to None and that key is not in kwargs, because kwargs.get(key) also returns None for a missing key.

Member Author

Changed the check to require the key to be present before comparing, so missing keys no longer match None.
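
The difference between the two checks is easy to see in isolation. The snippet below is a minimal sketch; `retry_kwargs_already_applied` is a hypothetical name, and the `revision=None` value is only an example of a retry kwarg that is set to None.

```python
from typing import Any, Mapping


def retry_kwargs_already_applied(
    kwargs: Mapping[str, Any], retry_kwargs: Mapping[str, Any]
) -> bool:
    """True only if every retry kwarg is explicitly present with the same value."""
    return all(
        key in kwargs and kwargs[key] == value
        for key, value in retry_kwargs.items()
    )


kwargs: dict = {}                    # caller passed nothing
retry_kwargs = {"revision": None}    # retry wants revision=None

# Original form: a missing key looks like None, so this reports a match.
naive = all(kwargs.get(k) == v for k, v in retry_kwargs.items())
strict = retry_kwargs_already_applied(kwargs, retry_kwargs)

print(naive)   # True
print(strict)  # False
```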

@hmellor hmellor left a comment

Member

Datasets changes seem unrelated?

I hadn't read your comment before reviewing

Member

If you are going to specify the latest revision, there is no need to also pass FORCE_REDOWNLOAD. If the latest revision is already in the cache, re-downloading it just wastes resources and time.

Member Author

You are right. Removed the cache scan and FORCE_REDOWNLOAD.

monkeypatch.delenv("CI", raising=False)
calls = 0

def failing_call():

Member

This test would fail because failing_call doesn't accept kwargs, rather than because of too many calls.

Suggested change
- def failing_call():
+ def failing_call(**kwargs):

Member Author

Done, simplified the test callback to accept **kwargs directly.
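
The reviewer's point can be reproduced without the helper at all. The sketch below assumes the callback ends up being called with a keyword argument such as `local_files_only=True`; `error_type` is a hypothetical utility that only reports which exception was raised.

```python
def failing_call():
    raise RuntimeError("simulated Hub failure")


def failing_call_with_kwargs(**kwargs):
    raise RuntimeError("simulated Hub failure")


def error_type(func, **kwargs) -> str:
    """Call func with kwargs and return the name of the exception it raises."""
    try:
        func(**kwargs)
    except Exception as e:
        return type(e).__name__
    return "no error"


# The zero-argument callback rejects the keyword before its body ever runs.
print(error_type(failing_call, local_files_only=True))              # TypeError
# With **kwargs the simulated failure is what the test actually observes.
print(error_type(failing_call_with_kwargs, local_files_only=True))  # RuntimeError
```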

@AndreasKaratzas

Member Author

I also added an opt-in feature for ROCm CI. The original cache fix only covered the MT-Bench dataset path, but the same stale-cache issue can affect any Hugging Face model or dataset loaded without an explicit revision. The new behavior is scoped to AMD CI: run-amd-test.sh enables VLLM_CI_ENSURE_LATEST_HF_REVISION and passes it into the Docker container, while the default in envs.py remains off. The revision resolution lives in maybe_resolve_latest_hf_revision() in repo_utils.py; the model path calls it from ModelConfig, and dataset loading reuses the same helper. Explicit revisions are preserved, local and offline paths are skipped, and access or not-found Hub errors are not treated as transient cache-fallback cases.
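
The rules in that description can be sketched as a small decision function. Everything in this block is hypothetical and inferred from the prose above, not taken from `maybe_resolve_latest_hf_revision()`: the function name, the `enabled` and `offline` parameters, the injected `lookup_latest` callable, and the `HubAccessError` stand-in are all assumptions.

```python
import os
from typing import Callable, Optional


class HubAccessError(Exception):
    """Hypothetical stand-in for access-denied or not-found Hub errors."""


def resolve_latest_revision(
    repo_or_path: str,
    revision: Optional[str],
    *,
    enabled: bool,
    offline: bool,
    lookup_latest: Callable[[str], str],
) -> Optional[str]:
    """Return the revision to load, or None to keep the default behavior."""
    if revision is not None:
        return revision  # explicit revisions are preserved
    if not enabled or offline or os.path.exists(repo_or_path):
        return None  # opt-in only; local and offline paths are skipped
    try:
        return lookup_latest(repo_or_path)
    except HubAccessError:
        raise  # permanent errors are surfaced, not masked by the cache
    except Exception:
        return None  # assumed: transient errors fall back to the cached copy
```

In this sketch the caller would compute `enabled` from the opt-in flag (for example by reading `VLLM_CI_ENSURE_LATEST_HF_REVISION`), and `lookup_latest` would wrap whatever Hub query returns the newest commit for a repository.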

@hmellor PTAL

Labels

bug Something isn't working ci/build performance Performance-related issues
