Use cached .no_exist entries from the hub cache in cached_path (#7686) #8266
Open
discobot wants to merge 1 commit into
…ngface#7686) load_dataset requests the Hub on every call for files that usually don't exist (.huggingface.yaml, dataset_infos.json), even when huggingface_hub has already cached their non-existence as .no_exist entries. cached_path now consults try_to_load_from_cache before calling hf_hub_download and raises FileNotFoundError right away on a cached non-existence hit. The check is only done for revisions that are commit hashes, since those are immutable, and is skipped with force_download. Fixes huggingface#7686.
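The decision order in the commit message can be sketched roughly as follows. This is not the code from the diff: `resolve_hub_file`, `lookup`, `download` and `NO_EXIST` are hypothetical names, where `lookup` stands in for `try_to_load_from_cache`, `download` for `hf_hub_download`, and `NO_EXIST` for the sentinel that huggingface_hub returns on a cached non-existence hit. The 40-character hex test for "is a commit hash" is also an assumption.

```python
import re
from typing import Callable, Optional

# Assumption: a commit hash is a full 40-character lowercase hex string.
_COMMIT_HASH = re.compile(r"^[0-9a-f]{40}$")

# Hypothetical stand-in for the "cached as non-existent" sentinel.
NO_EXIST = object()


def resolve_hub_file(
    repo_id: str,
    filename: str,
    revision: Optional[str],
    *,
    lookup: Callable[[str, str, str], object],
    download: Callable[[str, str, Optional[str]], str],
    force_download: bool = False,
) -> str:
    """Return a local path, trusting the negative cache only for immutable revisions."""
    is_commit_hash = revision is not None and _COMMIT_HASH.match(revision) is not None
    if is_commit_hash and not force_download:
        # A commit hash never changes, so a recorded miss stays valid.
        if lookup(repo_id, filename, revision) is NO_EXIST:
            raise FileNotFoundError(f"{filename} is cached as missing in {repo_id}@{revision}")
    # Branches, tags and forced downloads always go to the Hub.
    return download(repo_id, filename, revision)
```

Passing the two callables in keeps the sketch free of network access, which also makes each of the three cases (hit, non-sha revision, forced download) easy to exercise with fakes.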
Fixes #7686.
`cached_path` requested the Hub on every call for files that usually don't exist in dataset repos (`.huggingface.yaml`, `dataset_infos.json`) because its `hf://` branch goes straight to `HfApi.hf_hub_download`. It now consults `huggingface_hub.try_to_load_from_cache` first and raises `FileNotFoundError` right away on a cached non-existence hit. This is the same error the `EntryNotFoundError` path raises, so callers are unaffected.

One correction to the issue: the write half already works. `hf_hub_download` has written the `.no_exist/<sha>/<file>` entries on `EntryNotFoundError` since before our minimum pin (checked 0.25.0, 0.33.2 and 1.19.0); they just don't show up in plain `ls` because `.no_exist` is a dot-directory. So only the read side was missing, and the fix mirrors `transformers.utils.hub.cached_files`: the negative cache is only trusted when the resolved revision is a commit hash (immutable), and it is skipped with `force_download=True`.

Running the issue's metadata-counting repro against a small dataset with a fresh `HF_HOME`: 2 missing-file metadata requests on the first `load_dataset` call and 0 on every subsequent one (previously 2 on every call).

Added regression tests in `tests/test_file_utils.py` covering the negative-cache hit (no `hf_hub_download` call), non-sha revisions (still request the Hub), and `force_download` (ignores `.no_exist`).
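For anyone who wants to confirm the write half locally, here is a small sketch for inspecting a repo's cache folder. It relies only on the `.no_exist/<sha>/<file>` layout described above; `visible_names` and `list_no_exist_entries` are hypothetical helpers, not part of `datasets` or `huggingface_hub`, and the path to the repo folder is left for the caller to supply.

```python
import os


def visible_names(repo_cache_dir):
    """Mimic plain `ls`: skip entries whose name starts with a dot."""
    return [n for n in sorted(os.listdir(repo_cache_dir)) if not n.startswith(".")]


def list_no_exist_entries(repo_cache_dir):
    """Return sorted (sha, relative_file) pairs found under .no_exist."""
    root = os.path.join(repo_cache_dir, ".no_exist")
    entries = []
    if not os.path.isdir(root):
        return entries
    for sha in os.listdir(root):
        sha_dir = os.path.join(root, sha)
        if not os.path.isdir(sha_dir):
            continue
        # os.walk also yields dot-files such as .huggingface.yaml.
        for dirpath, _dirs, files in os.walk(sha_dir):
            for name in files:
                rel = os.path.relpath(os.path.join(dirpath, name), sha_dir)
                entries.append((sha, rel.replace(os.sep, "/")))
    return sorted(entries)
```

Calling `visible_names` on a repo folder shows what plain `ls` would show, while `list_no_exist_entries` surfaces the recorded misses per commit hash.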