
Use cached .no_exist entries from the hub cache in cached_path (#7686) #8266

Open
discobot wants to merge 1 commit into huggingface:main from discobot:fix/7686-no-exist-cache


@discobot

Fixes #7686.

cached_path queried the Hub on every call for files that usually don't exist in dataset repos (.huggingface.yaml, dataset_infos.json), because its hf:// branch goes straight to HfApi.hf_hub_download. It now consults huggingface_hub.try_to_load_from_cache first and raises FileNotFoundError immediately on a cached non-existence hit. This is the same error the EntryNotFoundError path raises, so callers are unaffected.

One correction to the issue: the write half already works. hf_hub_download has written the .no_exist/<sha>/<file> entries on EntryNotFoundError since before our minimum pin (checked 0.25.0, 0.33.2 and 1.19.0). They just don't show up in plain ls because .no_exist is a dot-directory. So only the read side was missing. The fix mirrors transformers.utils.hub.cached_files: the negative cache is trusted only when the resolved revision is a commit hash (immutable), and it is skipped with force_download=True.
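For reviewers who want the read-side rule in one place, here is a minimal stdlib-only sketch of the decision described above. It is not the code in this PR: the actual patch delegates the lookup to huggingface_hub.try_to_load_from_cache, whereas this stand-in checks the .no_exist/<sha>/<file> marker on disk directly and assumes the revision has already been resolved. The names `fetch_with_negative_cache`, `list_no_exist_entries` and the `download` callable (standing in for hf_hub_download) are hypothetical.

```python
import re
from pathlib import Path
from typing import Callable, List

# A full 40-character hex commit SHA; branch and tag names do not match.
COMMIT_HASH = re.compile(r"^[0-9a-f]{40}$")


def list_no_exist_entries(repo_cache: Path) -> List[str]:
    """List '<sha>/<file>' markers under the hidden .no_exist directory."""
    no_exist = repo_cache / ".no_exist"
    if not no_exist.is_dir():
        return []
    return sorted(
        p.relative_to(no_exist).as_posix()
        for p in no_exist.rglob("*")
        if p.is_file()
    )


def fetch_with_negative_cache(
    repo_cache: Path,
    revision: str,
    filename: str,
    download: Callable[[str], str],  # hypothetical stand-in for hf_hub_download
    force_download: bool = False,
) -> str:
    """Return a local path, short-circuiting on a trusted negative-cache hit."""
    # Trust the marker only for immutable revisions, and never when the
    # caller explicitly asked for a fresh download.
    trusted = not force_download and COMMIT_HASH.match(revision) is not None
    if trusted and (repo_cache / ".no_exist" / revision / filename).is_file():
        raise FileNotFoundError(
            f"{filename} is cached as missing at revision {revision}"
        )
    # Cache miss, mutable revision, or forced download: go to the Hub.
    return download(filename)
```

Passing a branch name such as "main" as `revision` always falls through to `download` in this sketch, which is the conservative behaviour for a ref that can move. `list_no_exist_entries` is only there to make the hidden markers visible, since a plain ls skips the dot-directory.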

Running the issue's metadata-counting repro against a small dataset with a fresh HF_HOME gives 2 missing-file metadata requests on the first load_dataset call and 0 on every subsequent one (previously 2 on every call).

Added regression tests in tests/test_file_utils.py covering the negative-cache hit (no hf_hub_download call), non-sha revisions (still query the Hub), and force_download (ignores .no_exist).

…ngface#7686)

load_dataset requests the Hub on every call for files that usually don't exist (.huggingface.yaml, dataset_infos.json), even when huggingface_hub has already cached their non-existence as .no_exist entries. cached_path now consults try_to_load_from_cache before calling hf_hub_download and raises FileNotFoundError right away on a cached non-existence hit. The check is only done for revisions that are commit hashes, since those are immutable, and is skipped with force_download. Fixes huggingface#7686.

Development

Successfully merging this pull request may close these issues.

load_dataset does not check .no_exist files in the hub cache
