Fix unstable tokenizer fingerprinting (enables map cache reuse) #7982

Merged
lhoestq merged 5 commits into huggingface:main from KOKOSde:perf/stable-tokenizer-fingerprint on Mar 9, 2026
Conversation

@KOKOSde
Contributor

@KOKOSde KOKOSde commented Feb 2, 2026

Fix unstable dataset fingerprinting when hashing PreTrainedTokenizerFast.

Some tokenizers backed by tokenizers.Tokenizer mutate runtime settings (padding/truncation) when called, which can change the serialized state and make dataset fingerprints unstable. That prevents .map(load_from_cache_file=True) from reusing cache files.

Fix: when hashing, temporarily disable backend padding/truncation so runtime settings don’t affect the fingerprint, then restore the original settings.

Includes a regression test showing Hasher.hash(tokenizer) stays stable after calling the tokenizer.
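The failure mode and the workaround can be sketched independently of the real transformers API, using a stand-in tokenizer whose serialized state includes mutable runtime settings (all class and function names below are illustrative, not the actual implementation):

```python
import hashlib
import json
from contextlib import contextmanager

class FakeTokenizer:
    """Stand-in for a fast tokenizer: calling it can mutate backend state."""
    def __init__(self):
        self.vocab = {"hello": 0, "world": 1}
        self.truncation = None
        self.padding = None

    def __call__(self, text, truncation=False, padding=False):
        # Mimics tokenizers.Tokenizer: enabling options persists on the backend.
        if truncation:
            self.truncation = {"max_length": 512}
        if padding:
            self.padding = {"pad_id": 0}
        return [self.vocab.get(t, -1) for t in text.split()]

def fingerprint(tok):
    """Naive fingerprint: hash the full serialized state."""
    state = json.dumps(vars(tok), sort_keys=True)
    return hashlib.sha256(state.encode()).hexdigest()

@contextmanager
def stable_state(tok):
    """Temporarily clear runtime settings so they don't leak into the hash."""
    saved = (tok.truncation, tok.padding)
    tok.truncation, tok.padding = None, None
    try:
        yield tok
    finally:
        tok.truncation, tok.padding = saved

tok = FakeTokenizer()
before = fingerprint(tok)
tok("hello world", truncation=True, padding=True)  # mutates backend state
assert fingerprint(tok) != before   # naive hash is unstable after a call

with stable_state(tok) as t:
    stable = fingerprint(t)
assert stable == before             # hash with cleared settings is stable
assert tok.truncation is not None   # original settings were restored
```

The same before/after hashing pattern is what the PR's regression test checks, with the real `Hasher.hash(tokenizer)` in place of `fingerprint`.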

@KOKOSde KOKOSde force-pushed the perf/stable-tokenizer-fingerprint branch 2 times, most recently from 8c1891b to 347e84a Compare February 5, 2026 05:29
Tokenizers backed by `tokenizers` can mutate truncation/padding state when called, which made dataset transform fingerprints unstable and prevented `.map(load_from_cache_file=True)` from reusing cached results.

This change makes tokenizer hashing stable by temporarily clearing backend truncation/padding during serialization for fingerprinting, then restoring it.

Add a regression test and a simple benchmark to demonstrate cache-hit speedups.

Fixes huggingface#3847
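Why an unstable fingerprint defeats caching can be sketched with a minimal cache keyed on the hashed transform state (a simplified model of what `.map(load_from_cache_file=True)` does; the function and file-name scheme below are illustrative):

```python
import hashlib

def cache_key(transform_state: str, dataset_id: str) -> str:
    """Cache file name derives from the hashed transform state."""
    h = hashlib.sha256(f"{dataset_id}:{transform_state}".encode())
    return f"cache-{h.hexdigest()[:12]}.arrow"

# First run: tokenizer state is "clean" -> a cache file for key A is written.
key_run1 = cache_key("clean", "my_dataset")

# Second run: calling the tokenizer mutated its state -> key B, cache miss,
# and the whole map is recomputed even though the transform is identical.
key_run2 = cache_key("clean+padding", "my_dataset")
assert key_run1 != key_run2

# With the fix, hashing always sees the cleaned state -> same key, cache hit.
assert cache_key("clean", "my_dataset") == key_run1
```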
@KOKOSde KOKOSde force-pushed the perf/stable-tokenizer-fingerprint branch from 347e84a to 1792715 Compare February 5, 2026 05:41
@KOKOSde
Contributor Author

KOKOSde commented Feb 9, 2026

Hi! It looks like the GitHub Actions check suites for this PR are stuck in the action_required state (no workflows actually ran). This is usually because workflows from a fork need maintainer approval.

Could a maintainer please approve/run the workflows so CI can execute? Happy to address anything CI flags once it runs.

@KOKOSde
Contributor Author

KOKOSde commented Mar 3, 2026

Quick ping on this one. Happy to update anything if you want changes.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@lhoestq lhoestq left a comment

LGTM! I took the liberty of improving it in case a tokenizer stores warning messages, and of removing the benchmark file, which is maybe not necessary.

@lhoestq lhoestq merged commit 4de29bf into huggingface:main Mar 9, 2026
@lhoestq
Member

lhoestq commented Mar 9, 2026

cc @itazap and @ArthurZucker for visibility: this is the trick we use to hash a tokenizer. This way the hash doesn't change even if the state of the tokenizer has changed.
