Fix unstable tokenizer fingerprinting (enables map cache reuse)#7982
lhoestq merged 5 commits into huggingface:main from
Conversation
Force-pushed 8c1891b to 347e84a
Tokenizers backed by `tokenizers` can mutate their truncation/padding state when called, which made dataset transform fingerprints unstable and prevented `.map(load_from_cache_file=True)` from reusing cached results. This change makes tokenizer hashing stable by temporarily clearing the backend truncation/padding state during serialization for fingerprinting, then restoring it. Adds a regression test and a simple benchmark to demonstrate cache-hit speedups. Fixes huggingface#3847
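The save/clear/restore pattern the PR describes can be sketched with a self-contained toy. Note this is a minimal illustration, not the actual `datasets`/`transformers` code: `ToyTokenizer` and both hash helpers are hypothetical stand-ins.

```python
import hashlib
import pickle


class ToyTokenizer:
    """Hypothetical stand-in for a fast tokenizer whose backend
    mutates truncation/padding state when it is called."""

    def __init__(self):
        self.truncation = None
        self.padding = None

    def __call__(self, text, max_length=None):
        # Calling the tokenizer mutates backend state, analogous to the
        # real backend enabling truncation on the fly.
        if max_length is not None:
            self.truncation = {"max_length": max_length}
        return text.split()


def naive_hash(tok):
    # Hashes the full serialized state, so it drifts when the
    # tokenizer is called.
    return hashlib.sha256(pickle.dumps(tok.__dict__)).hexdigest()


def stable_hash(tok):
    # The fix pattern: temporarily clear mutable backend state,
    # hash, then restore the original settings.
    saved = (tok.truncation, tok.padding)
    tok.truncation, tok.padding = None, None
    try:
        return hashlib.sha256(pickle.dumps(tok.__dict__)).hexdigest()
    finally:
        tok.truncation, tok.padding = saved


tok = ToyTokenizer()
before = stable_hash(tok)
tok("some text", max_length=8)  # mutates truncation state
after = stable_hash(tok)
assert before == after  # fingerprint stays stable across calls
assert naive_hash(ToyTokenizer()) != naive_hash(tok)  # naive hash drifts
assert tok.truncation == {"max_length": 8}  # state was restored, not lost
```

The `try`/`finally` ensures the user-visible settings survive even if serialization raises, which is the same reason the real fix restores state after hashing.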
Force-pushed 347e84a to 1792715
Hi! It looks like the GitHub Actions check suites for this PR are in an awaiting-approval state. Could a maintainer please approve/run the workflows so CI can execute? Happy to address anything CI flags once it runs.
Quick ping on this one. Happy to update anything if you want changes.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
lhoestq
left a comment
LGTM ! I took the liberty of improving it in case a tokenizer stores warning messages, and of removing the benchmark file, which may not be necessary
cc @itazap and @ArthurZucker for viz: this is the trick we use to hash a tokenizer; this way the hash doesn't change if the state of the tokenizer has changed
Fix unstable dataset fingerprinting when hashing `PreTrainedTokenizerFast`.

Some tokenizers backed by `tokenizers.Tokenizer` mutate runtime settings (padding/truncation) when called, which can change the serialized state and make dataset fingerprints unstable. That prevents `.map(load_from_cache_file=True)` from reusing cache files.

Fix: when hashing, temporarily disable backend padding/truncation so runtime settings don't affect the fingerprint, then restore the original settings.

Includes a regression test showing `Hasher.hash(tokenizer)` stays stable after calling the tokenizer.
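To see why a stable fingerprint matters for `.map(load_from_cache_file=True)`, here is a minimal stand-alone sketch of fingerprint-keyed caching. The `fingerprint` and `map_with_cache` helpers are hypothetical simplifications of what `datasets` does internally, shown only to illustrate the cache-hit behavior:

```python
import hashlib
import pickle

CACHE = {}


def fingerprint(func_state: dict) -> str:
    """Hypothetical stand-in for datasets' transform fingerprinting:
    the cache key is a hash of the transform's serialized state."""
    return hashlib.sha256(pickle.dumps(sorted(func_state.items()))).hexdigest()


def map_with_cache(data, func_state, func):
    key = fingerprint(func_state)
    if key in CACHE:  # stable fingerprint -> cache hit, no recompute
        return CACHE[key]
    result = [func(x) for x in data]
    CACHE[key] = result
    return result


# If mutable tokenizer state leaked into func_state between calls,
# the second call would get a different key and recompute instead
# of reusing the cached result.
out1 = map_with_cache([1, 2], {"max_length": 8}, lambda x: x * 2)
out2 = map_with_cache([1, 2], {"max_length": 8}, lambda x: x * 2)
assert out1 == [2, 4]
assert out1 is out2  # identical fingerprint -> cached result reused
```

This is the failure mode the PR fixes: before the change, calling the tokenizer changed its serialized state, so two otherwise identical `.map` calls produced different fingerprints and the cache was never reused.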