Fix unstable tokenizer fingerprinting (enables map cache reuse)#7982
lhoestq merged 5 commits into huggingface:main from
Conversation
Force-pushed 8c1891b to 347e84a
Tokenizers backed by `tokenizers` can mutate their truncation/padding state when called, which made dataset transform fingerprints unstable and prevented `.map(load_from_cache_file=True)` from reusing cached results. This change makes tokenizer hashing stable by temporarily clearing the backend truncation/padding state during serialization for fingerprinting, then restoring it. Adds a regression test and a simple benchmark to demonstrate cache-hit speedups. Fixes huggingface#3847
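The save/clear/restore pattern the PR describes can be sketched with a self-contained toy. Note this is a minimal illustration, not the actual `datasets`/`transformers` code: `ToyTokenizer` and both hash helpers are hypothetical stand-ins.

```python
import hashlib
import pickle


class ToyTokenizer:
    """Hypothetical stand-in for a fast tokenizer whose backend
    mutates truncation/padding state when it is called."""

    def __init__(self):
        self.truncation = None
        self.padding = None

    def __call__(self, text, max_length=None):
        # Calling the tokenizer mutates backend state, analogous to the
        # real backend enabling truncation on the fly.
        if max_length is not None:
            self.truncation = {"max_length": max_length}
        return text.split()


def naive_hash(tok):
    # Hashes the full serialized state, so it drifts when the
    # tokenizer is called.
    return hashlib.sha256(pickle.dumps(tok.__dict__)).hexdigest()


def stable_hash(tok):
    # The fix pattern: temporarily clear mutable backend state,
    # hash, then restore the original settings.
    saved = (tok.truncation, tok.padding)
    tok.truncation, tok.padding = None, None
    try:
        return hashlib.sha256(pickle.dumps(tok.__dict__)).hexdigest()
    finally:
        tok.truncation, tok.padding = saved


tok = ToyTokenizer()
before = stable_hash(tok)
tok("some text", max_length=8)  # mutates truncation state
after = stable_hash(tok)
assert before == after  # fingerprint stays stable across calls
assert naive_hash(ToyTokenizer()) != naive_hash(tok)  # naive hash drifts
assert tok.truncation == {"max_length": 8}  # state was restored, not lost
```

The `try`/`finally` ensures the user-visible settings survive even if serialization raises, which is the same reason the real fix restores state after hashing.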
Force-pushed 347e84a to 1792715
Hi! It looks like the GitHub Actions check suites for this PR are in an awaiting-approval state. Could a maintainer please approve/run the workflows so CI can execute? Happy to address anything CI flags once it runs.
Quick ping on this one. Happy to update anything if you want changes.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
lhoestq
left a comment
LGTM ! I took the liberty of improving it in case a tokenizer stores warning messages, and of removing the benchmark file, which may not be necessary
cc @itazap and @ArthurZucker for viz: this is the trick we use to hash a tokenizer; this way the hash doesn't change if the state of the tokenizer has changed
Fix unstable dataset fingerprinting when hashing `PreTrainedTokenizerFast`.

Some tokenizers backed by `tokenizers.Tokenizer` mutate runtime settings (padding/truncation) when called, which can change the serialized state and make dataset fingerprints unstable. That prevents `.map(load_from_cache_file=True)` from reusing cache files.

Fix: when hashing, temporarily disable backend padding/truncation so runtime settings don't affect the fingerprint, then restore the original settings.

Includes a regression test showing `Hasher.hash(tokenizer)` stays stable after calling the tokenizer.
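To see why a stable fingerprint matters for `.map(load_from_cache_file=True)`, here is a minimal stand-alone sketch of fingerprint-keyed caching. The `fingerprint` and `map_with_cache` helpers are hypothetical simplifications of what `datasets` does internally, shown only to illustrate the cache-hit behavior:

```python
import hashlib
import pickle

CACHE = {}


def fingerprint(func_state: dict) -> str:
    """Hypothetical stand-in for datasets' transform fingerprinting:
    the cache key is a hash of the transform's serialized state."""
    return hashlib.sha256(pickle.dumps(sorted(func_state.items()))).hexdigest()


def map_with_cache(data, func_state, func):
    key = fingerprint(func_state)
    if key in CACHE:  # stable fingerprint -> cache hit, no recompute
        return CACHE[key]
    result = [func(x) for x in data]
    CACHE[key] = result
    return result


# If mutable tokenizer state leaked into func_state between calls,
# the second call would get a different key and recompute instead
# of reusing the cached result.
out1 = map_with_cache([1, 2], {"max_length": 8}, lambda x: x * 2)
out2 = map_with_cache([1, 2], {"max_length": 8}, lambda x: x * 2)
assert out1 == [2, 4]
assert out1 is out2  # identical fingerprint -> cached result reused
```

This is the failure mode the PR fixes: before the change, calling the tokenizer changed its serialized state, so two otherwise identical `.map` calls produced different fingerprints and the cache was never reused.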