Fix #2407: Fix get_wikitext2 tokenization bug causing sequence length warning by Mr-Neutr0n · Pull Request #2449 · huggingface/optimum

Mr-Neutr0n · 2026-06-12T09:02:22Z

Changed get_wikitext2 in optimum/gptq/data.py:120-141 to tokenize individual samples inside a retry loop, matching get_c4 and get_c4_new. The previous code concatenated 1000 entries and tokenized them in a single call, which produced a sequence exceeding the model's max length and triggered the sequence length warning.
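A minimal sketch of the per-sample approach described above, not the code in this PR. The names `toy_tokenize` and `sample_windows` are hypothetical, and a whitespace tokenizer stands in for the real model tokenizer so the example runs without dependencies. The `max_tries` bound is an assumption added here so the loop cannot spin forever when no entry is long enough.

```python
import random
from typing import Callable, List, Sequence


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a model tokenizer: splits on whitespace."""
    return text.split()


def sample_windows(
    texts: Sequence[str],
    tokenize: Callable[[str], List[str]],
    seqlen: int,
    nsamples: int,
    rng: random.Random,
    max_tries: int = 1000,  # assumption: bound the retries instead of looping forever
) -> List[List[str]]:
    """Draw nsamples windows of exactly seqlen tokens, one entry at a time."""
    windows = []
    for _ in range(nsamples):
        # Retry until a randomly chosen entry is long enough on its own.
        for _ in range(max_tries):
            tokens = tokenize(rng.choice(texts))
            if len(tokens) >= seqlen:
                break
        else:
            raise ValueError(f"no entry with at least {seqlen} tokens found")
        # Only one entry is tokenized per attempt, so no concatenated
        # over-length sequence is ever passed to the tokenizer.
        start = rng.randint(0, len(tokens) - seqlen)
        windows.append(tokens[start : start + seqlen])
    return windows
```

Each returned window has exactly `seqlen` tokens. Short or empty entries are skipped by the retry rather than padded, which is why the retry loop is needed at all.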

Local test infrastructure was unavailable in the CI sandbox, so the change has not been tested locally.

This change was prepared with AI assistance under human direction and review.

Commit a444593

Fix huggingface#2407: Fix get_wikitext2 tokenization bug causing sequ…
…ence length w

Signed-off-by: Mr-Neutr0n <64578610+Mr-Neutr0n@users.noreply.github.com>

Mr-Neutr0n wants to merge 1 commit into huggingface:main from Mr-Neutr0n:agent/issue-2407-fix-getwikitext2-tokeni
