Fix #2407: Fix get_wikitext2 tokenization bug causing sequence length warning #2449

Draft
Mr-Neutr0n wants to merge 1 commit into huggingface:main from Mr-Neutr0n:agent/issue-2407-fix-getwikitext2-tokeni

@Mr-Neutr0n

Fixes #2407

Changed get_wikitext2 in optimum/gptq/data.py (lines 120-141) to tokenize individual samples in a retry loop, matching get_c4 and get_c4_new. The previous code concatenated 1000 entries and tokenized them in a single call, which produced sequences exceeding the model's max length and triggered the sequence length warning.
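For readers without the diff at hand, the per-sample retry pattern described above can be sketched roughly as follows. This is not the patch itself: `sample_windows`, `toy_tokenize`, and the `max_tries` guard are hypothetical names introduced for illustration, and the whitespace tokenizer only stands in for a real tokenizer so the sketch runs without extra dependencies.

```python
import random
from typing import Callable, List, Sequence


def sample_windows(
    texts: Sequence[str],
    tokenize: Callable[[str], List[int]],
    seqlen: int,
    nsamples: int,
    seed: int = 0,
    max_tries: int = 10_000,  # assumption: guard added here, not taken from the PR
) -> List[List[int]]:
    """Draw `nsamples` windows of exactly `seqlen` tokens, one sample at a time."""
    rng = random.Random(seed)
    windows: List[List[int]] = []
    for _ in range(nsamples):
        # Retry loop: tokenize one randomly chosen sample per attempt and
        # keep drawing until a sample has at least `seqlen` tokens.
        for _ in range(max_tries):
            ids = tokenize(texts[rng.randrange(len(texts))])
            if len(ids) >= seqlen:
                break
        else:
            raise ValueError("no sample with at least seqlen tokens found")
        # Cut a random window of length `seqlen` out of the accepted sample.
        start = rng.randint(0, len(ids) - seqlen)
        windows.append(ids[start:start + seqlen])
    return windows


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one integer id per whitespace-separated word."""
    return [len(word) for word in text.split()]
```

Because each call tokenizes a single dataset entry, no call sees the concatenation of many entries at once. One caveat of this pattern is that short entries are rejected repeatedly, so a cap such as `max_tries` (or an equivalent check) may be worth considering if few entries reach `seqlen` tokens.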

Local test infrastructure was unavailable in the CI sandbox.


This change was prepared with AI assistance under human direction and review.

…ence length w

Signed-off-by: Mr-Neutr0n <64578610+Mr-Neutr0n@users.noreply.github.com>