[add-new-model-like] Robust search & proper outer '),' in tokenizer mapping #38703

alexzms · 2025-06-09T14:50:58Z

What does this PR do?

This PR makes the transformers-cli add-new-model-like command usable again for any model whose tokenizer mapping is written on multiple lines (e.g. llama).

Current failure modes

IndexError while locating TOKENIZER_MAPPING_NAMES.
This happens inside insert_tokenizer_in_auto_module() in transformers/src/transformers/commands/add_new_model_like.py.
The helper scans for the literal
```
"    TOKENIZER_MAPPING_NAMES = OrderedDict("
```
(four leading spaces, no type annotation).
In the current version, that line is un-indented and type-annotated:
```
TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]](
```
The hard-coded startswith() check never matches, the loop index runs past the end of the file's lines, and the command aborts with
```
IndexError: list index out of range
```
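The failure can be reproduced in isolation. The snippet below is a minimal sketch, not the helper's actual code: the function name `find_mapping_start_old` and the sample `lines` are hypothetical, and only the searched prefix and the annotated line are taken from the description above.

```python
# Hypothetical stand-in for the contents of tokenization_auto.py.
lines = [
    "from collections import OrderedDict",
    "TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]](",
    '    ("albert", ("AlbertTokenizer", "AlbertTokenizerFast")),',
    ")",
]


def find_mapping_start_old(lines):
    """Assumed shape of the old lookup: advance until a fixed prefix matches."""
    idx = 0
    # The prefix expects four leading spaces and no type annotation,
    # so it never matches the un-indented, annotated line above.
    while not lines[idx].startswith("    TOKENIZER_MAPPING_NAMES = OrderedDict("):
        idx += 1  # nothing stops the index from running past the last line
    return idx


try:
    find_mapping_start_old(lines)
except IndexError as err:
    error_message = str(err)
    print(f"IndexError: {error_message}")
```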
Unbalanced parentheses once the above is patched.
When a tokenizer entry in transformers/src/transformers/models/auto/tokenization_auto.py spans several lines, the script copies only the inner block ending in
```
            ),
```
but forgets the outer
```
        ),
```
line. Insertion borrows this outer ), from the previous entry, leaving that entry syntactically broken and rendering tokenization_auto.py unimportable.

Fix

Replace the fixed-width search with a regex tolerant of any indentation and optional type annotations:
```
pattern_tokenizer = re.compile(r"^\s*TOKENIZER_MAPPING_NAMES\s*=\s*OrderedDict\b")
```
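As a quick check, the pattern above accepts both the old indented form and the current annotated form. This is a sketch only: the helper `find_mapping_start` and the sample strings are hypothetical and are not part of the patch.

```python
import re

# Pattern quoted from the PR description.
pattern_tokenizer = re.compile(r"^\s*TOKENIZER_MAPPING_NAMES\s*=\s*OrderedDict\b")

old_style = "    TOKENIZER_MAPPING_NAMES = OrderedDict("
new_style = "TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]]("
unrelated = "CONFIG_MAPPING_NAMES = OrderedDict("


def find_mapping_start(lines):
    """Return the index of the first line that opens the mapping, or None.

    Iterating with enumerate() cannot run past the end of the list, so a
    missing match yields None instead of an IndexError.
    """
    for idx, line in enumerate(lines):
        if pattern_tokenizer.match(line):
            return idx
    return None


print(find_mapping_start([unrelated, new_style]))
print(find_mapping_start([old_style]))
print(find_mapping_start([unrelated]))
```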
While copying a multi-line mapping block, keep collecting until the outer
```
        ),
```
line is also captured, ensuring the new block is fully closed before insertion.
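One way to express "collect until the outer `),` is captured" is to track parenthesis depth. The sketch below illustrates that idea under stated assumptions and may differ from the logic in the actual patch: `collect_entry` is a hypothetical helper, and the sample entry only imitates the shape of a multi-line entry in tokenization_auto.py.

```python
def collect_entry(lines, start):
    """Collect lines from `start` until the parentheses it opens are balanced.

    Assumption: parentheses do not appear inside string literals in the
    entry, so a plain character count is enough for this illustration.
    """
    depth = 0
    block = []
    for line in lines[start:]:
        block.append(line)
        depth += line.count("(") - line.count(")")
        if depth <= 0:
            break  # the outer "),", which closes the entry, has been captured
    return block


# Illustrative multi-line entry followed by the start of the next entry.
mapping_lines = [
    "        (",
    '            "llama",',
    "            (",
    '                "LlamaTokenizer" if is_sentencepiece_available() else None,',
    '                "LlamaTokenizerFast" if is_tokenizers_available() else None,',
    "            ),",
    "        ),",
    '        ("next_model", ("NextTokenizer", None)),',
]

block = collect_entry(mapping_lines, 0)
print("\n".join(block))
```

Stopping at the first `),` instead would return only the first six lines and leave the outer parenthesis open, which is the unbalanced state described under the failure modes.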

No external dependencies are introduced; only the standard-library re module is used.

After the patch, running

transformers-cli add-new-model-like

completes without errors, and

python -m py_compile src/transformers/models/auto/tokenization_auto.py

succeeds.
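The same compile check can be scripted with the standard-library py_compile module. This is a small sketch on throwaway snippets rather than on the real file; the helper `compiles` and both sample sources are hypothetical.

```python
import py_compile
import tempfile
from pathlib import Path


def compiles(source: str) -> bool:
    """Write `source` to a temporary file and report whether it byte-compiles."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "snippet.py"
        path.write_text(source)
        try:
            # doraise=True turns syntax errors into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(
                str(path), cfile=str(Path(tmp) / "snippet.pyc"), doraise=True
            )
        except py_compile.PyCompileError:
            return False
        return True


# Entry with its outer "),": closed correctly.
balanced = 'NAMES = [\n    (\n        "llama",\n        ("A", "B"),\n    ),\n]\n'
# Same entry with the outer ")," missing, as in the second failure mode.
unbalanced = 'NAMES = [\n    (\n        "llama",\n        ("A", "B"),\n]\n'

print(compiles(balanced), compiles(unbalanced))
```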

I have not added a dedicated unit test; the change touches a dev-only CLI and has been verified manually. If desired, I can add a small test in tests/commands/. I also believe no documentation needs to change as a result of this modification.

No issue number exists; the bug is reproducible via the steps above.

Before submitting

This PR fixes a typo or improves the docs
Did you read the contributor guideline?
Was this discussed/approved via a GitHub issue or the forum?
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

No documentation or test-case updates are required for this change.

Who can review?

@ArthurZucker @gante

Rocketknight1 · 2025-06-10T12:08:16Z

@bot /style

Rocketknight1 · 2025-06-10T12:08:49Z

Verified the issue and the fix looks good, so I'm happy to accept this! Thank you!

github-actions · 2025-06-10T12:09:46Z

Style fixes have been applied. View the workflow run here.

HuggingFaceDocBuilderDev · 2025-06-10T12:25:45Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

alexzms · 2025-06-10T13:13:16Z

> Verified the issue and the fix looks good, so I'm happy to accept this! Thank you!

I'm glad you like it, thank you so much for helping me solve the code-style problem~

…apping (huggingface#38703)

* [add-new-model-like] Robust search & proper outer '),' in tokenizer mapping
* code-style: arrange the importation in add_new_model_like.py
* Apply style fixes

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

alexzms force-pushed the fix-add-new-model-like-tokenizer branch 2 times, most recently from 793f90b to 74d3e11 Compare June 10, 2025 03:15

Rocketknight1 approved these changes Jun 10, 2025

alexzms and others added 3 commits June 10, 2025 13:12

[add-new-model-like] Robust search & proper outer '),' in tokenizer m…

7b3cf1e

…apping

code-style: arrange the importation in add_new_model_like.py

68aca88

Apply style fixes

fa9dd37

Rocketknight1 force-pushed the fix-add-new-model-like-tokenizer branch from 6b0744e to fa9dd37 Compare June 10, 2025 12:12

Rocketknight1 enabled auto-merge (squash) June 10, 2025 12:12

Rocketknight1 merged commit 8ff22e9 into huggingface:main Jun 10, 2025
11 checks passed
