Skip to content

[add-new-model-like] Robust search & proper outer '),' in tokenizer mapping #38703

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

alexzms
Copy link
Contributor

@alexzms alexzms commented Jun 9, 2025

What does this PR do?

This PR makes the transformers-cli add-new-model-like command usable again for any model whose tokenizer mapping is written on multiple lines (e.g. llama).

Current failure modes

  1. IndexError while locating TOKENIZER_MAPPING_NAMES.
    Inside function transformers/src/transformers/commands/add_new_model_like.py/insert_tokenizer_in_auto_module().
    The helper scans for the literal

    "    TOKENIZER_MAPPING_NAMES = OrderedDict("

    (four leading spaces, no type annotation).
    Since in current version that line is un-indented and type-annotated:

    TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]](

    The hard-coded startswith() never matches, the loop overruns lines, and the command aborts with

    IndexError: list index out of range
    
  2. Unbalanced parentheses once the above is patched.
    When the mapping tokenizers in transformers/src/transformers/models/auto/tokenization_auto.py of which entry spans several lines, the script copies only the inner block ending in

                ),
    

    but forgets the outer

            ),
    

    line. Insertion borrows this outer ), from the previous entry, leaving that entry syntactically broken and rendering tokenization_auto.py unimportable.

Fix

  • Replace the fixed-width search with a regex tolerant of any indentation and optional type annotations:

    pattern_tokenizer = re.compile(r"^\s*TOKENIZER_MAPPING_NAMES\s*=\s*OrderedDict\b")
  • While copying a multi-line mapping block, keep collecting until the outer

            ),

    line is also captured, ensuring the new block is fully closed before insertion.

No external dependencies are introduced-only the standard-library re module is used.

After the patch, running

transformers-cli add-new-model-like

completes without errors, and

python -m py_compile src/transformers/models/auto/tokenization_auto.py

succeeds.

I have not added a dedicated unit test; the change touches a dev-only CLI and has been verified manually. If desired, I can add a small test in tests/commands/. Also, I believe there's no documentation need to change based on this modification.

No issue number exists; the bug is reproducible via the steps above.

Before submitting

  • This PR fixes a typo or improves the docs
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

No necessary documentation/testcases updates are required for this change.

Who can review?

@ArthurZucker @gante

@alexzms alexzms force-pushed the fix-add-new-model-like-tokenizer branch 2 times, most recently from 793f90b to 74d3e11 Compare June 10, 2025 03:15
@Rocketknight1
Copy link
Member

@bot /style

@Rocketknight1
Copy link
Member

Rocketknight1 commented Jun 10, 2025

Verified the issue and the fix looks good, so I'm happy to accept this! Thank you!

Copy link
Contributor

Style fixes have been applied. View the workflow run here.

@Rocketknight1 Rocketknight1 force-pushed the fix-add-new-model-like-tokenizer branch from 6b0744e to fa9dd37 Compare June 10, 2025 12:12
@Rocketknight1 Rocketknight1 enabled auto-merge (squash) June 10, 2025 12:12
@Rocketknight1 Rocketknight1 merged commit 8ff22e9 into huggingface:main Jun 10, 2025
11 checks passed
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@alexzms
Copy link
Contributor Author

alexzms commented Jun 10, 2025

Verified the issue and the fix looks good, so I'm happy to accept this! Thank you!

I'm glad you like it, thank you so much for helping me solve the code-style problem~

lmarshall12 pushed a commit to lmarshall12/transformers that referenced this pull request Jun 11, 2025
…apping (huggingface#38703)

* [add-new-model-like] Robust search & proper outer '),' in tokenizer mapping

* code-style: arrange the importation in add_new_model_like.py

* Apply style fixes

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
bvantuan pushed a commit to bvantuan/transformers that referenced this pull request Jun 12, 2025
…apping (huggingface#38703)

* [add-new-model-like] Robust search & proper outer '),' in tokenizer mapping

* code-style: arrange the importation in add_new_model_like.py

* Apply style fixes

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants