[add-new-model-like] Robust search & proper outer '),' in tokenizer mapping #38703
What does this PR do?
This PR makes the `transformers-cli add-new-model-like` command usable again for any model whose tokenizer mapping is written on multiple lines (e.g. llama).

Current failure modes

IndexError while locating TOKENIZER_MAPPING_NAMES.
In `insert_tokenizer_in_auto_module()` (`transformers/src/transformers/commands/add_new_model_like.py`), the helper scans for the literal `"    TOKENIZER_MAPPING_NAMES = OrderedDict("` (four leading spaces, no type annotation). In the current version that line is un-indented and type-annotated, so the hard-coded `startswith()` never matches, the loop overruns `lines`, and the command aborts with an IndexError.

Unbalanced parentheses once the above is patched.
When an entry of the tokenizer mapping in `transformers/src/transformers/models/auto/tokenization_auto.py` spans several lines, the script copies only the inner block and forgets the outer `),` line. Insertion then borrows this outer `),` from the previous entry, leaving that entry syntactically broken and rendering `tokenization_auto.py` unimportable.

Fix
Replace the fixed-width search with a regex tolerant of any indentation and optional type annotations:
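The exact pattern used by the patch is not reproduced in this description. As a minimal sketch of the idea, assuming a hypothetical helper `find_mapping_start` and a hypothetical pattern `MAPPING_START` (neither name comes from the PR), the tolerant search could look like this. The annotated sample line is only an illustration of an un-indented, type-annotated definition.

```python
import re

# Old behaviour described above: a fixed prefix with four leading spaces.
OLD_PREFIX = "    TOKENIZER_MAPPING_NAMES = OrderedDict("

# Hypothetical replacement pattern (an assumption, not the PR's exact regex):
# any indentation, an optional ": annotation" before "=", and an optional
# "[...]" subscript between "OrderedDict" and the opening parenthesis.
MAPPING_START = re.compile(
    r"^\s*TOKENIZER_MAPPING_NAMES\s*(?::[^=]+)?=\s*OrderedDict(?:\[.*\])?\(\s*$"
)


def find_mapping_start(lines):
    """Return the index of the line that opens the mapping, or raise ValueError."""
    for idx, line in enumerate(lines):
        if MAPPING_START.match(line):
            return idx
    # Failing loudly avoids the IndexError caused by running past the end of `lines`.
    raise ValueError("TOKENIZER_MAPPING_NAMES definition not found")


# Illustrative file content; the annotated line is an assumed example.
sample_lines = [
    "logger = logging.get_logger(__name__)",
    "",
    "TOKENIZER_MAPPING_NAMES = OrderedDict[str, tuple[Optional[str], Optional[str]]](",
    "    [",
    '        ("bert", ("BertTokenizer", "BertTokenizerFast")),',
]

old_hit = any(line.startswith(OLD_PREFIX) for line in sample_lines)
new_idx = find_mapping_start(sample_lines)
print(old_hit, new_idx)  # False 2
```

Raising a `ValueError` when nothing matches is a design choice of this sketch: if the layout of the file changes again, the command would stop with a readable message instead of an `IndexError`.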
While copying a multi-line mapping block, keep collecting until the outer `),` line is also captured, ensuring the new block is fully closed before insertion.
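One way to picture "collect until the outer `),` is captured" is to count parentheses. This is an assumption made for illustration and not necessarily how the patch detects the closing line. The helper name `collect_entry` is hypothetical, and the sketch ignores parentheses that might appear inside string literals.

```python
def collect_entry(lines, start):
    """Return (block_text, next_index) for the mapping entry starting at lines[start].

    Lines are collected until every parenthesis opened by the entry has been
    closed, so a multi-line entry includes its outer "),".
    """
    depth = 0
    block = []
    for idx in range(start, len(lines)):
        line = lines[idx]
        block.append(line)
        depth += line.count("(") - line.count(")")
        if depth <= 0:
            return "\n".join(block), idx + 1
    raise ValueError(f"entry starting at line {start} is never closed")


# Illustrative mapping entries: one multi-line entry, then a one-line entry.
sample = [
    "        (",
    '            "llama",',
    "            (",
    '                "LlamaTokenizer" if is_sentencepiece_available() else None,',
    '                "LlamaTokenizerFast" if is_tokenizers_available() else None,',
    "            ),",
    "        ),",
    '        ("lilt", ("LayoutLMv3Tokenizer", "LayoutLMv3TokenizerFast" if is_tokenizers_available() else None)),',
]

llama_block, next_idx = collect_entry(sample, 0)
lilt_block, end_idx = collect_entry(sample, next_idx)
print(next_idx, end_idx)  # 7 8
```

With this approach the copied block for the multi-line entry ends on its own outer `),`, so nothing has to be borrowed from the neighbouring entry when the new block is inserted.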
No external dependencies are introduced; only the standard-library `re` module is used.

After the patch, running the command completes without errors, and the resulting `tokenization_auto.py` imports successfully.
I have not added a dedicated unit test; the change touches a dev-only CLI and has been verified manually. If desired, I can add a small test in `tests/commands/`. Also, I believe no documentation needs to change for this modification.

No issue number exists; the bug is reproducible via the steps above.
Before submitting
No documentation or test-case updates are required for this change.
Who can review?
@ArthurZucker @gante