Contributor

@eustlb eustlb commented Sep 30, 2025

What does this PR do?

Allows setting language=None in apply_transcription_request to enable automatic language detection.

Note

This is not breaking. Instead of being required, language is now optional.

It was a bit hidden, but Voxtral supports language detection by omitting the language token, e.g.

- <s>[INST][BEGIN_AUDIO][AUDIO]...[AUDIO][/INST]lang:en[TRANSCRIBE]
+ <s>[INST][BEGIN_AUDIO][AUDIO]...[AUDIO][/INST][TRANSCRIBE]

see here:

tokens = [*prefix, *tokens]
if request.language is not None:
    language_string = f"lang:{request.language}"  # no space.
    tokens += self.tokenizer.encode(language_string, bos=False, eos=False)
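As a rough illustration of the two prompt layouts above, here is a minimal string-level sketch. `build_transcription_prompt` and its parameters are hypothetical names used only for this example; the real mistral-common code works on token ids rather than strings.

```python
from typing import Optional


def build_transcription_prompt(num_audio_tokens: int, language: Optional[str] = None) -> str:
    """Hypothetical helper: render the transcription prompt as plain text."""
    prompt = "<s>[INST][BEGIN_AUDIO]" + "[AUDIO]" * num_audio_tokens + "[/INST]"
    if language is not None:
        # Explicit language: append the plain-text hint, with no space after the colon.
        prompt += f"lang:{language}"
    # With language=None nothing is appended, which leaves detection to the model.
    return prompt + "[TRANSCRIBE]"
```

Calling it with `language="en"` gives the first layout, and calling it without a language gives the second.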

Other update

Important

🚨 In the specific case of Voxtral, the added f"lang:xx" prefix (always a two-character language code, since it follows the ISO 639-1 alpha-2 format) is not treated as a special token by mistral-common and is encoded/decoded as normal text.
Nevertheless, we should strip it from the decoded output to make users' lives easier.

Added:

  • skipping logic in MistralCommonTokenizer's decode
  • associated test_decode_transcription_mode

@eustlb eustlb marked this pull request as ready for review September 30, 2025 13:27
@eustlb eustlb requested a review from ArthurZucker September 30, 2025 13:28
Contributor Author

eustlb commented Oct 9, 2025

run-slow: voxtral

@eustlb eustlb changed the title from "[voxtral] enable language=None for language detection" to "[voxtral] language detection + skipping lang:xx" Oct 9, 2025
Collaborator

@ArthurZucker ArthurZucker left a comment

Nice catch 🤗

Comment on lines 477 to 480
lang_prefix = "lang:xx"
if skip_special_tokens and decoded_string.startswith("lang:"):
    decoded_string = decoded_string[len(lang_prefix) :]

Collaborator

If it always starts with it (no other token in between), fine; otherwise I would use a regex.

Contributor Author

it should! but just in case I switched to using regex
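
For reference, a regex-based version of the skipping logic might look like the sketch below. The pattern and the `strip_lang_prefix` helper are assumptions for illustration, not the code that was merged; the sketch assumes a lowercase two-letter code at the very start of the decoded string.

```python
import re

# Assumed pattern: "lang:" followed by a two-letter lowercase code, anchored at the start.
LANG_PREFIX_RE = re.compile(r"^lang:[a-z]{2}")


def strip_lang_prefix(decoded_string: str, skip_special_tokens: bool = True) -> str:
    """Hypothetical helper: drop a leading lang:xx marker when skipping special tokens."""
    if not skip_special_tokens:
        return decoded_string
    # count=1 is redundant with the ^ anchor but makes the intent explicit.
    return LANG_PREFIX_RE.sub("", decoded_string, count=1)
```

Unlike the fixed-length slice, this leaves the string untouched when "lang:" is not followed by a full two-letter code, so a malformed prefix cannot cause transcript characters to be dropped.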

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: voxtral

@eustlb eustlb enabled auto-merge (squash) October 10, 2025 09:11
@eustlb eustlb merged commit c5094a4 into huggingface:main Oct 10, 2025
20 checks passed
AhnJoonSung pushed a commit to AhnJoonSung/transformers that referenced this pull request Oct 12, 2025
* proc + doc update

* improve doc

* add lang:xx in decode

* update voxtral test

* nit

* nit

* update test value

* use regex
ngazagna-qc pushed a commit to ngazagna-qc/transformers that referenced this pull request Oct 23, 2025