[voxtral] language detection + skipping lang:xx #41225

eustlb · 2025-09-30T13:19:16Z

What does this PR do?

Adds the possibility of setting language=None in apply_transcription_request for automatic language detection.

Note

This is not breaking. Instead of being required, language is now optional.

It was a bit hidden, but Voxtral supports language detection by omitting the language token, e.g.

- <s>[INST][BEGIN_AUDIO][AUDIO]...[AUDIO][/INST]lang:en[TRANSCRIBE]
+ <s>[INST][BEGIN_AUDIO][AUDIO]...[AUDIO][/INST][TRANSCRIBE]

see here:

tokens = [*prefix, *tokens]
if request.language is not None:
    language_string = f"lang:{request.language}"  # no space.
    tokens += self.tokenizer.encode(language_string, bos=False, eos=False)
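To make the two prompt layouts above concrete, here is a minimal sketch that works on plain strings rather than real token IDs. The function name `build_transcription_prompt` and the `n_audio` parameter are hypothetical and exist only for illustration; the actual logic lives in mistral-common and operates on token lists.

```python
from typing import Optional


def build_transcription_prompt(language: Optional[str] = None, n_audio: int = 3) -> str:
    """Hypothetical helper: mimic the transcription prompt layout as a string."""
    # Audio placeholders sit between [BEGIN_AUDIO] and [/INST].
    prompt = "<s>[INST][BEGIN_AUDIO]" + "[AUDIO]" * n_audio + "[/INST]"
    # The language hint is plain text with no space; omitting it
    # leaves language detection to the model.
    if language is not None:
        prompt += f"lang:{language}"
    return prompt + "[TRANSCRIBE]"
```

Calling it with `language=None` yields the second layout (no `lang:xx`), while `language="en"` yields the first.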

Other update

Important

🚨 In the specific case of Voxtral, the added f"lang:xx" (always a two-character language code, since it follows the ISO 639-1 alpha-2 format) is not treated as a special token by mistral-common and is encoded/decoded as normal text.
Nevertheless, we should strip it when decoding to make users' lives easier.

Added:

skipping logic in MistralCommonTokenizer's decode
associated test_decode_transcription_mode

HuggingFaceDocBuilderDev · 2025-09-30T13:36:04Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

eustlb · 2025-10-09T15:42:47Z

run-slow: voxtral

ArthurZucker

Nice catch 🤗

ArthurZucker · 2025-10-10T08:27:08Z

src/transformers/tokenization_mistral_common.py

+        lang_prefix = "lang:xx"
+        if skip_special_tokens and decoded_string.startswith("lang:"):
+            decoded_string = decoded_string[len(lang_prefix) :]
+


If it always starts with it (no other token in between), fine; otherwise I would use a regex.

It should! But just in case, I switched to using a regex.
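For readers following the exchange above, a regex-based version of the stripping logic might look like the sketch below. The pattern and the function name `strip_lang_prefix` are assumptions made for illustration, not necessarily what was merged; the sketch only assumes, as stated in the PR description, that the prefix is `lang:` followed by a two-letter code at the very start of the decoded string.

```python
import re

# Assumed pattern: "lang:" plus exactly two lowercase letters, anchored at the start.
_LANG_PREFIX = re.compile(r"^lang:[a-z]{2}")


def strip_lang_prefix(decoded_string: str, skip_special_tokens: bool = True) -> str:
    """Hypothetical helper: drop a leading lang:xx marker from decoded text."""
    if not skip_special_tokens:
        # Leave the text untouched when special tokens are kept.
        return decoded_string
    # Remove the prefix only if the full "lang:xx" shape matches.
    return _LANG_PREFIX.sub("", decoded_string, count=1)
```

Compared with a `startswith("lang:")` check followed by a fixed-length slice, the anchored pattern removes text only when the complete `lang:xx` shape is present, so a string such as `"lang:"` on its own is left unchanged.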

github-actions · 2025-10-10T09:11:23Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: voxtral

* proc + doc update
* improve doc
* add lang:xx in decode
* update voxtral test
* nit
* nit
* update test value
* use regex

eustlb and others added 3 commits September 30, 2025 15:11

proc + doc update

2dfbc61

improve doc

1c03d76

Merge branch 'main' into voxtral-auto-language

9845f30

eustlb marked this pull request as ready for review September 30, 2025 13:27

eustlb requested a review from ArthurZucker September 30, 2025 13:28

eustlb and others added 6 commits October 9, 2025 16:55

add lang:xx in decode

b8518ae

update voxtral test

592d550

nit

419cc33

nit

ae2730a

update test value

e42843a

Merge branch 'main' into voxtral-auto-language

46ce6e8

eustlb changed the title ~~[voxtral] enable language=None for language detection~~ [voxtral] language detection + skipping lang:xx Oct 9, 2025

ArthurZucker approved these changes Oct 10, 2025

use regex

ff5a922

eustlb enabled auto-merge (squash) October 10, 2025 09:11

eustlb merged commit c5094a4 into huggingface:main Oct 10, 2025
20 checks passed
