Add kyutai stt #38909
Merged
Changes from 91 commits
107 commits
936792a
first draft
eustlb 111f3ea
Merge branch 'huggingface:main' into moshi-asr
eustlb 536e55d
cleaner version
eustlb 53f9743
Merge branch 'huggingface:main' into moshi-asr
eustlb 22be12c
udpate tests + modeling
eustlb ca508a0
add tests
eustlb f83bbb0
init
eustlb ff510f7
udpate test_modeling_common
eustlb af3ff35
fix tests
eustlb efd6736
csm Processor draft
eustlb c132465
convertion update
eustlb 9966e56
mimi cache padding convolutions draft
eustlb 4c1feb3
mimi streaming udpates
eustlb 72289ef
update mimi padding cache test
eustlb 8023ebd
udpate cache padding mimi test
eustlb b7dccd3
make style mimi
eustlb 7242952
updates generate moshi asr
eustlb 135730e
moshi asr integration tests (single + batched)
eustlb 0679d29
update tests
eustlb 99322b7
update conversion script
eustlb 427bfb0
good default sliding window value
eustlb 1820457
udpdate generate
eustlb 1267459
update test checkpoint
eustlb 78a5c67
nit
eustlb 08ee7aa
fix mimi
eustlb 29cd507
fix codec prefix
eustlb 34b9b93
Merge branch 'huggingface:main' into moshi-asr
eustlb ac460b7
revert
eustlb ba80082
revert
eustlb 1b75b13
update config
eustlb b2c7b31
update config
eustlb 75fa51e
unnecessary mimi input restriction
eustlb 2f9b96e
remove delay in tokens
eustlb fa42f38
remove _prepare_4d_causal_attention_mask_with_cache_position and _upd…
eustlb f65960f
test update
eustlb 1541d27
modular update
eustlb 43255cd
make style
eustlb 694b202
resolve merge conflict
eustlb 73386fc
nit
eustlb 8676cc4
rename
eustlb 9277315
create codec model generation config at init
eustlb 45820e4
remove delay
eustlb 0d4e86e
max_new_tokens/length warning
eustlb e75bfaf
correct conv1 padding cache import for modular
eustlb 076761e
nit
eustlb 6bd979f
fix on encoder_past_key_values
eustlb 1469405
convert modular
eustlb c7f8e35
move frame_size to config
eustlb d37596d
move frame_size to config
eustlb 74af6ad
update test name
eustlb d7820af
handle first token is bos
eustlb 0edf0a5
better handling of max_new_tokens
eustlb 496285d
fix
eustlb 9e277cf
Merge branch 'main' into add-kyutai-stt
eustlb b3774f4
fix batch size in test input prep
eustlb 4505452
update docstring
eustlb f6f5adb
convert modular
eustlb 3922c8d
make style
eustlb b31f3d9
Merge branch 'main' into add-kyutai-stt
eustlb 4858154
make style
eustlb 7dbac18
Merge branch 'main' into add-kyutai-stt
eustlb 103a6e7
add feature extractor
eustlb f50d364
correct modular convention name for feature_extraction file
eustlb 3b63057
update convertion script
eustlb b0199e3
doc processor
eustlb df39c44
update doc
eustlb 17ca235
udpate init
eustlb 3eeee3d
update model type
eustlb 0738511
fixes
eustlb 0cbfab9
update tests
eustlb 1e016c8
fix
eustlb 25e6dde
make
eustlb dd429a4
Merge branch 'main' into add-kyutai-stt
eustlb 01ceb4b
add doc
eustlb cadfd6f
nit
eustlb aa11ab0
fix
eustlb ea542d2
doc
eustlb 023748e
auto mappings
eustlb 0d327f0
doc
eustlb d2a2802
nit
eustlb f326b97
convert modular
eustlb 67ae5c0
doc
eustlb 8ee7153
nit
eustlb f46bd17
extend _keep_in_fp32_modules to enforce fp32
eustlb e36504a
renaming to stt
eustlb 1f7685f
doc update + test update
eustlb fb5f8e3
doc fixes
eustlb 83b2366
doc fix
eustlb 48c7d7b
Merge branch 'main' into add-kyutai-stt
eustlb 5b19257
doc fix
eustlb 0725a4b
fix musicgen tests
eustlb 06f5ebf
fix musicgen tests
eustlb 0ef59b8
make style
eustlb b11a635
fix musicgen tests
eustlb a669b8e
correct frame_rate config param for mimi
eustlb ad3c364
update mimi test
eustlb 3ae13c9
revert update mimi test
eustlb faa4396
enforce cpu test
eustlb 002e7fe
move cache init in cache class
eustlb a9a46b6
convert modular
eustlb b52dd43
docstring update
eustlb 2251e9d
Merge branch 'main' into add-kyutai-stt
eustlb d6292fc
update model id
eustlb 63fe49c
feature_extractor -> feature_extraction (SEW)
eustlb 3bf8073
convert modular
eustlb be61fcd
Merge branch 'main' into add-kyutai-stt
eustlb c9ddfef
update model id
eustlb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,122 @@ | ||
| <!--Copyright 2025 The HuggingFace Team. All rights reserved. | ||
|
|
||
| Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
| the License. You may obtain a copy of the License at | ||
|
|
||
| http://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
| an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
| specific language governing permissions and limitations under the License. | ||
|
|
||
| ⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be | ||
| rendered properly in your Markdown viewer. | ||
|
|
||
| --> | ||
|
|
||
| # Kyutai Speech-To-Text | ||
| ## Overview | ||
|
|
||
| Kyutai STT is a speech-to-text model architecture based on the [Mimi codec](https://huggingface.co/docs/transformers/en/model_doc/mimi), which encodes audio into discrete tokens in a streaming fashion, and a [Moshi-like](https://huggingface.co/docs/transformers/en/model_doc/moshi) autoregressive decoder. Kyutai’s lab has released two model checkpoints: | ||
| - [kyutai/stt-1b-en_fr](https://huggingface.co/kyutai/stt-1b-en_fr): a 1B-parameter model capable of transcribing both English and French | ||
| - [kyutai/stt-2.6b-en](https://huggingface.co/kyutai/stt-2.6b-en): a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy | ||
|
|
||
| <div class="flex justify-center"> | ||
| <img src="https://huggingface.co/datasets/eustlb/documentation-images/resolve/main/kyutai_stt.png"/> | ||
| </div> | ||
|
|
||
| ## Usage Tips | ||
|
|
||
| ### Inference | ||
|
|
||
| ```python | ||
| import torch | ||
| from datasets import load_dataset, Audio | ||
| from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration | ||
|
|
||
| # 1. load the model and the processor | ||
| torch_device = "cuda" if torch.cuda.is_available() else "cpu" | ||
| model_id = "kyutai/stt-2.6b-en-trfs" | ||
|
|
||
| processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id) | ||
| model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device) | ||
|
|
||
| # 2. load audio samples | ||
| ds = load_dataset( | ||
| "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation" | ||
| ) | ||
| ds = ds.cast_column("audio", Audio(sampling_rate=24000)) | ||
|
|
||
| # 3. prepare the model inputs | ||
| inputs = processor( | ||
| ds[0]["audio"]["array"], | ||
| ) | ||
| inputs.to(torch_device) | ||
|
|
||
| # 4. infer the model | ||
| output_tokens = model.generate(**inputs) | ||
|
|
||
| # 5. decode the generated tokens | ||
| print(processor.batch_decode(output_tokens, skip_special_tokens=True)) | ||
| ``` | ||
|
|
||
| ### Batched Inference | ||
|
|
||
| ```python | ||
| import torch | ||
| from datasets import load_dataset, Audio | ||
| from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration | ||
|
|
||
| # 1. load the model and the processor | ||
| torch_device = "cuda" if torch.cuda.is_available() else "cpu" | ||
| model_id = "kyutai/stt-2.6b-en-trfs" | ||
|
|
||
| processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id) | ||
| model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device) | ||
|
|
||
| # 2. load audio samples | ||
| ds = load_dataset( | ||
| "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation" | ||
| ) | ||
| ds = ds.cast_column("audio", Audio(sampling_rate=24000)) | ||
|
|
||
| # 3. prepare the model inputs | ||
| audio_arrays = [ds[i]["audio"]["array"] for i in range(4)] | ||
| inputs = processor(audio_arrays, return_tensors="pt", padding=True) | ||
| inputs = inputs.to(torch_device) | ||
|
|
||
| # 4. infer the model | ||
| output_tokens = model.generate(**inputs) | ||
|
|
||
| # 5. decode the generated tokens | ||
| decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True) | ||
| for output in decoded_outputs: | ||
| print(output) | ||
| ``` | ||
|
|
||
| This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb). | ||
| The original code can be found [here](https://github.com/kyutai-labs/moshi). | ||
|
|
||
|
|
||
| ## KyutaiSpeechToTextConfig | ||
|
|
||
| [[autodoc]] KyutaiSpeechToTextConfig | ||
|
|
||
| ## KyutaiSpeechToTextProcessor | ||
|
|
||
| [[autodoc]] KyutaiSpeechToTextProcessor | ||
| - __call__ | ||
|
|
||
| ## KyutaiSpeechToTextFeatureExtractor | ||
|
|
||
| [[autodoc]] KyutaiSpeechToTextFeatureExtractor | ||
|
|
||
| ## KyutaiSpeechToTextForConditionalGeneration | ||
|
|
||
| [[autodoc]] KyutaiSpeechToTextForConditionalGeneration | ||
| - forward | ||
| - generate | ||
|
|
||
| ## KyutaiSpeechToTextModel | ||
|
|
||
| [[autodoc]] KyutaiSpeechToTextModel |
as discussed offline, let's extend `_keep_in_fp32_modules` for more intuitive functioning
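
For readers unfamiliar with the attribute mentioned above: in Transformers, `_keep_in_fp32_modules` is a class-level list of module names that should stay in float32 when the rest of the model is loaded in a lower precision. The snippet below is only a minimal sketch of that name-matching idea in plain Python. The helper `plan_param_dtypes`, the matching rule (a listed name must appear as a dotted component of the parameter name), and the parameter names are all hypothetical, chosen for illustration, and are not the library's implementation or the change made in this PR.

```python
def plan_param_dtypes(param_names, load_dtype, keep_in_fp32_modules):
    """Map each parameter name to the dtype it would be loaded in.

    Hypothetical helper: parameters that belong to a module listed in
    `keep_in_fp32_modules` stay in float32, all others use `load_dtype`.
    """
    plan = {}
    for name in param_names:
        # Assumed matching rule: the module name must be one of the
        # dot-separated components of the qualified parameter name.
        components = name.split(".")
        keep = any(module in components for module in keep_in_fp32_modules)
        plan[name] = "float32" if keep else load_dtype
    return plan


# Hypothetical parameter names, for illustration only.
param_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "codec_model.encoder.layers.0.conv.weight",
    "codec_model.quantizer.codebook.embed",
]

plan = plan_param_dtypes(param_names, "float16", ["codec_model"])
for name, dtype in plan.items():
    print(f"{name}: {dtype}")
```

With this rule, listing a top-level submodule name is enough to pin every parameter under it to float32, which is one way to read the "more intuitive" behaviour the comment asks for.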