Merged
Changes from 91 commits
107 commits
936792a
first draft
eustlb May 14, 2025
111f3ea
Merge branch 'huggingface:main' into moshi-asr
eustlb May 14, 2025
536e55d
cleaner version
eustlb May 17, 2025
53f9743
Merge branch 'huggingface:main' into moshi-asr
eustlb May 26, 2025
22be12c
udpate tests + modeling
eustlb May 26, 2025
ca508a0
add tests
eustlb May 26, 2025
f83bbb0
init
eustlb May 26, 2025
ff510f7
udpate test_modeling_common
eustlb May 26, 2025
af3ff35
fix tests
eustlb May 26, 2025
efd6736
csm Processor draft
eustlb May 28, 2025
c132465
convertion update
eustlb May 28, 2025
9966e56
mimi cache padding convolutions draft
eustlb Jun 3, 2025
4c1feb3
mimi streaming udpates
eustlb Jun 4, 2025
72289ef
update mimi padding cache test
eustlb Jun 4, 2025
8023ebd
udpate cache padding mimi test
eustlb Jun 5, 2025
b7dccd3
make style mimi
eustlb Jun 5, 2025
7242952
updates generate moshi asr
eustlb Jun 5, 2025
135730e
moshi asr integration tests (single + batched)
eustlb Jun 5, 2025
0679d29
update tests
eustlb Jun 5, 2025
99322b7
update conversion script
eustlb Jun 5, 2025
427bfb0
good default sliding window value
eustlb Jun 5, 2025
1820457
udpdate generate
eustlb Jun 5, 2025
1267459
update test checkpoint
eustlb Jun 6, 2025
78a5c67
nit
eustlb Jun 6, 2025
08ee7aa
fix mimi
eustlb Jun 6, 2025
29cd507
fix codec prefix
eustlb Jun 6, 2025
34b9b93
Merge branch 'huggingface:main' into moshi-asr
eustlb Jun 10, 2025
ac460b7
revert
eustlb Jun 10, 2025
ba80082
revert
eustlb Jun 10, 2025
1b75b13
update config
eustlb Jun 18, 2025
b2c7b31
update config
eustlb Jun 18, 2025
75fa51e
unnecessary mimi input restriction
eustlb Jun 18, 2025
2f9b96e
remove delay in tokens
eustlb Jun 18, 2025
fa42f38
remove _prepare_4d_causal_attention_mask_with_cache_position and _upd…
eustlb Jun 18, 2025
f65960f
test update
eustlb Jun 18, 2025
1541d27
modular update
eustlb Jun 18, 2025
43255cd
make style
eustlb Jun 18, 2025
694b202
resolve merge conflict
eustlb Jun 19, 2025
73386fc
nit
eustlb Jun 19, 2025
8676cc4
rename
eustlb Jun 19, 2025
9277315
create codec model generation config at init
eustlb Jun 19, 2025
45820e4
remove delay
eustlb Jun 19, 2025
0d4e86e
max_new_tokens/length warning
eustlb Jun 19, 2025
e75bfaf
correct conv1 padding cache import for modular
eustlb Jun 19, 2025
076761e
nit
eustlb Jun 19, 2025
6bd979f
fix on encoder_past_key_values
eustlb Jun 19, 2025
1469405
convert modular
eustlb Jun 19, 2025
c7f8e35
move frame_size to config
eustlb Jun 19, 2025
d37596d
move frame_size to config
eustlb Jun 19, 2025
74af6ad
update test name
eustlb Jun 19, 2025
d7820af
handle first token is bos
eustlb Jun 19, 2025
0edf0a5
better handling of max_new_tokens
eustlb Jun 19, 2025
496285d
fix
eustlb Jun 19, 2025
9e277cf
Merge branch 'main' into add-kyutai-stt
eustlb Jun 19, 2025
b3774f4
fix batch size in test input prep
eustlb Jun 19, 2025
4505452
update docstring
eustlb Jun 19, 2025
f6f5adb
convert modular
eustlb Jun 19, 2025
3922c8d
make style
eustlb Jun 19, 2025
b31f3d9
Merge branch 'main' into add-kyutai-stt
eustlb Jun 19, 2025
4858154
make style
eustlb Jun 19, 2025
7dbac18
Merge branch 'main' into add-kyutai-stt
eustlb Jun 20, 2025
103a6e7
add feature extractor
eustlb Jun 20, 2025
f50d364
correct modular convention name for feature_extraction file
eustlb Jun 20, 2025
3b63057
update convertion script
eustlb Jun 20, 2025
b0199e3
doc processor
eustlb Jun 20, 2025
df39c44
update doc
eustlb Jun 20, 2025
17ca235
udpate init
eustlb Jun 20, 2025
3eeee3d
update model type
eustlb Jun 20, 2025
0738511
fixes
eustlb Jun 20, 2025
0cbfab9
update tests
eustlb Jun 21, 2025
1e016c8
fix
eustlb Jun 21, 2025
25e6dde
make
eustlb Jun 21, 2025
dd429a4
Merge branch 'main' into add-kyutai-stt
eustlb Jun 21, 2025
01ceb4b
add doc
eustlb Jun 21, 2025
cadfd6f
nit
eustlb Jun 21, 2025
aa11ab0
fix
eustlb Jun 21, 2025
ea542d2
doc
eustlb Jun 21, 2025
023748e
auto mappings
eustlb Jun 21, 2025
0d327f0
doc
eustlb Jun 21, 2025
d2a2802
nit
eustlb Jun 21, 2025
f326b97
convert modular
eustlb Jun 21, 2025
67ae5c0
doc
eustlb Jun 21, 2025
8ee7153
nit
eustlb Jun 21, 2025
f46bd17
extend _keep_in_fp32_modules to enforce fp32
eustlb Jun 23, 2025
e36504a
renaming to stt
eustlb Jun 23, 2025
1f7685f
doc update + test update
eustlb Jun 23, 2025
fb5f8e3
doc fixes
eustlb Jun 23, 2025
83b2366
doc fix
eustlb Jun 23, 2025
48c7d7b
Merge branch 'main' into add-kyutai-stt
eustlb Jun 23, 2025
5b19257
doc fix
eustlb Jun 23, 2025
0725a4b
fix musicgen tests
eustlb Jun 23, 2025
06f5ebf
fix musicgen tests
eustlb Jun 23, 2025
0ef59b8
make style
eustlb Jun 23, 2025
b11a635
fix musicgen tests
eustlb Jun 23, 2025
a669b8e
correct frame_rate config param for mimi
eustlb Jun 24, 2025
ad3c364
update mimi test
eustlb Jun 24, 2025
3ae13c9
revert update mimi test
eustlb Jun 24, 2025
faa4396
enforce cpu test
eustlb Jun 24, 2025
002e7fe
move cache init in cache class
eustlb Jun 24, 2025
a9a46b6
convert modular
eustlb Jun 24, 2025
b52dd43
docstring update
eustlb Jun 24, 2025
2251e9d
Merge branch 'main' into add-kyutai-stt
eustlb Jun 24, 2025
d6292fc
update model id
eustlb Jun 24, 2025
63fe49c
feature_extractor -> feature_extraction (SEW)
eustlb Jun 24, 2025
3bf8073
convert modular
eustlb Jun 24, 2025
be61fcd
Merge branch 'main' into add-kyutai-stt
eustlb Jun 24, 2025
c9ddfef
update model id
eustlb Jun 24, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -841,6 +841,8 @@
title: GraniteSpeech
- local: model_doc/hubert
title: Hubert
- local: model_doc/stt
title: Kyutai Speech-To-Text
- local: model_doc/mctct
title: MCTCT
- local: model_doc/mimi
122 changes: 122 additions & 0 deletions docs/source/en/model_doc/stt.md
@@ -0,0 +1,122 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Kyutai Speech-To-Text
## Overview

Kyutai STT is a speech-to-text model architecture based on the [Mimi codec](https://huggingface.co/docs/transformers/en/model_doc/mimi), which encodes audio into discrete tokens in a streaming fashion, and a [Moshi-like](https://huggingface.co/docs/transformers/en/model_doc/moshi) autoregressive decoder. Kyutai has released two model checkpoints:
- [kyutai/stt-1b-en_fr](https://huggingface.co/kyutai/stt-1b-en_fr): a 1B-parameter model capable of transcribing both English and French
- [kyutai/stt-2.6b-en](https://huggingface.co/kyutai/stt-2.6b-en): a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/eustlb/documentation-images/resolve/main/kyutai_stt.png"/>
</div>

## Usage Tips

### Inference

```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "/home/eustache_lebihan/add-moshi-asr/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
inputs = processor(
    ds[0]["audio"]["array"],
)
inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
print(processor.batch_decode(output_tokens, skip_special_tokens=True))
```

### Batched Inference

```python
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration

# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "/home/eustache_lebihan/add-moshi-asr/stt-2.6b-en"

processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)

# 2. load audio samples
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)

# 4. infer the model
output_tokens = model.generate(**inputs)

# 5. decode the generated tokens
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
    print(output)
```

This model was contributed by [Eustache Le Bihan](https://huggingface.co/eustlb).
The original code can be found [here](https://github.com/kyutai-labs/moshi).


## KyutaiSpeechToTextConfig

[[autodoc]] KyutaiSpeechToTextConfig

## KyutaiSpeechToTextProcessor

[[autodoc]] KyutaiSpeechToTextProcessor
- __call__

## KyutaiSpeechToTextFeatureExtractor

[[autodoc]] KyutaiSpeechToTextFeatureExtractor

## KyutaiSpeechToTextForConditionalGeneration

[[autodoc]] KyutaiSpeechToTextForConditionalGeneration
- forward
- generate

## KyutaiSpeechToTextModel

[[autodoc]] KyutaiSpeechToTextModel
5 changes: 4 additions & 1 deletion src/transformers/modeling_utils.py
@@ -4657,8 +4657,11 @@ def from_pretrained(
# The _keep_in_fp32_modules flag is only used to avoid bf16 -> fp16 casting precision issues. It was introduced
# in case of force loading a model that should stay bf16 in fp16 (which includes a few quantizers as this is a pre-processing
# step for e.g. bitsandbytes). See https://github.com/huggingface/transformers/issues/20287 for details.
+ # Update: to extend _keep_in_fp32_modules flag feature, it can also be used to force modules that should stay in fp32
if model._keep_in_fp32_modules is not None and (
- torch_dtype == torch.float16 or getattr(hf_quantizer, "use_keep_in_fp32_modules", False)
+ torch_dtype == torch.float16
+ or torch_dtype == torch.bfloat16
+ or getattr(hf_quantizer, "use_keep_in_fp32_modules", False)
Comment on lines +4661 to +4665
Contributor Author


As discussed offline, let's extend `_keep_in_fp32_modules` so that it behaves more intuitively.

):
# We need to match exact layers, so we add either `.` on each side, or start/end of string
keep_in_fp32_regex = re.compile(
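
To illustrate the change in this hunk, here is a minimal, torch-free sketch of the extended condition and of the "exact layer" matching described in the comment. It is not the code from `modeling_utils.py`: dtypes are represented as plain strings, the helper names `needs_fp32_handling` and `build_keep_in_fp32_regex` are hypothetical, the module name `codec_model` is only an example, and the regex shape is an assumption based on the comment ("`.` on each side, or start/end of string").

```python
import re


def needs_fp32_handling(torch_dtype, keep_in_fp32_modules, quantizer_uses_flag=False):
    """Extended condition from the hunk, with dtypes as strings (stand-ins for torch dtypes)."""
    return keep_in_fp32_modules is not None and (
        torch_dtype == "float16"
        or torch_dtype == "bfloat16"  # newly covered by this change
        or quantizer_uses_flag
    )


def build_keep_in_fp32_regex(module_names):
    """Match a module name only as a whole dotted path component (assumed pattern)."""
    pattern = "|".join(rf"((^|\.){re.escape(name)}($|\.))" for name in module_names)
    return re.compile(pattern)


# Hypothetical module and parameter names, for illustration only.
regex = build_keep_in_fp32_regex(["codec_model"])
param_names = [
    "codec_model.encoder.layers.0.conv.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "my_codec_model_v2.weight",
]
kept_in_fp32 = [name for name in param_names if regex.search(name)]
```

With these inputs only the first parameter name is selected: the third one contains `codec_model` as a substring but not as a full path component, which is the situation the comment about matching exact layers is guarding against.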
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -284,6 +284,7 @@
from .squeezebert import *
from .stablelm import *
from .starcoder2 import *
from .stt import *
from .superglue import *
from .superpoint import *
from .swiftformer import *
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -321,6 +321,7 @@
("squeezebert", "SqueezeBertConfig"),
("stablelm", "StableLmConfig"),
("starcoder2", "Starcoder2Config"),
("stt", "KyutaiSpeechToTextConfig"),
("superglue", "SuperGlueConfig"),
("superpoint", "SuperPointConfig"),
("swiftformer", "SwiftFormerConfig"),
@@ -705,6 +706,7 @@
("squeezebert", "SqueezeBERT"),
("stablelm", "StableLm"),
("starcoder2", "Starcoder2"),
("stt", "KyutaiSpeechToText"),
("superglue", "SuperGlue"),
("superpoint", "SuperPoint"),
("swiftformer", "SwiftFormer"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/feature_extraction_auto.py
@@ -91,6 +91,7 @@
("sew-d", "Wav2Vec2FeatureExtractor"),
("speech_to_text", "Speech2TextFeatureExtractor"),
("speecht5", "SpeechT5FeatureExtractor"),
("stt", "KyutaiSpeechToTextFeatureExtractor"),
("swiftformer", "ViTFeatureExtractor"),
("swin", "ViTFeatureExtractor"),
("swinv2", "ViTFeatureExtractor"),
2 changes: 2 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -299,6 +299,7 @@
("squeezebert", "SqueezeBertModel"),
("stablelm", "StableLmModel"),
("starcoder2", "Starcoder2Model"),
("stt", "KyutaiSpeechToTextModel"),
("superglue", "SuperGlueForKeypointMatching"),
("swiftformer", "SwiftFormerModel"),
("swin", "SwinModel"),
@@ -1053,6 +1054,7 @@
("speech-encoder-decoder", "SpeechEncoderDecoderModel"),
("speech_to_text", "Speech2TextForConditionalGeneration"),
("speecht5", "SpeechT5ForSpeechToText"),
("stt", "KyutaiSpeechToTextForConditionalGeneration"),
("whisper", "WhisperForConditionalGeneration"),
]
)
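
The auto-mapping entries added in these files all key on the model type string `"stt"`. The toy registry below sketches how such name mappings can resolve a model type to class names. It only copies the `"stt"` entries visible in the diffs; the dictionary layout and the `resolve_class_names` helper are hypothetical, and the real lazy-loading machinery in `transformers` is more involved.

```python
from collections import OrderedDict

# Entries copied from the diffs above; the real mappings hold many more model types.
TOY_MAPPINGS = {
    "config": OrderedDict([("stt", "KyutaiSpeechToTextConfig")]),
    "feature_extractor": OrderedDict([("stt", "KyutaiSpeechToTextFeatureExtractor")]),
    "processor": OrderedDict([("stt", "KyutaiSpeechToTextProcessor")]),
    "model": OrderedDict([("stt", "KyutaiSpeechToTextModel")]),
    "speech_seq2seq": OrderedDict([("stt", "KyutaiSpeechToTextForConditionalGeneration")]),
}


def resolve_class_names(model_type):
    """Return the class name registered for `model_type` in each toy mapping."""
    resolved = {}
    for kind, mapping in TOY_MAPPINGS.items():
        if model_type not in mapping:
            raise KeyError(f"model type {model_type!r} is not registered for {kind!r}")
        resolved[kind] = mapping[model_type]
    return resolved


stt_classes = resolve_class_names("stt")
```

Keeping one shared key across all mappings is what lets a single `model_type` value in a checkpoint's config select a consistent set of config, processor and model classes.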
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -121,6 +121,7 @@
("speech_to_text", "Speech2TextProcessor"),
("speech_to_text_2", "Speech2Text2Processor"),
("speecht5", "SpeechT5Processor"),
("stt", "KyutaiSpeechToTextProcessor"),
("trocr", "TrOCRProcessor"),
("tvlt", "TvltProcessor"),
("tvp", "TvpProcessor"),
8 changes: 8 additions & 0 deletions src/transformers/models/mimi/configuration_mimi.py
@@ -111,6 +111,8 @@ class MimiConfig(PretrainedConfig):
use_cache (`bool`, *optional*, defaults to `False`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
use_streaming (`bool`, *optional*, defaults to `False`):
Whether to use streaming mode. If `True`, the model's `encode` method returns the padding cache, which can be passed to a subsequent call to `encode`.
rope_theta (`float`, *optional*, defaults to 10000.0):
The base period of the RoPE embeddings.
sliding_window (`int`, *optional*, defaults to 250):
@@ -172,6 +174,7 @@ def __init__(
initializer_range=0.02,
norm_eps=1e-5,
use_cache=False,
use_streaming=False,
rope_theta=10000.0,
sliding_window=250,
attention_dropout=0.0,
@@ -209,6 +212,7 @@ def __init__(
self.initializer_range = initializer_range
self.norm_eps = norm_eps
self.use_cache = use_cache
self.use_streaming = use_streaming
self.rope_theta = rope_theta
self.sliding_window = sliding_window
self.attention_dropout = attention_dropout
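
The `use_streaming` docstring mentions a padding cache that is carried from one `encode` call to the next. The sketch below shows the general idea on a plain causal 1D convolution (cross-correlation convention, stride 1): the last `kernel_size - 1` input samples are kept as left context for the next chunk. This is an illustration under those assumptions, not Mimi's implementation; `causal_conv1d` and its `cache` argument are hypothetical names.

```python
def causal_conv1d(chunk, kernel, cache=None):
    """Causal 1D convolution that carries left context between calls.

    `cache` holds the last len(kernel) - 1 input samples seen so far;
    on the first call it defaults to zeros, i.e. plain left zero-padding.
    """
    context = len(kernel) - 1
    if cache is None:
        cache = [0] * context
    padded = list(cache) + list(chunk)
    output = [
        sum(weight * padded[start + offset] for offset, weight in enumerate(kernel))
        for start in range(len(chunk))
    ]
    # Keep the trailing samples as left context for the next chunk.
    new_cache = padded[len(padded) - context:]
    return output, new_cache


signal = [1, 2, 3, 4, 5, 6]
kernel = [1, 0, -1]

# Processing the whole signal at once...
full_output, _ = causal_conv1d(signal, kernel)

# ...gives the same result as processing it in two chunks with the cache.
first_output, cache = causal_conv1d(signal[:4], kernel)
second_output, _ = causal_conv1d(signal[4:], kernel, cache)
streamed_output = first_output + second_output
```

Without the cache, the second chunk would be left-padded with zeros again and its first outputs would differ from the non-streaming result.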
@@ -233,5 +237,9 @@ def num_codebooks(self) -> int:
# alias to num_quantizers
return self.num_quantizers

@property
def frame_size(self) -> int:
return int(self.sampling_rate / self.frame_rate)


__all__ = ["MimiConfig"]
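
As a worked example of the new `frame_size` property, the sketch below applies the same arithmetic to a hypothetical stand-in config. The values are assumptions chosen for illustration: a 24 kHz sampling rate (the rate the documentation example resamples to) and a frame rate of 12.5 frames per second. `ToyCodecConfig` is not the real `MimiConfig`.

```python
from dataclasses import dataclass


@dataclass
class ToyCodecConfig:
    """Hypothetical stand-in for a codec config; not the real MimiConfig."""

    sampling_rate: int = 24_000  # Hz, assumed
    frame_rate: float = 12.5  # frames per second, assumed

    @property
    def frame_size(self) -> int:
        # Same arithmetic as the property added in the diff:
        # number of audio samples covered by one codec frame.
        return int(self.sampling_rate / self.frame_rate)


config = ToyCodecConfig()
samples_per_frame = config.frame_size  # 24000 / 12.5 = 1920
# Number of full frames in ten seconds of audio: 240000 // 1920 = 125
frames_in_ten_seconds = (10 * config.sampling_rate) // samples_per_frame
```

Under these assumed values, each frame spans 1920 samples, so ten seconds of audio corresponds to 125 frames.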