GitHub - MoonshotAI/Kimi-Audio: Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation

Kimi-Audio-7B 🤗 | Kimi-Audio-7B-Instruct 🤗 | 📑 Paper

We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository contains the official implementation, models, and evaluation toolkit for Kimi-Audio.

🔥🔥🔥 News!!

May 29, 2025: 👋 We release a finetuning example of Kimi-Audio-7B.
April 27, 2025: 👋 We release pretrained model weights of Kimi-Audio-7B.
April 25, 2025: 👋 We release the inference code and model weights of Kimi-Audio-7B-Instruct.
April 25, 2025: 👋 We release the audio evaluation toolkit Kimi-Audio-Evalkit. You can easily reproduce our results and baselines with this toolkit!
April 25, 2025: 👋 We release the technical report of Kimi-Audio.

Introduction

Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:

Universal Capabilities: Handle diverse tasks like automatic speech recognition (ASR), audio question answering (AQA), automatic audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
State-of-the-Art Performance: Achieve SOTA results on numerous audio benchmarks (see Evaluation and the Technical Report).
Large-Scale Pre-training: Pre-train on over 13 million hours of diverse audio data (speech, music, sounds) and text data, enabling robust audio reasoning and language understanding.
Novel Architecture: Employ a hybrid audio input (continuous acoustic vectors + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation.
Efficient Inference: Feature a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
Open-Source: Release the code and model checkpoints for both pre-training and instruction fine-tuning, and release a comprehensive evaluation toolkit to foster community research and development.

Architecture Overview

Kimi-Audio consists of three main components:

Audio Tokenizer: Converts input audio into:
- Discrete semantic tokens (12.5Hz) using vector quantization.
- Continuous acoustic features derived from a Whisper encoder (downsampled to 12.5Hz).
Audio LLM: A transformer-based model (initialized from a pre-trained text LLM like Qwen 2.5 7B) with shared layers processing multimodal inputs, followed by parallel heads for autoregressively generating text tokens and discrete audio semantic tokens.
Audio Detokenizer: Converts the predicted discrete semantic audio tokens back into high-fidelity waveforms using a flow-matching model and a vocoder (BigVGAN), supporting chunk-wise streaming with a look-ahead mechanism for low latency.
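
To make the 12.5Hz token rate and the chunk-wise decoding idea concrete, here is a minimal sketch. It only illustrates the arithmetic and the chunking pattern; the function names, the chunk size, and the look-ahead length are hypothetical values chosen for illustration, not the configuration used by Kimi-Audio.

```python
import math

TOKEN_RATE_HZ = 12.5  # token rate stated in the architecture overview


def num_tokens(duration_s: float, rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Number of tokens needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * rate_hz)


def chunks_with_lookahead(tokens, chunk_size, lookahead):
    """Yield (chunk, future_context) pairs.

    Each chunk is paired with up to `lookahead` tokens that follow it,
    so a decoder could see a little future context before emitting audio.
    """
    for start in range(0, len(tokens), chunk_size):
        end = start + chunk_size
        yield tokens[start:end], tokens[end:end + lookahead]


# 10 seconds of audio at 12.5 tokens per second
print(num_tokens(10.0))  # 125

# Illustrative chunking of 10 tokens (chunk_size and lookahead are made up)
for chunk, future in chunks_with_lookahead(list(range(10)), chunk_size=4, lookahead=2):
    print(chunk, future)
```

At 12.5Hz each token covers 80 ms of audio, so a larger chunk or look-ahead trades latency for more context per decoding step.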

Getting Started

Step 1: Get the Code

git clone https://github.com/MoonshotAI/Kimi-Audio.git
cd Kimi-Audio
git submodule update --init --recursive
pip install -r requirements.txt

Alternatively, Kimi-Audio can now be installed directly via pip:

pip install torch
pip install git+https://github.com/MoonshotAI/Kimi-Audio.git
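
After installing, a quick sanity check is to confirm that the packages can be located by Python. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name, and `kimia_infer` is the package name taken from the Quick Start import.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# `kimia_infer` is the package imported in the Quick Start example.
for name in ("torch", "kimia_infer"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

This only checks that the modules are discoverable; it does not verify GPU support or that model weights can be downloaded.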

Quick Start

This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn.

import soundfile as sf
from kimia_infer.api.kimia import KimiAudio

# --- 1. Load Model ---
model_path = "moonshotai/Kimi-Audio-7B-Instruct" 
model = KimiAudio(model_path=model_path, load_detokenizer=True)

# --- 2. Define Sampling Parameters ---
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# --- 3. Example 1: Audio-to-Text (ASR) ---
messages_asr = [
    # You can provide context or instructions as text
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    # Provide the audio file path
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"}
]

# Generate only text output
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output) # Expected output: "这并不是告别，这是一个篇章的结束，也是新篇章的开始。"


# --- 4. Example 2: Audio-to-Audio/Text Conversation ---
messages_conversation = [
    # Start conversation with an audio query
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"}
]

# Generate both audio and text output
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated audio
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output) # Expected output: "当然可以，这很简单。一二三四五六七八九十。"

# --- 5. Example 3: Audio-to-Audio/Text Conversation with Multiturn ---

messages = [
    {"role": "user", "message_type": "audio", "content": "test_audios/multiturn/case2/multiturn_q1.wav"},
    # This is the first turn output of Kimi-Audio
    {"role": "assistant", "message_type": "audio-text", "content": ["test_audios/multiturn/case2/multiturn_a1.wav", "当然可以，这很简单。一二三四五六七八九十。"]},
    {"role": "user", "message_type": "audio", "content": "test_audios/multiturn/case2/multiturn_q2.wav"}
]

# Generate both audio and text output for the second user turn
wav_output, text_output = model.generate(messages, **sampling_params, output_type="both")

# Save the generated audio
output_audio_path = "output_audio_multiturn.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000) # Assuming 24kHz output
print(f">>> Multiturn Output Audio saved to: {output_audio_path}")
print(">>> Multiturn Output Text: ", text_output) # Expected output: "没问题，继续数下去就是十一十二十三十四十五十六十七十八十九二十。"

print("Kimi-Audio inference examples complete.")
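
The message dictionaries in the examples above follow a small, regular shape, so it can be convenient to build and check them with helpers. The sketch below is an assumption-laden convenience layer, not part of the Kimi-Audio API: the helper names are hypothetical, and the validation only mirrors the roles and message types that appear in the Quick Start (the library itself may accept more).

```python
# Roles and message types as they appear in the Quick Start examples.
VALID_ROLES = {"user", "assistant"}
VALID_TYPES = {"text", "audio", "audio-text"}


def text_message(role, text):
    """Build a text-only message."""
    return {"role": role, "message_type": "text", "content": text}


def audio_message(role, wav_path):
    """Build an audio message that points at a file path."""
    return {"role": role, "message_type": "audio", "content": wav_path}


def audio_text_message(role, wav_path, text):
    """Build a combined message: [audio path, transcript]."""
    return {"role": role, "message_type": "audio-text", "content": [wav_path, text]}


def validate_messages(messages):
    """Raise ValueError if a message does not match the shapes used above."""
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            raise ValueError(f"message {i}: unexpected role {msg.get('role')!r}")
        mtype = msg.get("message_type")
        if mtype not in VALID_TYPES:
            raise ValueError(f"message {i}: unexpected message_type {mtype!r}")
        content = msg.get("content")
        if mtype == "audio-text":
            ok = (
                isinstance(content, (list, tuple))
                and len(content) == 2
                and all(isinstance(c, str) for c in content)
            )
        else:
            ok = isinstance(content, str)
        if not ok:
            raise ValueError(f"message {i}: content does not match {mtype!r}")
    return messages


# Rebuild a two-turn history in the same shape as Example 3 (paths are placeholders).
history = validate_messages([
    audio_message("user", "q1.wav"),
    audio_text_message("assistant", "a1.wav", "first reply"),
    audio_message("user", "q2.wav"),
])
print(len(history), "messages validated")
```

The validated list can then be passed to `model.generate` exactly as in the examples above.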

Evaluation

Kimi-Audio achieves state-of-the-art (SOTA) performance across a wide range of audio benchmarks.

[Figure: overall performance of Kimi-Audio]

Results on individual benchmarks are listed below. You can reproduce our results and the baselines with Kimi-Audio-Evalkit (see also Evaluation Toolkit):

Automatic Speech Recognition (ASR)

Datasets	Model	Performance (WER %, ↓)
LibriSpeech test-clean | test-other	Qwen2-Audio-base	1.74 | 4.04
	Baichuan-base	3.02 | 6.04
	Step-Audio-chat	3.19 | 10.67
	Qwen2.5-Omni	2.37 | 4.21
	Kimi-Audio	1.28 | 2.42
Fleurs zh | en	Qwen2-Audio-base	3.63 | 5.20
	Baichuan-base	4.15 | 8.07
	Step-Audio-chat	4.26 | 8.56
	Qwen2.5-Omni	2.92 | 4.17
	Kimi-Audio	2.69 | 4.44
AISHELL-1	Qwen2-Audio-base	1.52
	Baichuan-base	1.93
	Step-Audio-chat	2.14
	Qwen2.5-Omni	1.13
	Kimi-Audio	0.60
AISHELL-2 ios	Qwen2-Audio-base	3.08
	Baichuan-base	3.87
	Step-Audio-chat	3.89
	Qwen2.5-Omni	2.56
	Kimi-Audio	2.56
WenetSpeech test-meeting | test-net	Qwen2-Audio-base	8.40 | 7.64
	Baichuan-base	13.28 | 10.13
	Step-Audio-chat	10.83 | 9.47
	Qwen2.5-Omni	7.71 | 6.04
	Kimi-Audio	6.28 | 5.37
Kimi-ASR Internal Testset subset1 | subset2	Qwen2-Audio-base	2.31 | 3.24
	Baichuan-base	3.41 | 5.60
	Step-Audio-chat	2.82 | 4.74
	Qwen2.5-Omni	1.53 | 2.68
	Kimi-Audio	1.42 | 2.44
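
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference and a hypothesis (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a plain reference implementation of that definition; it is not the code from Kimi-Audio-Evalkit, and it applies no text normalization, which real evaluation pipelines usually do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
```

For languages written without spaces, such as Chinese, the same computation is typically run over characters rather than whitespace-separated words.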

Audio Understanding

Datasets	Model	Performance↑
MMAU music | sound | speech	Qwen2-Audio-base	58.98 | 69.07 | 52.55
	Baichuan-chat	49.10 | 59.46 | 42.47
	GLM-4-Voice	38.92 | 43.54 | 32.43
	Step-Audio-chat	49.40 | 53.75 | 47.75
	Qwen2.5-Omni	62.16 | 67.57 | 53.92
	Kimi-Audio	61.68 | 73.27 | 60.66
ClothoAQA test | dev	Qwen2-Audio-base	71.73 | 72.63
	Baichuan-chat	48.02 | 48.16
	Step-Audio-chat	45.84 | 44.98
	Qwen2.5-Omni	72.86 | 73.12
	Kimi-Audio	71.24 | 73.18
VocalSound	Qwen2-Audio-base	93.82
	Baichuan-base	58.17
	Step-Audio-chat	28.58
	Qwen2.5-Omni	93.73
	Kimi-Audio	94.85
Nonspeech7k	Qwen2-Audio-base	87.17
	Baichuan-chat	59.03
	Step-Audio-chat	21.38
	Qwen2.5-Omni	69.89
	Kimi-Audio	93.93
MELD	Qwen2-Audio-base	51.23
	Baichuan-chat	23.59
	Step-Audio-chat	33.54
	Qwen2.5-Omni	49.83
	Kimi-Audio	59.13
TUT2017	Qwen2-Audio-base	33.83
	Baichuan-base	27.9
	Step-Audio-chat	7.41
	Qwen2.5-Omni	43.27
	Kimi-Audio	65.25
CochlScene test | dev	Qwen2-Audio-base	52.69 | 50.96
	Baichuan-base	34.93 | 34.56
	Step-Audio-chat	10.06 | 10.42
	Qwen2.5-Omni	63.82 | 63.82
	Kimi-Audio	79.84 | 80.99

Audio-to-Text Chat

Datasets	Model	Performance↑
OpenAudioBench AlpacaEval | Llama Questions | Reasoning QA | TriviaQA | Web Questions	Qwen2-Audio-chat	57.19 | 69.67 | 42.77 | 40.30 | 45.20
	Baichuan-chat	59.65 | 74.33 | 46.73 | 55.40 | 58.70
	GLM-4-Voice	57.89 | 76.00 | 47.43 | 51.80 | 55.40
	Step-Audio-chat	56.53 | 72.33 | 60.00 | 56.80 | 73.00
	Qwen2.5-Omni	72.76 | 75.33 | 63.76 | 57.06 | 62.80
	Kimi-Audio	75.73 | 79.33 | 58.02 | 62.10 | 70.20
VoiceBench AlpacaEval | CommonEval | SD-QA | MMSU	Qwen2-Audio-chat	3.69 | 3.40 | 35.35 | 35.43
	Baichuan-chat	4.00 | 3.39 | 49.64 | 48.80
	GLM-4-Voice	4.06 | 3.48 | 43.31 | 40.11
	Step-Audio-chat	3.99 | 2.99 | 46.84 | 28.72
	Qwen2.5-Omni	4.33 | 3.84 | 57.41 | 56.38
	Kimi-Audio	4.46 | 3.97 | 63.12 | 62.17
VoiceBench OpenBookQA | IFEval | AdvBench | Avg	Qwen2-Audio-chat	49.01 | 22.57 | 98.85 | 54.72
	Baichuan-chat	63.30 | 41.32 | 86.73 | 62.51
	GLM-4-Voice	52.97 | 24.91 | 88.08 | 57.17
	Step-Audio-chat	31.87 | 29.19 | 65.77 | 48.86
	Qwen2.5-Omni	79.12 | 53.88 | 99.62 | 72.83
	Kimi-Audio	83.52 | 61.10 | 100.00 | 76.93
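
The VoiceBench "Avg" column mixes scores on different scales: AlpacaEval and CommonEval look like 1-5 ratings, while the other subsets are percentages. The README does not state how the average is formed. One reading that reproduces the Kimi-Audio and Qwen2.5-Omni rows is to rescale the two 1-5 scores to 0-100 (multiply by 20) and then take the unweighted mean of all seven subsets. The sketch below works through that assumption; `voicebench_avg` is a hypothetical helper, not a function from the evaluation toolkit.

```python
def voicebench_avg(scores_1_to_5, scores_percent):
    """Mean of all subsets after mapping 1-5 ratings onto a 0-100 scale.

    Assumption: 1-5 ratings are multiplied by 20 before averaging.
    """
    rescaled = [score * 20.0 for score in scores_1_to_5]
    values = rescaled + list(scores_percent)
    return sum(values) / len(values)


# Kimi-Audio row: AlpacaEval, CommonEval | SD-QA, MMSU, OpenBookQA, IFEval, AdvBench
kimi = voicebench_avg([4.46, 3.97], [63.12, 62.17, 83.52, 61.10, 100.00])
print(round(kimi, 2))  # 76.93, matching the Avg reported in the table

# Qwen2.5-Omni row, same column order
qwen = voicebench_avg([4.33, 3.84], [57.41, 56.38, 79.12, 53.88, 99.62])
print(round(qwen, 2))  # 72.83
```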

Speech Conversation

Performance of Kimi-Audio and baseline models on speech conversation.

Model	Speed Control	Accent Control	Emotion Control	Empathy	Style Control	Avg
GPT-4o	4.21	3.65	4.05	3.87	4.54	4.06
Step-Audio-chat	3.25	2.87	3.33	3.05	4.14	3.33
GLM-4-Voice	3.83	3.51	3.77	3.07	4.04	3.65
GPT-4o-mini	3.15	2.71	4.24	3.16	4.01	3.45
Kimi-Audio	4.30	3.45	4.27	3.39	4.09	3.90
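
As a reading aid for the table above, the sketch below loads the rows into a dictionary, checks that "Avg" agrees with the unweighted mean of the five abilities (an assumption; the README does not define the column, and small differences are expected because the published scores are already rounded), and reports which model leads each ability. The variable and function names are illustrative only.

```python
ABILITIES = ["Speed Control", "Accent Control", "Emotion Control", "Empathy", "Style Control"]

# Rows copied from the table: five ability scores followed by the reported Avg.
ROWS = {
    "GPT-4o":          ([4.21, 3.65, 4.05, 3.87, 4.54], 4.06),
    "Step-Audio-chat": ([3.25, 2.87, 3.33, 3.05, 4.14], 3.33),
    "GLM-4-Voice":     ([3.83, 3.51, 3.77, 3.07, 4.04], 3.65),
    "GPT-4o-mini":     ([3.15, 2.71, 4.24, 3.16, 4.01], 3.45),
    "Kimi-Audio":      ([4.30, 3.45, 4.27, 3.39, 4.09], 3.90),
}


def mean(values):
    return sum(values) / len(values)


def best_per_ability(rows):
    """Map each ability name to the model with the highest score."""
    leaders = {}
    for index, ability in enumerate(ABILITIES):
        leaders[ability] = max(rows, key=lambda model: rows[model][0][index])
    return leaders


for model, (scores, reported_avg) in ROWS.items():
    # Differences below 0.01 are attributed to rounding of the published scores.
    print(f"{model}: mean={mean(scores):.3f} reported={reported_avg:.2f}")

for ability, model in best_per_ability(ROWS).items():
    print(f"{ability}: {model}")
```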

Finetune

We release the pre-trained model and lightweight fine-tuning code. Please refer to finetune_codes/README.md for more details.

Evaluation Toolkit

Evaluating and comparing audio foundation models is challenging due to inconsistent metrics, varying inference configurations, and a lack of standardized generation evaluation. To address this, we developed and open-sourced an Evaluation Toolkit.

Key features:

Integrates Kimi-Audio and other recent audio LLMs.
Implements standardized metric calculation and integrates LLMs for intelligent judging (e.g., for AQA).
Provides a unified platform for side-by-side comparisons with shareable inference 'recipes' for reproducibility.
Includes a benchmark for evaluating speech conversation abilities (control, empathy, style).

We encourage the community to use and contribute to this toolkit to foster more reliable and comparable benchmarking. Find it here: Kimi-Audio-Evalkit.

Generation Testset

We collect and release Kimi-Audio-Generation-Testset, which is designed to benchmark and evaluate the conversational capabilities of audio-based dialogue models. It consists of a collection of audio files containing various instructions and conversational prompts. The primary goal is to assess a model's ability to generate audio responses that are not only relevant but also appropriately styled. The dataset is in Chinese.

License

The model is based on and modified from Qwen2.5-7B. Code derived from Qwen2.5-7B is licensed under the Apache 2.0 License. Other parts of the code are licensed under the MIT License.

Acknowledgements

We would like to thank the following projects and individuals for their contributions to the development of Kimi-Audio:

Thank you to all the open-source projects for their contributions to this project!

Citation

If you find Kimi-Audio useful in your research or applications, please cite our technical report:

@misc{kimiteam2025kimiaudiotechnicalreport,
      title={Kimi-Audio Technical Report}, 
      author={KimiTeam and Ding Ding and Zeqian Ju and Yichong Leng and Songxiang Liu and Tong Liu and Zeyu Shang and Kai Shen and Wei Song and Xu Tan and Heyi Tang and Zhengtao Wang and Chu Wei and Yifei Xin and Xinran Xu and Jianwei Yu and Yutao Zhang and Xinyu Zhou and Y. Charles and Jun Chen and Yanru Chen and Yulun Du and Weiran He and Zhenxing Hu and Guokun Lai and Qingcheng Li and Yangyang Liu and Weidong Sun and Jianzhou Wang and Yuzhi Wang and Yuefeng Wu and Yuxin Wu and Dongchao Yang and Hao Yang and Ying Yang and Zhilin Yang and Aoxiong Yin and Ruibin Yuan and Yutong Zhang and Zaida Zhou},
      year={2025},
      eprint={2504.18425},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2504.18425}, 
}

Contact Us

For questions, issues, or collaboration inquiries, please feel free to open an issue on GitHub.
