Yunming Liang, Zihao Chen, Chaofan Ding, Xinhan Di
DeepSound-V1 is a framework that enables audio generation from videos with initial step-by-step thinking, without requiring extra annotations, based on the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM).
Video-to-audio results on the VGGSound test set. Bold text highlights where our proposed method outperforms previous methods, and the bracketed values give the improvement rate for each metric.
| V2A Model | Method | FD_PaSST ↓ | FD_PANNs ↓ | FD_VGG ↓ | KL_PANNs ↓ | KL_PaSST ↓ | IS ↑ | IB-score ↑ | DeSync ↓ |
|---|---|---|---|---|---|---|---|---|---|
| MMAudio-S-44k | Direct & Ori-Set | 65.25 | 5.55 | 1.66 | 1.67 | 1.44 | 18.02 | 32.27 | 0.444 |
| | Direct & VO-Free | 65.47 | 5.77 | 1.03 | 2.22 | 1.82 | 13.32 | 31.16 | 0.487 |
| | Direct-neg & Ori-Set | 68.44 | 6.48 | 1.71 | 2.27 | 1.84 | 13.74 | 30.51 | 0.505 |
| | Our best & VO-Free | **65.07 (0.27%)** | 6.08 | **1.02 (38.61%)** | 2.20 | 1.82 | 13.39 | 30.82 | 0.496 |
| MMAudio-M-44k | Direct & Ori-Set | 61.88 | 4.74 | 1.13 | 1.66 | 1.41 | 17.41 | 32.99 | 0.443 |
| | Direct & VO-Free | 56.07 | 4.57 | 0.99 | 2.15 | 1.74 | 13.91 | 32.19 | 0.479 |
| | Direct-neg & Ori-Set | 60.21 | 4.79 | 1.66 | 2.20 | 1.76 | 14.68 | 32.13 | 0.486 |
| | Our best & VO-Free | **55.65 (10.07%)** | 4.80 | **0.93 (17.70%)** | 2.15 | 1.77 | 13.82 | 31.44 | 0.495 |
| MMAudio-L-44k | Direct & Ori-Set | 60.60 | 4.72 | 0.97 | 1.65 | 1.40 | 17.40 | 33.22 | 0.442 |
| | Direct & VO-Free | 56.29 | 4.29 | 1.03 | 2.13 | 1.72 | 14.54 | 32.74 | 0.475 |
| | Direct-neg & Ori-Set | 59.50 | 4.62 | 1.75 | 2.19 | 1.76 | 15.42 | 32.36 | 0.490 |
| | Our best & VO-Free | **55.19 (8.93%)** | **4.42 (6.36%)** | **0.95 (2.06%)** | 2.13 | 1.75 | 14.49 | 31.94 | 0.490 |
| YingSound | Direct & Ori-Set | 69.37 | 6.28 | 0.78 | 1.70 | 1.41 | 14.02 | 27.75 | 0.956 |
| | Direct & VO-Free | 68.78 | 5.33 | 0.70 | 1.74 | 1.45 | 14.63 | 27.75 | 0.956 |
| | Direct-neg & Ori-Set | 77.86 | 7.37 | 0.75 | 2.20 | 1.83 | 12.48 | 27.15 | 0.991 |
| | Our best & VO-Free | **68.95 (0.60%)** | **5.57 (11.32%)** | **0.72 (8.32%)** | 1.73 | 1.45 | **14.71 (4.95%)** | 27.56 | 0.962 |
| FoleyCrafter | Direct & Ori-Set | 140.09 | 19.67 | 2.51 | 2.30 | 2.23 | 15.58 | 25.68 | 1.225 |
| | Direct & VO-Free | 130.67 | 17.59 | 2.12 | 2.59 | 2.28 | 9.94 | 27.96 | 1.215 |
| | Direct-neg & Ori-Set | 181.45 | 21.17 | 3.17 | 2.73 | 2.43 | 10.48 | 27.34 | 1.223 |
| | Our best & VO-Free | **127.97 (8.65%)** | **17.39 (11.62%)** | **2.12 (15.42%)** | 2.57 | 2.29 | 9.96 | **27.43 (6.39%)** | **1.214 (0.89%)** |
Ablation results on MMAudio-L-44k. The improvement of ours over the baseline is shown in bold, demonstrating the effectiveness of the learned CoT reasoning in enhancing the final audio quality; the improvement between Ours-s3 and Ours-s4 is shown in bold italics.
| Method | FD_PaSST ↓ | FD_PANNs ↓ | FD_VGG ↓ | KL_PANNs ↓ | KL_PaSST ↓ | IS ↑ | IB-score ↑ | DeSync ↓ |
|---|---|---|---|---|---|---|---|---|
| setting1 & vggtest-gt1 | 60.60 | 4.72 | 0.97 | 1.65 | 1.40 | 17.40 | 33.22 | 0.442 |
| setting1 & vggtest-gt2 | 56.29 | 4.29 | 1.03 | 2.13 | 1.72 | 14.54 | 32.74 | 0.475 |
| setting1-neg & vggtest-gt1 | 59.50 | 4.62 | 1.75 | 2.19 | 1.76 | 15.42 | 32.36 | 0.490 |
| setting2-cot & vggtest-gt2 | **55.19 (8.93%)** | **4.42 (6.36%)** | **0.95 (2.06%)** | 2.13 | 1.75 | 14.49 | 31.94 | 0.490 |
| setting3-rm & vggtest-gt2 | 55.75 | 4.49 | 1.00 | 2.12 | 1.73 | 14.70 | 32.25 | 0.484 |
| setting3-rep & vggtest-gt2 | ***55.66 (8.15%)*** | ***4.45 (5.72%)*** | ***0.97 (0.00%)*** | 2.14 | ***1.74 (0.57%)*** | ***14.61 (0.83%)*** | ***32.16 (0.69%)*** | ***0.486 (0.82%)*** |
| setting3-neg & vggtest-gt2 | 55.66 | 4.44 | 0.99 | 2.13 | 1.74 | 14.65 | 32.17 | 0.487 |
Overall process:

*(demo videos: `output_hevc.mp4`, `deepsound-demo.mp4`)*
| Direct Result | Step-by-step Result |
|---|---|
| aa.mp4 | aa.mp4 |
| iEhCQX3a44k_000313.mp4 | iEhCQX3a44k_000313.mp4 |
| QnR9JhVGgu8_000348.mp4 | QnR9JhVGgu8_000348.mp4 |
| QnR9JhVGgu8_000348.mp4 | JdzRzkTzmMw_000098.mp4 |
```bash
conda create -n deepsound-v1 python=3.10 -y
conda activate deepsound-v1

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0
pip install flash-attn==2.5.8 --no-build-isolation
pip install -r requirements.txt
```
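The pinned versions above assume a CUDA build of PyTorch 2.4.0. As an optional sanity check after installation, a short snippet like the one below (not part of the repository; the file name `check_env.py` is made up) confirms that the packages and GPU support are visible from Python:

```python
# check_env.py -- optional post-install sanity check (not part of the repository).
import torch
import torchaudio
import torchvision

print("torch       :", torch.__version__)        # expected 2.4.0 (+ CUDA suffix)
print("torchvision :", torchvision.__version__)  # expected 0.19.0
print("torchaudio  :", torchaudio.__version__)   # expected 2.4.0
print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn  :", flash_attn.__version__)  # expected 2.5.8
except ImportError:
    print("flash-attn not installed")
```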
See MODELS.md.
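MODELS.md is the authoritative description of which checkpoints to download and where to place them. Purely as an illustration, assuming the default path used by `init_pipeline()` in the example script below, a quick existence check might look like this (the script itself is hypothetical and not part of the repository):

```python
# verify_models.py -- illustrative only; the required layout is defined in MODELS.md.
from pathlib import Path

# Default MLLM checkpoint path used by init_pipeline() in the example below.
# The V2A (e.g. MMAudio) and BS-RoFormer weights described in MODELS.md are needed as well.
expected = [
    Path("pretrained/mllm/VideoLLaMA2.1-7B-AV-CoT"),
]

for path in expected:
    print(f"{'found  ' if path.exists() else 'MISSING'}: {path}")
```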
Example inference script:

```python
# coding=utf-8
import os
import subprocess
import sys
import time
from pathlib import Path

import torch
from moviepy.editor import AudioFileClip, VideoFileClip

# Make the bundled MMAudio package importable before loading the pipeline.
project_root = os.path.dirname(os.path.abspath(__file__))
mmaudio_path = os.path.join(project_root, 'third_party', 'MMAudio')
sys.path.append(mmaudio_path)

from pipeline.pipeline import Pipeline
from third_party.MMAudio.mmaudio.eval_utils import setup_eval_logging


@torch.inference_mode()
def init_pipeline(step0_model_dir='pretrained/mllm/VideoLLaMA2.1-7B-AV-CoT',
                  step1_mode='mmaudio_medium_44k',
                  step2_model_dir='pretrained/mllm/VideoLLaMA2.1-7B-AV-CoT',
                  step2_mode='cot',
                  step3_mode='bs_roformer'):
    # Load all models of the step-by-step pipeline once.
    st = time.time()
    pipeline = Pipeline(
        step0_model_dir=step0_model_dir,
        step1_mode=step1_mode,
        step2_model_dir=step2_model_dir,
        step2_mode=step2_mode,
        step3_mode=step3_mode,
    )
    et = time.time()
    print(f"Initialize models time: {et - st:.2f} s.")
    return pipeline


@torch.inference_mode()
def video_to_audio(pipeline: Pipeline, video_input, output_dir, mode='s4', postp_mode='neg',
                   prompt='', negative_prompt='', duration=10, skip_final_video=False):
    st_infer = time.time()
    step_results = pipeline.run(video_input=video_input,
                                output_dir=output_dir,
                                mode=mode,
                                postp_mode=postp_mode,
                                prompt=prompt,
                                negative_prompt=negative_prompt,
                                duration=duration)
    temp_final_audio_path = step_results["temp_final_audio_path"]
    temp_final_video_path = step_results["temp_final_video_path"]
    final_audio_path = str(Path(output_dir).expanduser() / f'{Path(video_input).expanduser().stem}.wav')
    final_video_path = str(Path(output_dir).expanduser() / f'{Path(video_input).expanduser().stem}.mp4')

    # Copy the generated audio into the output directory.
    if temp_final_audio_path is not None:
        subprocess.run(['cp', str(temp_final_audio_path), final_audio_path], check=True)
        step_results["final_audio_path"] = final_audio_path

    if skip_final_video:
        step_results["final_video_path"] = None
    else:
        if temp_final_video_path is not None:
            subprocess.run(['cp', str(temp_final_video_path), final_video_path], check=True)
        else:
            # No muxed video was produced by the pipeline: combine the
            # generated audio with the input video via moviepy.
            audio = AudioFileClip(final_audio_path)
            video = VideoFileClip(video_input)
            duration = min(audio.duration, video.duration)
            audio = audio.subclip(0, duration)
            video.audio = audio
            video = video.subclip(0, duration)
            video.write_videofile(final_video_path)
        step_results["final_video_path"] = final_video_path

    et_infer = time.time()
    print(f"Inference time: {et_infer - st_infer:.2f} s.")
    return step_results


@torch.inference_mode()
def main():
    setup_eval_logging()
    video_input = "aa.mp4"
    output_dir = "output"
    os.makedirs(output_dir, exist_ok=True)
    pipeline = init_pipeline()
    step_results = video_to_audio(pipeline, video_input, output_dir)
    print("step_results: ", step_results)
    print(f"final_audio_path: {step_results['final_audio_path']}")
    print(f"final_video_path: {step_results['final_video_path']}")


if __name__ == '__main__':
    main()
```
The output (audio in `.wav` format and video in `.mp4` format) will be saved in `./output`.
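To process several clips with a single model load, the functions above can be reused. The sketch below assumes the example script is saved as `run.py` (the module name is hypothetical) and keeps the defaults documented above (`mode='s4'`, `postp_mode='neg'`, 10 s duration):

```python
# batch_run.py -- sketch for batch processing; assumes the example above is saved as run.py.
from pathlib import Path

from run import init_pipeline, video_to_audio


def batch(video_dir="videos", output_dir="output"):
    pipeline = init_pipeline()  # load all pipeline models once
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        results = video_to_audio(pipeline, str(video), output_dir,
                                 mode="s4", postp_mode="neg", duration=10)
        print(video.name, "->", results.get("final_audio_path"))


if __name__ == "__main__":
    batch()
```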
With `demo.py`:

```bash
python demo.py -i <video_path>
```

Other parameters are described here.
With `gradio_demo.py`:

```bash
python gradio_demo.py
```
Refer to av-benchmark for benchmarking results. See EVAL.md.
```bibtex
@misc{liang2025deepsoundv1startthinkstepbystep,
      title={DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos},
      author={Yunming Liang and Zihao Chen and Chaofan Ding and Xinhan Di},
      year={2025},
      eprint={2503.22208},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2503.22208},
}
```
Many thanks to:
- VideoLLaMA2 for MLLM CoT reasoning
- MMAudio for audio generation from video
- FoleyCrafter for audio generation from video
- BS-RoFormer for voice-over removal
- av-benchmark for benchmarking results