22
33## Exporting Whisper with Beam Search
44
5- There are two ways to export Whisper with beam search (using Whisper tiny as an example).
5+ There are several ways to export Whisper with beam search (using Whisper tiny as an example).
6+
7+ ### Option 1: from convert_to_onnx
68
7- Option 1: from source
89```
10+ # From source
911$ git clone https://github.com/microsoft/onnxruntime
10- $ cd onnxruntime/onnxruntime/python/tools/transformers/models/whisper
11- $ python3 convert_to_onnx.py -m openai/whisper-tiny --output whispertiny --use_external_data_format
12+ $ cd onnxruntime/onnxruntime/python/tools/transformers/
13+ $ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format
14+
15+ # From wheel
16+ $ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format
1217```
1318
14- Option 2: from wheel
19+ ### Option 2: end-to-end model from [Olive](https://github.com/microsoft/Olive/tree/main/examples/whisper)
20+
21+ Follow the [README instructions](https://github.com/microsoft/Olive/tree/main/examples/whisper#prerequisites) in Olive.
22+
23+ ### Option 3: from [Hugging Face Optimum](https://github.com/huggingface/optimum)
24+
25+ Run the following Python code to export:
26+
1527```
16- $ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format
28+ from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
29+
30+ model_name = "openai/whisper-large-v2"
31+ model = ORTModelForSpeechSeq2Seq.from_pretrained(
32+     model_name,
33+     export=True,
34+ )
35+ model.save_pretrained(model_name.split("/")[-1] + "-onnx")
1736```
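Once the Optimum export above has finished, the saved directory can be loaded back for inference. The following is a minimal sketch, not part of the original README: the helper names `export_dir_name` and `transcribe` are hypothetical, and it assumes `optimum[onnxruntime]` and `transformers` are installed and that the audio file can be decoded on your machine.

```python
def export_dir_name(model_name: str) -> str:
    """Mirror the directory naming used by the export snippet above."""
    return model_name.split("/")[-1] + "-onnx"


def transcribe(audio_path: str, model_name: str = "openai/whisper-large-v2") -> str:
    """Hypothetical helper: run speech recognition with the exported ONNX model."""
    # Imported lazily so export_dir_name() works without Optimum installed.
    from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
    from transformers import AutoProcessor, pipeline

    # Load the ONNX model from the directory written by save_pretrained().
    model = ORTModelForSpeechSeq2Seq.from_pretrained(export_dir_name(model_name))
    # The tokenizer and feature extractor are loaded by the original model name.
    processor = AutoProcessor.from_pretrained(model_name)
    asr = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
    )
    return asr(audio_path)["text"]
```

For `openai/whisper-large-v2`, `export_dir_name` yields `whisper-large-v2-onnx`, which is the directory later passed to `--hf-ort-model-path` in the benchmark examples.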
1837
1938## Exporting + Optimizing + Quantizing Whisper with Beam Search
@@ -23,7 +42,7 @@ Here are some additional examples for exporting Whisper with beam search.
2342Export with Forced Decoder Input Ids
2443```
2544# From source:
26- $ python3 convert_to_onnx.py -m openai/whisper-tiny --output whispertiny --use_external_data_format --use_forced_decoder_ids
45+ $ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --use_forced_decoder_ids
2746
2847# From wheel:
2948$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --use_forced_decoder_ids
@@ -32,7 +51,7 @@ $ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/w
3251Export + Optimize for FP32
3352```
3453# From source:
35- $ python3 convert_to_onnx.py -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp32
54+ $ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp32
3655
3756# From wheel:
3857$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp32
@@ -41,7 +60,7 @@ $ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/w
4160Export + Optimize for FP16 and GPU
4261```
4362# From source:
44- $ python3 convert_to_onnx.py -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp16 --use_gpu --provider cuda
63+ $ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp16 --use_gpu --provider cuda
4564
4665# From wheel:
4766$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp16 --use_gpu --provider cuda
@@ -50,8 +69,128 @@ $ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/w
5069Export + Quantize for INT8
5170```
5271# From source:
53- $ python3 convert_to_onnx.py -m openai/whisper-tiny --output whispertiny --use_external_data_format --precision int8 --quantize_embedding_layer
72+ $ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --precision int8 --quantize_embedding_layer
5473
5574# From wheel:
5675$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --precision int8 --quantize_embedding_layer
5776```
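The export variants above differ only in a handful of flags, so they can be generated from one helper when scripting several exports. This is a sketch under stated assumptions: `build_export_command` is a hypothetical name, it targets the wheel invocation, and it only uses flags that appear in the examples above.

```python
import shlex


def build_export_command(
    model_name,
    output_dir,
    precision=None,
    optimize=False,
    use_gpu=False,
    provider=None,
    quantize_embedding_layer=False,
    use_forced_decoder_ids=False,
):
    """Hypothetical helper: assemble a convert_to_onnx command as an argument list."""
    cmd = [
        "python3", "-m", "onnxruntime.transformers.models.whisper.convert_to_onnx",
        "-m", model_name,
        "--output", output_dir,
        "--use_external_data_format",
    ]
    if use_forced_decoder_ids:
        cmd.append("--use_forced_decoder_ids")
    if optimize:
        cmd.append("--optimize_onnx")
    if precision is not None:
        cmd += ["--precision", precision]
    if use_gpu:
        cmd.append("--use_gpu")
    if provider is not None:
        cmd += ["--provider", provider]
    if quantize_embedding_layer:
        cmd.append("--quantize_embedding_layer")
    return cmd


if __name__ == "__main__":
    # Print the FP16 + GPU variant; pass the list to subprocess.run() to execute it.
    fp16_gpu = build_export_command(
        "openai/whisper-tiny", "whispertiny",
        precision="fp16", optimize=True, use_gpu=True, provider="cuda",
    )
    print(shlex.join(fp16_gpu))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting concerns.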
77+
78+ ## Benchmark Whisper
79+
80+ Here are some examples of how you can benchmark Whisper across various end-to-end (E2E) implementations.
81+
82+ Note: In the examples below, `PyTorch` refers to running in PyTorch without `torch.compile`, and `PyTorch 2.0` refers to running in PyTorch with `torch.compile`.
83+
84+ ### Variants
85+
86+ 1. PyTorch (without `torch.compile`), FP32
87+ ```
88+ python3 -m models.whisper.benchmark \
89+ --benchmark-type hf-pt \
90+ --audio-path 1272-141231-0002.mp3 \
91+ --model-name openai/whisper-large-v2 \
92+ --precision fp32 \
93+ --device cpu
94+ ```
95+
96+ 2. PyTorch 2.0 (with `torch.compile`), FP16
97+ ```
98+ python3 -m models.whisper.benchmark \
99+ --benchmark-type hf-pt2 \
100+ --audio-path 1272-141231-0002.mp3 \
101+ --model-name openai/whisper-large-v2 \
102+ --precision fp16 \
103+ --device cuda
104+ ```
105+
106+ 3. Optimum + ONNX Runtime, FP32, export via Optimum
107+ ```
108+ python3 -m models.whisper.benchmark \
109+ --benchmark-type hf-ort \
110+ --audio-path 1272-141231-0002.mp3 \
111+ --model-name openai/whisper-large-v2 \
112+ --hf-ort-model-path ./whisper-large-v2-onnx/ \
113+ --precision fp32 \
114+ --device cpu
115+ ```
116+
117+ 4. ONNX Runtime, FP32, export via Olive or convert_to_onnx
118+ ```
119+ python3 -m models.whisper.benchmark \
120+ --benchmark-type ort \
121+ --audio-path 1272-141231-0002.mp3 \
122+ --model-name openai/whisper-large-v2 \
123+ --ort-model-path ./wlarge-fp32/whisper-large-v2_beamsearch.onnx \
124+ --precision fp32 \
125+ --device cpu
126+ ```
127+
128+ 5. ONNX Runtime, FP16, export via Olive or convert_to_onnx
129+ ```
130+ python3 -m models.whisper.benchmark \
131+ --benchmark-type ort \
132+ --audio-path 1272-141231-0002.mp3 \
133+ --model-name openai/whisper-large-v2 \
134+ --ort-model-path ./wlarge-fp32/whisper-large_all.onnx \
135+ --precision fp16 \
136+ --device cuda
137+ ```
138+
139+ 6. ONNX Runtime, INT8, export via Olive or convert_to_onnx
140+ ```
141+ python3 -m models.whisper.benchmark \
142+ --benchmark-type ort \
143+ --audio-path 1272-141231-0002.mp3 \
144+ --model-name openai/whisper-large-v2 \
145+ --ort-model-path ./wlarge-fp32/whisper-large-v2_all.onnx \
146+ --precision int8 \
147+ --device cpu
148+ ```
149+
150+ You can profile a variant by adding the `--profile` flag.
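When switching between variants, or toggling profiling on and off, it can help to keep the per-variant flags in one place. The sketch below is illustrative only: the variant keys and the `benchmark_command` helper are hypothetical names, while the flags and values are copied from the examples above. It assumes the command is run from the `transformers` tools directory of a source checkout.

```python
import shlex

# Per-variant flags copied from the examples above (a subset, for brevity).
# The dictionary keys are hypothetical labels, not names used by the benchmark script.
VARIANTS = {
    "hf-pt-fp32-cpu": {
        "--benchmark-type": "hf-pt",
        "--precision": "fp32",
        "--device": "cpu",
    },
    "hf-pt2-fp16-cuda": {
        "--benchmark-type": "hf-pt2",
        "--precision": "fp16",
        "--device": "cuda",
    },
    "ort-fp32-cpu": {
        "--benchmark-type": "ort",
        "--ort-model-path": "./wlarge-fp32/whisper-large-v2_beamsearch.onnx",
        "--precision": "fp32",
        "--device": "cpu",
    },
}


def benchmark_command(variant, audio_path, model_name, profile=False):
    """Hypothetical helper: build the benchmark command for one variant."""
    cmd = [
        "python3", "-m", "models.whisper.benchmark",
        "--audio-path", audio_path,
        "--model-name", model_name,
    ]
    for flag, value in VARIANTS[variant].items():
        cmd += [flag, value]
    if profile:
        # Profiling is enabled by appending the --profile flag.
        cmd.append("--profile")
    return cmd


if __name__ == "__main__":
    cmd = benchmark_command(
        "hf-pt-fp32-cpu", "1272-141231-0002.mp3", "openai/whisper-large-v2", profile=True
    )
    print(shlex.join(cmd))
```

The flag order differs from the hand-written examples, which should not matter for a command-line parser that accepts named options.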
151+
152+ ### Benchmark All
153+
154+ You can use `benchmark_all.py` to benchmark across various platforms and automatically store the results in a CSV file. For example:
155+
156+ ```
157+ python3 -m models.whisper.benchmark_all \
158+ --audio-path ./whisper-test-audios/ \
159+ --hf-ort-model-path ./whisper-large-v2-onnx/ \
160+ --ort-model-path ./wlarge-fp32/whisper-large-v2_all.onnx \
161+ --model-name openai/whisper-large-v2 \
162+ --precision fp32 \
163+ --device cpu
164+ ```
165+
166+ ### Benchmarking on NVIDIA A100
167+
168+ Here is a benchmark for an MP3 file with 20.7s of audio.
169+
170+ #### FP16
171+
172+ | Engine | Size | Per-Token Latency | Real-Time Factor |
173+ | ------------- | -------- | ----------------- | ---------------- |
174+ | PyTorch | Tiny | 4.697 ms/token | 0.004697 |
175+ | PyTorch 2.0 | Tiny | 3.406 ms/token | 0.003406 |
176+ | ONNX Runtime | Tiny | 0.746 ms/token | 0.000746 |
177+ | PyTorch | Medium | 17.837 ms/token | 0.017387 |
178+ | PyTorch 2.0 | Medium | 18.124 ms/token | 0.018124 |
179+ | ONNX Runtime | Medium | 3.894 ms/token | 0.003894 |
180+ | PyTorch | Large v2 | 23.470 ms/token | 0.023470 |
181+ | PyTorch 2.0 | Large v2 | 23.146 ms/token | 0.023146 |
182+ | ONNX Runtime | Large v2 | 6.262 ms/token | 0.006262 |
183+
184+ #### FP32
185+
186+ | Engine | Size | Per-Token Latency | Real-Time Factor |
187+ | ------------- | -------- | ----------------- | ---------------- |
188+ | PyTorch | Tiny | 6.220 ms/token | 0.006220 |
189+ | PyTorch 2.0 | Tiny | 3.944 ms/token | 0.003944 |
190+ | ONNX Runtime | Tiny | 1.545 ms/token | 0.001545 |
191+ | PyTorch | Medium | 19.093 ms/token | 0.019093 |
192+ | PyTorch 2.0 | Medium | 20.459 ms/token | 0.020459 |
193+ | ONNX Runtime | Medium | 9.440 ms/token | 0.009440 |
194+ | PyTorch | Large v2 | 25.844 ms/token | 0.025844 |
195+ | PyTorch 2.0 | Large v2 | 26.397 ms/token | 0.026397 |
196+ | ONNX Runtime | Large v2 | 7.492 ms/token | 0.007492 |
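To compare engines, the per-token latencies in the tables above can be turned into speedup ratios. The following sketch copies a few rows (Tiny and Large v2) from both tables; the `speedup` helper is a hypothetical name, and the ratio is simply baseline latency divided by candidate latency.

```python
# Per-token latencies in ms/token, copied from the A100 tables above.
LATENCY_MS = {
    ("fp16", "Tiny"): {"PyTorch": 4.697, "PyTorch 2.0": 3.406, "ONNX Runtime": 0.746},
    ("fp16", "Large v2"): {"PyTorch": 23.470, "PyTorch 2.0": 23.146, "ONNX Runtime": 6.262},
    ("fp32", "Tiny"): {"PyTorch": 6.220, "PyTorch 2.0": 3.944, "ONNX Runtime": 1.545},
    ("fp32", "Large v2"): {"PyTorch": 25.844, "PyTorch 2.0": 26.397, "ONNX Runtime": 7.492},
}


def speedup(precision, size, baseline="PyTorch", candidate="ONNX Runtime"):
    """Return how many times faster `candidate` is than `baseline` per token."""
    row = LATENCY_MS[(precision, size)]
    return row[baseline] / row[candidate]


if __name__ == "__main__":
    for precision, size in LATENCY_MS:
        for baseline in ("PyTorch", "PyTorch 2.0"):
            ratio = speedup(precision, size, baseline)
            print(f"{precision} {size}: ONNX Runtime is {ratio:.2f}x faster than {baseline}")
```

On these rows, ONNX Runtime comes out roughly 3.4x to 6.3x faster than PyTorch without `torch.compile`.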