Add documentation for the vad_threshold parameter in AssemblyAI U3 Pro:
- Add vad_threshold to AssemblyAIConnectionParams table
- Add usage example showing VAD threshold alignment with Silero VAD
- Add note about VAD threshold alignment to avoid "dead zone"
- Explain the misalignment issue between AssemblyAI (default 0.3) and
Pipecat's Silero VAD (default 0.7)
This corresponds to the vad_threshold parameter added in pipecat-ai/pipecat#3927
server/services/stt/assemblyai.mdx (26 additions & 1 deletion)
@@ -137,6 +137,7 @@ Connection-level parameters passed via the `connection_params` constructor argum
|`language_detection`|`bool`|`None`| Enable automatic language detection. Only applicable to `universal-streaming-multilingual`. Turn messages include language information. |
|`format_turns`|`bool`|`True`| Whether to format transcript turns. |
|`speaker_labels`|`bool`|`None`| Enable speaker diarization. Final transcripts include a speaker field (e.g., "Speaker A", "Speaker B"). |
+|`vad_threshold`|`float`|`None`| Voice activity detection (VAD) confidence threshold, from 0.0 to 1.0. Only applicable to `u3-rt-pro`. Frames with VAD confidence below this value are classified as silence. Increase it in noisy environments to reduce false speech detection. When left as `None` (the default), the parameter is not sent and the API default of 0.3 applies. When using an external VAD (e.g., Silero), align this value with that VAD's activation threshold. |

## Usage

@@ -206,9 +207,33 @@ stt = AssemblyAISTTService(
)
```

+### With Aligned VAD Thresholds
+
+When using AssemblyAI with an external VAD (e.g., Silero VAD), set both thresholds to the same value. Otherwise there is a "dead zone" in which AssemblyAI is already transcribing speech that your VAD has not yet detected:
+
+```python
+import os
+
+from pipecat.services.assemblyai import AssemblyAISTTService
+from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
+from pipecat.audio.vad.silero import SileroVADAnalyzer, VADParams
+
+# Use 0.3 for both thresholds so the two detectors agree on what counts as speech
+stt = AssemblyAISTTService(
+    api_key=os.getenv("ASSEMBLYAI_API_KEY"),
+    connection_params=AssemblyAIConnectionParams(
+        speech_model="u3-rt-pro",
+        vad_threshold=0.3,  # Match Silero VAD
+    ),
+)
+
+vad = SileroVADAnalyzer(
+    params=VADParams(confidence=0.3)  # Match AssemblyAI
+)
+```
+
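To make the "dead zone" concrete, here is a minimal sketch in plain Python. It does not use Pipecat or the AssemblyAI API. It assumes a simplified model in which both detectors see the same per-frame confidence and treat a frame as speech when the confidence is at or above their threshold. The names `FrameDecision`, `classify_frames`, and `dead_zone_count` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameDecision:
    """Hypothetical per-frame result under a simplified shared-confidence model."""

    confidence: float
    stt_hears_speech: bool  # confidence >= STT-side vad_threshold
    vad_hears_speech: bool  # confidence >= local VAD confidence threshold

    @property
    def in_dead_zone(self) -> bool:
        # The STT side treats the frame as speech, but the local VAD does not.
        return self.stt_hears_speech and not self.vad_hears_speech


def classify_frames(confidences, stt_threshold, vad_threshold):
    """Apply both thresholds to every frame confidence."""
    return [
        FrameDecision(c, c >= stt_threshold, c >= vad_threshold)
        for c in confidences
    ]


def dead_zone_count(confidences, stt_threshold, vad_threshold):
    """Count frames that only the STT side classifies as speech."""
    decisions = classify_frames(confidences, stt_threshold, vad_threshold)
    return sum(d.in_dead_zone for d in decisions)


if __name__ == "__main__":
    # Made-up confidences for illustration, not measured data.
    confidences = [0.1, 0.35, 0.5, 0.65, 0.8]
    print("misaligned (0.3 vs 0.7):", dead_zone_count(confidences, 0.3, 0.7))
    print("aligned (0.3 vs 0.3):", dead_zone_count(confidences, 0.3, 0.3))
```

With the defaults described in this PR (0.3 for AssemblyAI, 0.7 for Silero), every frame with a confidence in the range 0.3 up to but not including 0.7 falls into the dead zone in this model. Setting both thresholds to the same value removes that range.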
## Notes

-- **u3-rt-pro model**: The default model is now `u3-rt-pro`, which provides the best performance and supports built-in turn detection.
+- **u3-rt-pro model**: The default model is now `u3-rt-pro`, which provides the best performance and supports built-in turn detection.
+- **VAD threshold alignment**: When using AssemblyAI with an external VAD (e.g., Silero VAD), align the `vad_threshold` parameter with your VAD's activation threshold (recommended: 0.3 for both). A mismatch creates a "dead zone" in which AssemblyAI is already transcribing speech that your VAD has not yet detected, which can delay interruption handling. AssemblyAI's `vad_threshold` defaults to 0.3, while Pipecat's Silero VAD defaults to 0.7; we recommend lowering Silero to 0.3 to match.
- **Turn detection modes**:
  - **Pipecat mode** (`vad_force_turn_endpoint=True`, default): Forces AssemblyAI to return finals as soon as possible so that Pipecat's turn detection (e.g., Smart Turn) decides when the user is done. The service sends a `ForceEndpoint` message when VAD detects that the user has stopped speaking.
  - **AssemblyAI mode** (`vad_force_turn_endpoint=False`, u3-rt-pro only): AssemblyAI's model controls turn endings using built-in turn detection. The service emits `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` based on AssemblyAI's detection.