
Commit 2f53feb

docs(assemblyai): add vad_threshold and remove legacy parameters
- Add vad_threshold parameter documentation for U3 Pro
- Remove formatted_finals (v2 API legacy parameter)
- Remove word_finalization_max_wait_time (v2 API legacy parameter)
- Clarify format_turns only applies to Universal-Streaming models
- Add VAD threshold alignment usage example and notes

This corresponds to the code changes in pipecat-ai/pipecat#3927
1 parent 73f73b7 commit 2f53feb

1 file changed

Lines changed: 41 additions & 5 deletions

File tree

server/services/stt/assemblyai.mdx

Lines changed: 41 additions & 5 deletions
@@ -125,8 +125,6 @@ Connection-level parameters passed via the `connection_params` constructor argum
125125
| ---------------------------------------- | ----------- | -------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
126126
| `sample_rate` | `int` | `16000` | Audio sample rate in Hz. |
127127
| `encoding` | `Literal` | `"pcm_s16le"` | Audio encoding format. Options: `"pcm_s16le"`, `"pcm_mulaw"`. |
128-
| `formatted_finals` | `bool` | `True` | Whether to enable transcript formatting. |
129-
| `word_finalization_max_wait_time` | `int` | `None` | Maximum time to wait for word finalization in milliseconds. |
130128
| `end_of_turn_confidence_threshold` | `float` | `None` | Confidence threshold for end-of-turn detection. |
131129
| `min_turn_silence` | `int` | `None` | Minimum silence duration (ms) when confident about end-of-turn. |
132130
| `min_end_of_turn_silence_when_confident` | `int` | `None` | **DEPRECATED**. Use `min_turn_silence` instead. Will be removed in a future version. |
@@ -135,18 +133,23 @@ Connection-level parameters passed via the `connection_params` constructor argum
135133
| `prompt` | `str` | `None` | Optional text prompt to guide transcription. Only used when `speech_model` is `"u3-rt-pro"`. Cannot be used with `keyterms_prompt`. |
136134
| `speech_model` | `Literal` | `"u3-rt-pro"` | Speech model. Options: `"universal-streaming-english"`, `"universal-streaming-multilingual"`, `"u3-rt-pro"`. |
137135
| `language_detection` | `bool` | `None` | Enable automatic language detection. Only applicable to `universal-streaming-multilingual`. Turn messages include language information. |
138-
| `format_turns` | `bool` | `True` | Whether to format transcript turns. |
136+
| `format_turns` | `bool` | `True` | Whether to format transcript turns. Only applicable to `universal-streaming-english` and `universal-streaming-multilingual` models. For `u3-rt-pro`, formatting is automatic and built-in. |
139137
| `speaker_labels` | `bool` | `None` | Enable speaker diarization. Final transcripts include a speaker field (e.g., "Speaker A", "Speaker B"). |
138+
| `vad_threshold` | `float` | `None` | Voice activity detection (VAD) confidence threshold (0.0 to 1.0) used to classify audio frames as silence. Only applicable to `u3-rt-pro`. Frames with VAD confidence below this value are considered silent. Increase for noisy environments to reduce false speech detection. When left as `None` (the default), the parameter is not sent and the API default of 0.3 applies. For best performance with an external VAD (e.g., Silero), align this value with your VAD's activation threshold. |
140139

141140
## Usage
142141

143142
### Basic Setup
144143

145144
```python
146145
from pipecat.services.assemblyai import AssemblyAISTTService
146+
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
147147

148148
stt = AssemblyAISTTService(
149149
api_key=os.getenv("ASSEMBLYAI_API_KEY"),
150+
connection_params=AssemblyAIConnectionParams(
151+
speech_model="u3-rt-pro",
152+
),
150153
)
151154
```
152155

@@ -160,7 +163,6 @@ stt = AssemblyAISTTService(
160163
api_key=os.getenv("ASSEMBLYAI_API_KEY"),
161164
connection_params=AssemblyAIConnectionParams(
162165
sample_rate=16000,
163-
formatted_finals=True,
164166
keyterms_prompt=["Pipecat", "AssemblyAI"],
165167
speech_model="u3-rt-pro",
166168
),
@@ -206,9 +208,43 @@ stt = AssemblyAISTTService(
206208
)
207209
```
208210

211+
### With Aligned VAD Thresholds
212+
213+
When using AssemblyAI with an external VAD (e.g., Silero VAD), align both thresholds for optimal performance:
214+
215+
```python
216+
from pipecat.audio.vad.silero import SileroVADAnalyzer, VADParams
217+
from pipecat.processors.aggregators.llm_response_universal import (
218+
LLMContextAggregatorPair,
219+
LLMUserAggregatorParams,
220+
)
221+
from pipecat.services.assemblyai import AssemblyAISTTService
222+
from pipecat.services.assemblyai.models import AssemblyAIConnectionParams
223+
224+
# Align VAD thresholds to 0.3 for optimal performance
225+
stt = AssemblyAISTTService(
226+
api_key=os.getenv("ASSEMBLYAI_API_KEY"),
227+
connection_params=AssemblyAIConnectionParams(
228+
speech_model="u3-rt-pro",
229+
vad_threshold=0.3, # Match with Silero VAD
230+
),
231+
)
232+
233+
# Configure Silero VAD with matching threshold
234+
user_aggregator, assistant_aggregator = LLMContextAggregatorPair(
235+
context,
236+
user_params=LLMUserAggregatorParams(
237+
vad_analyzer=SileroVADAnalyzer(
238+
params=VADParams(confidence=0.3) # Match with AssemblyAI
239+
)
240+
),
241+
)
242+
```
243+
209244
## Notes
210245

211-
- **u3-rt-pro model**: The default model is now `u3-rt-pro`, which provides the best performance and supports built-in turn detection.
246+
- **u3-rt-pro model**: The default model is now `u3-rt-pro`, which provides the best performance and supports built-in turn detection.
247+
- **VAD threshold alignment**: When using AssemblyAI with external VAD systems (e.g., Silero VAD), align the `vad_threshold` parameter with your VAD's activation threshold (recommended: 0.3 for both). Misaligned thresholds create a "dead zone" in which AssemblyAI transcribes speech that your VAD hasn't detected yet, which can delay interruption handling. AssemblyAI's `vad_threshold` defaults to 0.3, while Pipecat's Silero VAD defaults to 0.7, so we recommend lowering Silero to 0.3 to match.
212248
- **Turn detection modes**:
213249
- **Pipecat mode** (`vad_force_turn_endpoint=True`, default): Forces AssemblyAI to return finals ASAP so Pipecat's turn detection (e.g., Smart Turn) decides when the user is done. The service sends a `ForceEndpoint` message when VAD detects the user has stopped speaking.
214250
- **AssemblyAI mode** (`vad_force_turn_endpoint=False`, u3-rt-pro only): AssemblyAI's model controls turn endings using built-in turn detection. The service emits `UserStartedSpeakingFrame` and `UserStoppedSpeakingFrame` based on AssemblyAI's detection.
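
The "dead zone" described in the VAD threshold alignment note can be illustrated with a small standalone sketch. It does not use Pipecat or the AssemblyAI API: the per-frame confidences are made-up values, the helper names `is_speech` and `dead_zone_frames` are hypothetical, and it assumes, as a simplification, that both sides score each frame with the same confidence. The only facts taken from the notes are the two defaults (0.3 and 0.7) and the rule that frames below the threshold count as silence.

```python
# Defaults quoted in the notes above.
ASSEMBLYAI_VAD_THRESHOLD = 0.3   # AssemblyAI API default for vad_threshold
SILERO_DEFAULT_CONFIDENCE = 0.7  # Pipecat's Silero VAD default


def is_speech(confidence: float, threshold: float) -> bool:
    """Frames below the threshold are silent; everything else is speech."""
    return confidence >= threshold


def dead_zone_frames(confidences, stt_threshold, vad_threshold):
    """Return indices of frames the STT side treats as speech
    while the local VAD still treats them as silence."""
    return [
        i
        for i, c in enumerate(confidences)
        if is_speech(c, stt_threshold) and not is_speech(c, vad_threshold)
    ]


# Hypothetical per-frame speech confidences (illustrative only).
frames = [0.1, 0.35, 0.5, 0.65, 0.8, 0.9]

# With the defaults, frames scoring between 0.3 and 0.7 fall into the dead zone.
misaligned = dead_zone_frames(
    frames, ASSEMBLYAI_VAD_THRESHOLD, SILERO_DEFAULT_CONFIDENCE
)

# With both thresholds at 0.3, no frame is classified differently.
aligned = dead_zone_frames(frames, 0.3, 0.3)

print("dead-zone frames (0.3 vs 0.7):", misaligned)
print("dead-zone frames (0.3 vs 0.3):", aligned)
```

In practice the two detectors run different models and will not produce identical confidences, so matching thresholds narrows the gap rather than guaranteeing it disappears.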
