Commit 447708f

feat(realtime): EOU-driven semantic_vad turn detection

Add a `semantic_vad` turn-detection mode to the realtime API that feeds the transcription model live and decides "the user finished speaking" from the `<EOU>` end-of-utterance token rather than from silence alone. When EOU fires the turn commits immediately (~0.3s); otherwise it falls back to an eagerness-scaled silence threshold (low/med/high = 8/4/2s).

Plumbing, bottom to top:

- proto: `AudioTranscriptionLive` bidirectional RPC (config-first oneof, mono float PCM @16k, ready-ack / Unimplemented degrade signal) plus `TranscriptResult.eou` for the unary retranscribe gate.
- pkg/grpc: client/server/base/embed scaffolding for the bidi stream, modeled on AudioTransformStream; release stream conns on terminal Recv.
- parakeet-cpp: live transcription RPC with per-C-call engine locking (one live stream per turn, finalize+free at commit); bump parakeet.cpp to ABI v5 — incremental StreamingMel (no more quadratic per-feed mel recompute that delayed EOU on long turns) and the <EOU>/<EOB> split; strip the literal <EOU>/<EOB> from offline text and set Eou.
- core/backend: LiveTranscriptionSession wrapper + pipeline `turn_detection:` config block (type/eagerness/retranscribe).
- realtime: semantic_vad integration — live input captions streamed as transcription deltas while the user speaks, EOU-immediate commit with eagerness fallback, optional retranscribe gate (batch re-decode must also end in <EOU> to confirm), clause synthesis off the LLM token callback, and per-turn live-transcription / model_load telemetry.
- UI: show the realtime pipeline components as a vertical list.

Docs and tests included; opt-in via the pipeline YAML or per-session `session.update`. Non-streaming STT backends degrade to silence-only.

Assisted-by: Claude Code:claude-opus-4-8 [Read] [Edit] [Write] [Bash]
Assisted-by: Claude Code:claude-fable-5 [Read] [Edit] [Bash]
Signed-off-by: Richard Palethorpe <io@richiejp.com>
1 parent 62c99c1 commit 447708f

49 files changed

Lines changed: 4107 additions & 255 deletions

backend/backend.proto

Lines changed: 44 additions & 0 deletions
@@ -18,6 +18,18 @@ service Backend {
   rpc GenerateVideo(GenerateVideoRequest) returns (Result) {}
   rpc AudioTranscription(TranscriptRequest) returns (TranscriptResult) {}
   rpc AudioTranscriptionStream(TranscriptRequest) returns (stream TranscriptStreamResponse) {}
+  // AudioTranscriptionLive is the bidirectional live-microphone ASR RPC. The
+  // first message MUST carry a Config; subsequent messages carry Audio frames
+  // (mono float PCM at config.sample_rate, 16 kHz default). After a
+  // successful open the backend replies with a single ready ack
+  // (TranscriptLiveResponse{ready:true}); backends or models without
+  // cache-aware streaming support return UNIMPLEMENTED instead. Newly
+  // finalized text streams back as deltas; eou=true marks the model's
+  // end-of-utterance token. One stream spans many utterances (the decoder
+  // resets itself after each EOU). Closing the send side finalizes: the
+  // backend flushes the decoder tail and emits a terminal message carrying
+  // final_result. A second Config mid-stream resets the decode session.
+  rpc AudioTranscriptionLive(stream TranscriptLiveRequest) returns (stream TranscriptLiveResponse) {}
   rpc TTS(TTSRequest) returns (Result) {}
   rpc TTSStream(TTSRequest) returns (stream Reply) {}
   rpc SoundGeneration(SoundGenerationRequest) returns (Result) {}
@@ -479,13 +491,45 @@ message TranscriptResult {
   string text = 2;
   string language = 3;
   float duration = 4;
+  // True when the decode ended on the model's end-of-utterance special token
+  // (<EOU>/<EOB>, emitted by cache-aware streaming models such as
+  // parakeet_realtime_eou_120m-v1). The marker itself is stripped from text.
+  bool eou = 5;
 }

 message TranscriptStreamResponse {
   string delta = 1;
   TranscriptResult final_result = 2;
 }

+// === AudioTranscriptionLive messages =====================================
+
+message TranscriptLiveRequest {
+  oneof payload {
+    TranscriptLiveConfig config = 1;
+    TranscriptLiveAudio audio = 2;
+  }
+}
+
+message TranscriptLiveConfig {
+  string language = 1; // "" => model default
+  int32 sample_rate = 2; // 0 => 16000; backends may reject others
+  map<string, string> params = 3; // backend-specific tuning
+}
+
+message TranscriptLiveAudio {
+  repeated float pcm = 1; // mono PCM in [-1,1] at config.sample_rate
+}
+
+message TranscriptLiveResponse {
+  bool ready = 1; // open ack: sent once, before any delta
+  string delta = 2; // newly-finalized text since previous response
+  bool eou = 3; // <EOU> fired during this feed (the user yielded the turn)
+  repeated TranscriptWord words = 4; // words finalized by this feed (stream-relative ns)
+  TranscriptResult final_result = 5; // terminal message only, after the send side closes
+  bool eob = 6; // <EOB> fired: a backchannel ("uh-huh") ended — NOT a turn boundary
+}
+
 message TranscriptWord {
   int64 start = 1;
   int64 end = 2;

backend/go/parakeet-cpp/Makefile

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,10 @@
 # That's what the L0 smoke test uses. The default target below does the
 # proper clone-at-pin + cmake build so CI doesn't need a side-checkout.

+# ABI v5: incremental StreamingMel (live feeds no longer recompute the full mel
+# per call, which fell behind real time and delayed <EOU> by seconds on long
+# turns) plus the <EOU>/<EOB> split (eou_out bitmask + JSON "eob" field) so
+# backchannels are not mistaken for turn boundaries.
 PARAKEET_VERSION?=db755a78d39f789bb7d4e3935158a9e8105dbe36
 PARAKEET_REPO?=https://github.com/mudler/parakeet.cpp
