Commit 795d2ed

fix(parakeet-cpp): default batch_max_size to 1 (batching opt-in)
Dynamic batching now defaults off (batch_max_size:1, one request at a time). Raise batch_max_size to opt in: it is a large throughput win on GPU under concurrent load, but on CPU and low-concurrency setups it only adds latency, so off is the safer default. The startup log now states whether batching is on or off, and the audio-to-text docs are updated to match.

Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
1 parent c1bb48e commit 795d2ed

2 files changed

Lines changed: 16 additions & 9 deletions

backend/go/parakeet-cpp/goparakeetcpp.go

Lines changed: 13 additions & 6 deletions
@@ -121,10 +121,12 @@ func (p *ParakeetCpp) Load(opts *pb.ModelOptions) error {
 }
 p.ctxPtr = ctx
 
-// Dynamic batching knobs (model YAML options:, key:value form). On GPU,
-// coalescing concurrent requests into one batched engine call improves
-// throughput; set batch_max_size:1 to disable (recommended on CPU).
-maxSize := optInt(opts, "batch_max_size", 8)
+// Dynamic batching knobs (model YAML options:, key:value form). Batching is
+// OFF by default (batch_max_size:1): each request runs on its own. On GPU,
+// raising batch_max_size coalesces concurrent requests into one batched
+// engine call and improves throughput under load; leave it at 1 on CPU and
+// for low-concurrency setups, where batching only adds latency.
+maxSize := optInt(opts, "batch_max_size", 1)
 maxWaitMs := optInt(opts, "batch_max_wait_ms", 15)
 if maxWaitMs < 0 {
 maxWaitMs = 0
@@ -133,8 +135,13 @@ func (p *ParakeetCpp) Load(opts *pb.ModelOptions) error {
 p.batStop = make(chan struct{})
 p.bat = newBatcher(maxSize, time.Duration(maxWaitMs)*time.Millisecond, p.runBatch)
 go p.bat.run(p.batStop) // dispatcher runs until Free closes batStop
-xlog.Info("parakeet-cpp: dynamic batching enabled",
-"batch_max_size", maxSize, "batch_max_wait_ms", maxWaitMs)
+if maxSize > 1 {
+xlog.Info("parakeet-cpp: dynamic batching enabled",
+"batch_max_size", maxSize, "batch_max_wait_ms", maxWaitMs)
+} else {
+xlog.Info("parakeet-cpp: dynamic batching off (batch_max_size=1); " +
+"set batch_max_size>1 to coalesce concurrent requests on GPU")
+}
 } else {
 xlog.Info("parakeet-cpp: batched C-API not present in libparakeet.so; " +
 "batching disabled, using per-request transcription")

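The diff changes only the default passed to `optInt`; the helper itself is not shown. As a rough illustration of how a `key:value` option with an integer default might be resolved, here is a minimal sketch. The name `optIntSketch` and its `[]string` parameter are assumptions made for this example; the real `optInt` takes `*pb.ModelOptions` and may behave differently (for instance on malformed values).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// optIntSketch is a hypothetical stand-in for the backend's optInt helper.
// It scans "key:value" option strings and returns the integer value for key,
// or def when the key is absent or its value is not a valid integer.
func optIntSketch(options []string, key string, def int) int {
	for _, o := range options {
		k, v, ok := strings.Cut(o, ":")
		if !ok || strings.TrimSpace(k) != key {
			continue
		}
		n, err := strconv.Atoi(strings.TrimSpace(v))
		if err != nil {
			return def // assumption: fall back to the default on bad input
		}
		return n
	}
	return def
}

func main() {
	opts := []string{"batch_max_wait_ms:15"}
	// batch_max_size is not set, so the new default of 1 (batching off) applies.
	fmt.Println(optIntSketch(opts, "batch_max_size", 1))
	// batch_max_wait_ms is set, so its configured value wins over the default.
	fmt.Println(optIntSketch(opts, "batch_max_wait_ms", 0))
}
```

With a helper of this shape, a model YAML that omits `batch_max_size` gets 1 after this commit, where it previously got 8.
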
docs/content/features/audio-to-text.md

Lines changed: 3 additions & 3 deletions
@@ -189,19 +189,19 @@ For real-time use, load a cache-aware streaming model (e.g. `realtime_eou_120m-v
 
 ### Dynamic batching
 
-The backend coalesces concurrent transcription requests into a single batched engine call, which improves throughput on GPU when many requests arrive at once. Two `options:` knobs control it:
+The backend can coalesce concurrent transcription requests into a single batched engine call, which improves throughput on GPU when many requests arrive at once. Batching is **off by default** (`batch_max_size:1`, one request at a time); raise it to opt in. Two `options:` knobs control it:
 
 ```yaml
 name: parakeet-110m
 backend: parakeet-cpp
 parameters:
   model: tdt_ctc-110m-f16.gguf
 options:
-- batch_max_size:8 # max requests coalesced into one batch (default 8)
+- batch_max_size:8 # max requests coalesced into one batch (default 1 = off)
 - batch_max_wait_ms:15 # how long to wait to fill a batch, in ms (default 15)
 ```
 
-Set `batch_max_size:1` to disable batching (requests run one at a time). This is recommended on CPU, where batching does not help and only adds latency. Batching only affects concurrent unary requests; streaming sessions always run on their own.
+By default each request runs on its own. Raise `batch_max_size` (for example 4 to 16) to enable batching; it pays off on GPU under concurrent load, where coalescing the per-step decode GEMMs across requests is a large throughput win. Leave it at 1 on CPU and for low-concurrency setups, where batching only adds latency. Batching only affects concurrent unary requests; streaming sessions always run on their own.
 
 ## See also
