fix(parakeet-cpp): default batch_max_size to 1 (batching opt-in)

mudler · mudler · commit 795d2ed53948 · 2026-06-01T12:53:40.000Z
Dynamic batching now defaults off (batch_max_size:1, one request at a
time). Raise batch_max_size to opt in: it is a large throughput win on
GPU under concurrent load, but on CPU and low-concurrency setups it only
adds latency, so off is the safer default. The startup log now states
whether batching is on or off, and the audio-to-text docs are updated to
match.
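
For illustration only, a minimal sketch of how the new default resolves.
The optInt below is a hypothetical stand-in that takes a plain []string
of "key:value" options; the backend's real helper takes *pb.ModelOptions
and its body is not part of this diff. batchingEnabled is likewise an
illustrative name for the maxSize > 1 check used by the new log branch.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// optInt is a hypothetical simplification of the backend helper: it scans
// "key:value" option strings and returns the integer value for key, or def
// when the key is absent or its value does not parse as an integer.
func optInt(options []string, key string, def int) int {
	for _, o := range options {
		k, v, ok := strings.Cut(o, ":")
		if !ok || strings.TrimSpace(k) != key {
			continue
		}
		if n, err := strconv.Atoi(strings.TrimSpace(v)); err == nil {
			return n
		}
	}
	return def
}

// batchingEnabled mirrors the check behind the new startup log line:
// batching counts as on only when more than one request can be coalesced.
func batchingEnabled(maxSize int) bool { return maxSize > 1 }

func main() {
	// First case: no options set, so the default of 1 applies (batching off).
	// Second case: the model YAML opts in with batch_max_size:8.
	for _, opts := range [][]string{nil, {"batch_max_size:8"}} {
		n := optInt(opts, "batch_max_size", 1)
		fmt.Printf("options=%v batch_max_size=%d batching=%v\n", opts, n, batchingEnabled(n))
	}
}
```

With no options the sketch resolves batch_max_size to 1 and reports
batching off; only an explicit value above 1 turns it on.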

Assisted-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
diff --git a/backend/go/parakeet-cpp/goparakeetcpp.go b/backend/go/parakeet-cpp/goparakeetcpp.go
@@ -121,10 +121,12 @@ func (p *ParakeetCpp) Load(opts *pb.ModelOptions) error {
 	}
 	p.ctxPtr = ctx
 
-	// Dynamic batching knobs (model YAML options:, key:value form). On GPU,
-	// coalescing concurrent requests into one batched engine call improves
-	// throughput; set batch_max_size:1 to disable (recommended on CPU).
-	maxSize := optInt(opts, "batch_max_size", 8)
+	// Dynamic batching knobs (model YAML options:, key:value form). Batching is
+	// OFF by default (batch_max_size:1): each request runs on its own. On GPU,
+	// raising batch_max_size coalesces concurrent requests into one batched
+	// engine call and improves throughput under load; leave it at 1 on CPU and
+	// for low-concurrency setups, where batching only adds latency.
+	maxSize := optInt(opts, "batch_max_size", 1)
 	maxWaitMs := optInt(opts, "batch_max_wait_ms", 15)
 	if maxWaitMs < 0 {
 		maxWaitMs = 0
@@ -133,8 +135,13 @@ func (p *ParakeetCpp) Load(opts *pb.ModelOptions) error {
 		p.batStop = make(chan struct{})
 		p.bat = newBatcher(maxSize, time.Duration(maxWaitMs)*time.Millisecond, p.runBatch)
 		go p.bat.run(p.batStop) // dispatcher runs until Free closes batStop
-		xlog.Info("parakeet-cpp: dynamic batching enabled",
-			"batch_max_size", maxSize, "batch_max_wait_ms", maxWaitMs)
+		if maxSize > 1 {
+			xlog.Info("parakeet-cpp: dynamic batching enabled",
+				"batch_max_size", maxSize, "batch_max_wait_ms", maxWaitMs)
+		} else {
+			xlog.Info("parakeet-cpp: dynamic batching off (batch_max_size=1); " +
+				"set batch_max_size>1 to coalesce concurrent requests on GPU")
+		}
 	} else {
 		xlog.Info("parakeet-cpp: batched C-API not present in libparakeet.so; " +
 			"batching disabled, using per-request transcription")
diff --git a/docs/content/features/audio-to-text.md b/docs/content/features/audio-to-text.md
@@ -189,19 +189,19 @@ For real-time use, load a cache-aware streaming model (e.g. `realtime_eou_120m-v
 
 ### Dynamic batching
 
-The backend coalesces concurrent transcription requests into a single batched engine call, which improves throughput on GPU when many requests arrive at once. Two `options:` knobs control it:
+The backend can coalesce concurrent transcription requests into a single batched engine call, which improves throughput on GPU when many requests arrive at once. Batching is **off by default** (`batch_max_size:1`, one request at a time); raise it to opt in. Two `options:` knobs control it:
 
 ```yaml
 name: parakeet-110m
 backend: parakeet-cpp
 parameters:
   model: tdt_ctc-110m-f16.gguf
 options:
-- batch_max_size:8      # max requests coalesced into one batch (default 8)
+- batch_max_size:8      # max requests coalesced into one batch (default 1 = off)
 - batch_max_wait_ms:15  # how long to wait to fill a batch, in ms (default 15)
 ```
 
-Set `batch_max_size:1` to disable batching (requests run one at a time). This is recommended on CPU, where batching does not help and only adds latency. Batching only affects concurrent unary requests; streaming sessions always run on their own.
+By default each request runs on its own. Raise `batch_max_size` (for example 4 to 16) to enable batching; it pays off on GPU under concurrent load, where coalescing the per-step decode GEMMs across requests is a large throughput win. Leave it at 1 on CPU and for low-concurrency setups, where batching only adds latency. Batching only affects concurrent unary requests; streaming sessions always run on their own.
 
 ## See also