Two new runtime-config fields, end to end (DB migrations 16/17 → runtimecfg →
admin API → openapi → dashboard):
- chunk_max_concurrent — the wasm chunker's instance-concurrency cap,
decoupled from embedding concurrency; resizes the live limiter without a
restart. Env: CIX_CHUNK_MAX_CONCURRENT; per-instance memory knobs stay
env-only (CIX_CHUNK_MEM_LIMIT_PAGES, CIX_CHUNK_RECYCLE_GROWTH_MB,
CIX_CHUNK_MAX_IDLE).
- llama_cache_ram_mib — llama-server's HOST prompt cache cap (--cache-ram).
Upstream defaults this to 8 GiB (ggml-org/llama.cpp#16391), which is pure
waste for an embeddings-only sidecar: prompts are never reused, but the
cache fills anyway. Observed on prod: llama-server RSS 365MB→11.3GB within
minutes of indexing vscode@main, then cgroup OOM kill — twice at the 10G
limit, again at 16G. With --cache-ram 0 (our default; -1 = unlimited) it
plateaus at ~900MB under the same load. Env: CIX_LLAMA_CACHE_RAM; shown in
the dashboard's Runtime parameters card, applied via Save & Restart.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
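The "resizes the live limiter without a restart" behaviour of chunk_max_concurrent can be illustrated with a small sketch. This commit does not show the server's limiter, so the class below (ResizableLimiter and all of its method names) is hypothetical and only demonstrates the idea: raising the cap wakes queued callers immediately, while lowering it never interrupts in-flight work and simply stops granting slots until usage drains below the new cap.

```typescript
// Hypothetical sketch of a live-resizable concurrency limiter.
// Not the project's implementation; names are illustrative only.
class ResizableLimiter {
  private active = 0;
  private waiters: Array<() => void> = [];

  constructor(private limit: number) {
    ResizableLimiter.check(limit);
  }

  private static check(n: number): void {
    if (!Number.isInteger(n) || n < 1) {
      throw new RangeError("limit must be a positive integer");
    }
  }

  /** Take a slot if one is free; returns false when at the cap. */
  tryAcquire(): boolean {
    if (this.active >= this.limit) return false;
    this.active++;
    return true;
  }

  /** Resolve once a slot has been granted (immediately if one is free). */
  acquire(): Promise<void> {
    if (this.tryAcquire()) return Promise.resolve();
    return new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  /** Return a slot and hand it to the next waiter, if the cap allows. */
  release(): void {
    if (this.active === 0) throw new Error("release without acquire");
    this.active--;
    this.drain();
  }

  /** Change the cap at runtime; growing wakes waiters, shrinking drains lazily. */
  resize(newLimit: number): void {
    ResizableLimiter.check(newLimit);
    this.limit = newLimit;
    this.drain();
  }

  private drain(): void {
    while (this.active < this.limit && this.waiters.length > 0) {
      this.active++;
      const next = this.waiters.shift()!;
      next();
    }
  }

  get inUse(): number {
    return this.active;
  }
}
```

A caller would wrap each chunking job in acquire()/release(), and the admin API handler would call resize() when chunk_max_concurrent changes.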
doc/openapi.yaml (21 additions, 0 deletions)
@@ -3245,6 +3245,8 @@ components:
   - max_embedding_concurrency
   - llama_batch_size
   - index_embed_batch_chunks
+  - chunk_max_concurrent
+  - llama_cache_ram_mib
   - source
 properties:
   embedding_model:
@@ -3270,6 +3272,14 @@ components:
     type: integer
     minimum: 0
     description: Cross-file embed-batch size for repo indexing (chunks per embed call). 0 = one call per file.
+  chunk_max_concurrent:
+    type: integer
+    minimum: 0
+    description: Chunker (tree-sitter wasm) instance-concurrency cap, decoupled from embedding concurrency. Each instance holds ~69 MiB. 0 = recommended (3).
+  llama_cache_ram_mib:
+    type: integer
+    minimum: -1
+    description: llama-server host prompt-cache cap in MiB (--cache-ram). 0 = disabled (recommended for embeddings — prompts are never reused, and llama's upstream 8 GiB default grows RSS until the container OOMs), -1 = unlimited.
   source:
     type: object
     additionalProperties:
@@ -3301,6 +3311,8 @@ components:
   - max_embedding_concurrency
   - llama_batch_size
   - index_embed_batch_chunks
+  - chunk_max_concurrent
+  - llama_cache_ram_mib
 properties:
   embedding_model: { type: string }
   llama_ctx_size: { type: integer }
@@ -3309,6 +3321,8 @@ components:
     max_embedding_concurrency: { type: integer }
     llama_batch_size: { type: integer }
     index_embed_batch_chunks: { type: integer }
+    chunk_max_concurrent: { type: integer }
+    llama_cache_ram_mib: { type: integer }
 
 RuntimeConfigUpdate:
   type: object
@@ -3339,6 +3353,13 @@ components:
 index_embed_batch_chunks:
   type: integer
   nullable: true
+chunk_max_concurrent:
+  type: integer
+  nullable: true
+llama_cache_ram_mib:
+  type: integer
+  nullable: true
+  description: MiB; 0 clears the override (falls back to env / recommended = disabled), -1 = unlimited.
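The fallback order described for llama_cache_ram_mib (a stored override, then CIX_LLAMA_CACHE_RAM, then the recommended value of 0) can be sketched as below. This is an illustration under stated assumptions, not the server's code: the function names, the Source labels, and the exact validation are hypothetical. Only the field semantics (0 clears the override, -1 = unlimited, minimum -1) and the --cache-ram flag come from this commit.

```typescript
// Hypothetical resolution of the effective llama_cache_ram_mib value.
// Labels and helper names are illustrative, not taken from the codebase.
type Source = "override" | "env" | "recommended";

interface Resolved {
  value: number; // MiB; 0 = disabled, -1 = unlimited
  source: Source;
}

// Recommended default per the commit message: prompt cache disabled.
const RECOMMENDED_CACHE_RAM_MIB = 0;

function checkMiB(n: number, what: string): number {
  // The schema declares an integer with minimum -1.
  if (!Number.isInteger(n) || n < -1) {
    throw new RangeError(`${what}: expected an integer >= -1`);
  }
  return n;
}

function resolveCacheRamMiB(
  override: number | null | undefined,
  envValue: string | undefined,
): Resolved {
  // A null/absent override, or 0, means "no override": fall through.
  if (override != null && override !== 0) {
    return { value: checkMiB(override, "override"), source: "override" };
  }
  if (envValue !== undefined && envValue.trim() !== "") {
    const parsed = Number(envValue);
    return { value: checkMiB(parsed, "CIX_LLAMA_CACHE_RAM"), source: "env" };
  }
  return { value: RECOMMENDED_CACHE_RAM_MIB, source: "recommended" };
}

/** Command-line fragment passed to llama-server for the resolved value. */
function cacheRamArgs(mib: number): string[] {
  return ["--cache-ram", String(mib)];
}
```

For example, resolveCacheRamMiB(0, "2048") yields the env value 2048, because an override of 0 is treated as cleared, while resolveCacheRamMiB(null, undefined) yields the recommended 0.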
@@ -64,9 +66,11 @@ export function RuntimeParamsSection({
   draftCtx,
   draftGpuLayers,
   draftThreads,
+  draftCacheRAM,
   onDraftCtx,
   onDraftGpuLayers,
   onDraftThreads,
+  onDraftCacheRAM,
 }: Props) {
   const rec = config?.recommended;
   const src = config?.source;
@@ -109,6 +113,16 @@ export function RuntimeParamsSection({
   source={src?.llama_n_threads}
   onChange={onDraftThreads}
 />
+<NumberField
+  field="llama_cache_ram_mib"
+  label="Host prompt cache (MiB)"
+  hint="llama-server's in-RAM prompt cache (--cache-ram). Embeddings never reuse prompts, so 0 (disabled) is right for cix — llama's own 8192 MiB default only inflates RSS until the container hits its memory limit. -1 = unlimited."