Operational tests (verify / verify-full / verify-stress / bench / soak-test) tell you whether a compose serves correctly. They don't tell you whether the model behaves correctly — whether tool calls land on the right functions, whether instruction-follow constraints hold, whether structured-output stays valid JSON. A compose can pass every operational layer and still ship with degraded behavioral quality from quantization drift or a Genesis env-var flip.
scripts/quality-test.sh closes this gap. It wraps benchlocal-cli — a CLI port of BenchLocal bench packs — and runs verifier-backed scenarios against the running compose endpoint.
verify.sh — fast smoke (15s, "does it serve")
verify-full.sh — functional (1-2min, "does everything work")
verify-stress.sh — boundary (5-10min, "does it survive stress")
bench.sh — throughput (3-5min, "what's the TPS")
quality-test.sh — behavioral (10-90min, "does it produce useful output") ← THIS
soak-test.sh — stability (30-60min, "does it stay healthy over time")
Each layer has a different question. Quality testing is the one that catches "passed every other gate but produces wrong tool calls or violates format constraints."
Five deterministic packs (verifier-backed, no LLM-as-judge — these run without Docker):
| Pack | Dimension | Why it matters for club-3090 users |
|---|---|---|
| ToolCall-15 | Tool selection + argument correctness | IDE-agent traffic (Cline / OpenCode / Cursor) is 100% tool calls. Genesis env flips like P68/P69 cause silent-empty regressions that die here. |
| InstructFollow-15 | Constraint-heavy instruction compliance | Catches "ignore the format constraint" drift from cudagraph mode changes or sampling tweaks. |
| StructOutput-15 | JSON / YAML / markdown structure validity | Bounded-thinking, JSON tool args, FSM-constrained reasoning. |
| ReasonMath-15 | Numeric reasoning | Code-reasoning correctness; Q4-quant drift surfaces here first. |
| DataExtract-15 | Field-level extraction accuracy | RAG / document-Q&A workloads. |
Three sandboxed packs add execution-backed verification via Docker sandboxes. They're included in --full (need Docker; --no-sandboxed skips them):
| Pack | Verifier | Why it matters for club-3090 users |
|---|---|---|
| BugFind-15 | Candidate-fix execution sandbox | Code-repair quality + trap-scenario discipline (no false "found a bug"). |
| HermesAgent-20 | Multi-tool agent harness (browser / cron / memory / artifact mocks) | Multi-step agentic workflows — chained tool calls, recall, delegation. Closest proxy for IDE-agent stacks. |
| CLI-40 | Linux command-exec sandbox | Shell/CLI agent tasks (terminal agents like Claude Code / opencode). |
A separate eval-expansion pack, AiderPolyglot-30 (multi-language code editing across cpp/go/java/js/python/rust), runs independently — not bundled into --quick/--medium/--full. Drive it via benchlocal-cli run --pack aider-polyglot-30 --enable-sandboxed-packs, or as the aider leg of rebench-full.sh.
The reasoning suite is also separate from --full; run it with --reasoning when you specifically want code/math/science reasoning signal under thinking-on pack defaults:
| Pack | Verifier | Why it matters for club-3090 users |
|---|---|---|
| HumanEval+-30 | Code execution sandbox over HumanEval+ functional tests | Small Python coding tasks; catches code-reasoning regressions quickly. |
| LiveCodeBench-v6-30 | Code execution sandbox over public LCB functional tests | Harder post-2025 coding tasks; exposes budget runaway and algorithmic failures. |
| GSM-Symbolic-30 | Deterministic answer_match exact numeric scoring |
Symbolic grade-school math without LLM-as-judge. |
| GPQA-Diamond | Deterministic answer_match exact letter scoring |
Science QA placeholder; gated metadata-only until dataset access is materialized, so it reports dataset-unavailable instead of committing restricted data. |
| Mode | Packs | Budget | When to run |
|---|---|---|---|
--quick |
ToolCall + InstructFollow (2) | ~10-15 min | Per-commit gate; pre-push smoke. The two packs that catch the highest-value regressions for IDE-agent users. No Docker. |
--medium (default) |
+ StructOutput + DataExtract + ReasonMath (5) | ~25-30 min | Pre-release; pin bumps; new compose authoring. Generates the Quality: line for the compose schema. No Docker. |
--full |
+ BugFind + HermesAgent + CLI (8) | ~45-60 min | Cross-rig comparison; quality A/B vs another quant. The 3 added packs are Docker-sandboxed — needs Docker. |
--reasoning |
HumanEval+ + LiveCodeBench v6 + GSM-Symbolic + GPQA-Diamond metadata (4) | ~30-90+ min | Dedicated reasoning/code suite. Thinking defaults on for all 4 packs; HumanEval+ and LCB need Docker. |
--full runs the sandbox packs by default. --no-sandboxed drops --full back to the 5-pack deterministic scope (no Docker); --sandboxed-only runs just the 3 sandbox packs. --reasoning is independent of --full; use it for the four reasoning packs, with GPQA skipped until gated data is available.
⚠️ --fullneeds the sandbox images built first — "needs Docker" isn't enough. The 3 sandboxed packs run inside pre-builtbenchlocal-sandbox-*Docker images that are not auto-pulled, and the build tooling is not in thepip install— it lives in a benchlocal-cli checkout. Build them once:git clone https://github.com/noonghunna/benchlocal-cli bash benchlocal-cli/tools/build-sandboxes.sh # ~30 GB free; the aider image is biggest — `docker system prune` if tightThen
--fullworks. Without the images,quality-test.shwarns up front and runs the deterministic packs only; for a clean no-Docker run use--medium(or--no-sandboxed). (club-3090 #492 — benchlocal-cli's own mid-run hint pointed at a relative path that's wrong outside a checkout.)
pip install git+https://github.com/noonghunna/benchlocal-cli.gitOr for development from a local clone of benchlocal-cli:
pip install -e /path/to/benchlocal-cliLive progress is on by default. The wrapper forwards
--progressto benchlocal-cli, so per-scenario[N/M] <pack> <id> …lines stream to stderr as the run advances. Long modes (--full~30–40 min,--reasoningsimilar,--pack aider-polyglot-30~25–30 min) otherwise go dark for the whole duration with no signal whether anything is wrong mid-run. Pass--no-progress(orPROGRESS=0) for CI / log-volume-sensitive contexts.
# default --medium against the auto-detected running compose
bash scripts/quality-test.sh
# faster mode (per-commit gate)
bash scripts/quality-test.sh --quick
# full mode (pin bumps, cross-rig comparison)
bash scripts/quality-test.sh --full
# dedicated reasoning suite (thinking-on pack defaults; code packs need Docker)
bash scripts/quality-test.sh --reasoning
# explicit endpoint override
URL=http://localhost:8011 bash scripts/quality-test.sh --quick
# suppress live [N/M] progress for CI / log-volume contexts
bash scripts/quality-test.sh --full --no-progress
# --full includes the 3 Docker-sandboxed packs by default (BugFind/HermesAgent/CLI) — needs Docker
bash scripts/quality-test.sh --full
# skip the sandbox packs (drops --full to the 5-pack deterministic scope, no Docker)
bash scripts/quality-test.sh --full --no-sandboxed
# run ONLY the 3 sandbox packs
bash scripts/quality-test.sh --sandboxed-only
# run individual reasoning packs
bash scripts/quality-test.sh --pack humaneval-plus-30 --enable-thinking --thinking-max-tokens 16384 --timeout-per-case 300
bash scripts/quality-test.sh --pack lcb-v6-30 --enable-thinking --thinking-max-tokens 16384 --timeout-per-case 300
bash scripts/quality-test.sh --pack gsm-symbolic-30
bash scripts/quality-test.sh --pack gpqa-diamondOutput:
- Markdown table to stdout — paste-ready for BENCHMARKS quality rows
- JSON to
results/quality/quality-<timestamp>.json— full per-scenario detail for delta tracking - One-liner suitable for the compose
Quality:schema field — paste into compose YAML header
Example output:
=== benchlocal-cli --medium (endpoint: http://localhost:8020, model: qwen3.6-27b) ===
Pack | Pass / Total | Score | p50 latency | p95 latency | Status
ToolCall-15 (v1.0.1) | 14 / 15 | 93% | 8.2s | 12.1s | ✅
InstructFollow-15 (v1.0.0) | 13 / 15 | 87% | 11.4s | 17.8s | ✅
StructOutput-15 (v1.0.0) | 15 / 15 | 100% | 6.9s | 9.2s | ✅
DataExtract-15 (v1.0.0) | 12 / 15 | 80% | 7.3s | 10.5s | ✅
ReasonMath-15 (v1.0.0) | 11 / 15 | 73% | 14.2s | 22.6s | ✅
─────────────────────────|──────────────|───────|─────────────|─────────────|──────
TOTAL | 65 / 75 | 87% | | |
Failure breakdown:
- toolcall-15 TC-07: verifier_fail (wrong arg value for "filename": expected report.pdf, got output.pdf)
- instructfollow-15 IF-03: verifier_fail (word count 247, target 250 ±5)
- dataextract-15 DE-05: verifier_fail (7/14 atomic fields correct (50%). product_name: mismatch | product_price_paid: expected number)
- reasonmath-15 RM-09: verifier_fail (expected 42, got 45)
==========================================================================
Quality: line for compose schema field (paste into compose YAML header):
==========================================================================
Quality: ToolCall-15 14/15 (93%) · InstructFollow-15 13/15 (87%) · StructOutput-15 15/15 (100%) · DataExtract-15 12/15 (80%) · ReasonMath-15 11/15 (73%) (--medium, packs v1.0.x, 2026-05-09)
Failure reasons are surfaced in three places, cheapest first:
| Need | Where |
|---|---|
| Why a scenario failed (reason + detail), run just finished | The Failure breakdown: block at the end of every run — pack scenario: failure_mode (detail), full detail string. No extra command. |
| Same, but the run scrolled away / an older run | results/quality/quality-<ts>.json (raw), or benchlocal-cli inspect <json> --failed |
| The full prompt / response / verifier trace behind a failure | benchlocal-cli inspect <json> --scenario <ID> --full |
| Filter by failure type · compare two runs · per-scenario tokens + latency | benchlocal-cli inspect <json> --mode timeout · --diff prev.json |
failure_mode is one of: verifier_fail (answer wrong / below threshold) · timeout · agent_runner_timeout / agent_runner_crashed (sandboxed agentic packs) · server_error / http_error / model_endpoint_unreachable (serving problem, not a quality signal) · result_json_malformed · wrong_answer · verifier_not_implemented (stub, excluded from scoring).
The breakdown is terminal-only — quality-test.sh does not tee it to a log file, but the same data persists in the saved JSON.
quality-test.sh forwards to benchlocal-cli, which sizes each scenario's timeout automatically — you rarely need to set one. Precedence (highest wins):
- Manual —
--timeout-per-case N(orTIMEOUT_PER_CASE=N): used verbatim. - Auto-scaling (default) — the budget scales by the endpoint's measured decode speed and, for thinking-on runs, by the thinking-token budget. A one-shot startup probe measures the rig's decode TPS (and fails fast if the endpoint is unreachable, rather than hanging). The scaling deliberately over-budgets — a timeout is a safety ceiling, not a target — which is what keeps thinking-on packs from spuriously timing out. Exact formula + flags (
--measured-tps/--reference-tps/--retry-on-timeout): benchlocal-cli README → Per-case timeouts. - Static default — the pack's built-in
default_max_seconds.
Don't hand-set --timeout-per-case to "fix" a slow run unless you've confirmed the auto-probe measured wrong — the over-budget is intentional.
Planned (not yet built): an opt-in tier that sizes timeouts from a soak-derived per-context-depth TPS curve — a real "how fast at depth X" measurement for your exact rig/config, captured into the runtime measurement-record — instead of the single empty-context startup probe. It would be strictly opt-in and fall back to the auto-probe/default when no curve exists; measured data is never required. Tracked at #114.
By default the packs sample at temperature 0 (greedy) — deterministic and reproducible, so scores are comparable across rigs and across runs. This is the canonical baseline, and it's what regression tracking and cross-config ranking should use.
Two opt-in modes evaluate a model at a non-zero / model-recommended temperature instead. Both tag the run non-canonical (markdown header + saved JSON) and refuse to gate CI:
| Mode | What it does | When to use |
|---|---|---|
--sampling-from-server |
Omits all sampling params from requests, so the server applies its compose-configured defaults; reads them back from /props (llama.cpp) and records them. The compose is the single source of truth. |
"Evaluate the model exactly as it's served." |
benchlocal-cli … --temperature N (+ --top-p / --top-k / --min-p / --repeat-penalty) |
Eval at sampling values you specify. Mutually exclusive with --sampling-from-server. |
When you know the model's recommended temp and want it explicit and recorded. |
The composes ship model-recommended sampling defaults (Qwen3.6 0.6, Qwopus3.6 0.8, Gemma 1.0), set via the TEMP / TEMPERATURE / TOP_P / TOP_K / MIN_P / REPEAT_PENALTY env (see .env.example). --sampling-from-server inherits whatever the running compose declares — so "serve at the recommended temp" and "eval at the recommended temp" stay in sync from one source.
# canonical (default): temp 0, reproducible — use for ranking + regression tracking
bash scripts/quality-test.sh --full
# evaluate at the model's served / recommended temperature (inherits the compose default)
bash scripts/quality-test.sh --full --sampling-from-server
SAMPLING_FROM_SERVER=1 bash scripts/rebench-full.sh --with-8pack-thinking=both
# or an explicit temperature, via benchlocal-cli directly.
# NB: invoking benchlocal-cli directly BYPASSES the wrapper's localhost guard. With a
# localhost endpoint + a sandboxed *agentic* pack (HermesAgent-20 runs the agent INSIDE
# the sandbox), you must set BENCHLOCAL_HERMES_RESOLVE_LOCALHOST=1 yourself — otherwise the
# in-sandbox agent can't reach the host model and hermes silently scores ~0/20.
# quality-test.sh sets this automatically for localhost URLs (see Limitations).
BENCHLOCAL_HERMES_RESOLVE_LOCALHOST=1 \
benchlocal-cli run --full --endpoint http://localhost:8020 --model <name> --temperature 0.8Serving with a model's reasoning flag enabled is necessary but not sufficient: the request also has to send chat_template_kwargs.enable_thinking=true. benchlocal-cli honors each pack's default_thinking metadata, so the dedicated --reasoning suite defaults thinking on for all four packs while many format/extraction packs stay answer-only. Use --enable-thinking only when you want to force thinking on for every pack in a broader mode such as --full:
# dedicated reasoning suite; default thinking is on for these packs
bash scripts/quality-test.sh --reasoning --thinking-max-tokens 16384
# force thinking on for every full-suite pack
bash scripts/quality-test.sh --full --enable-thinking --thinking-max-tokens 16384
# full rebench incl. the 8-pack in both reasoning modes (off + on — the promotion gate).
# The 8-pack thinking is driven by --with-8pack-thinking (=off forces --no-thinking,
# =on forces --enable-thinking), NOT ENABLE_THINKING (which now only affects bench.sh). #338
THINKING_MAX_TOKENS=16384 SAMPLING_FROM_SERVER=1 bash scripts/rebench-full.sh --with-8pack-thinking=both
# TPS bench only
ENABLE_THINKING=1 bash scripts/bench.shIf /props or the running container suggests reasoning is enabled but the wrapper is not forcing thinking on globally, quality-test.sh / bench.sh print a warning; pack defaults still apply, and --enable-thinking forces every pack on. --thinking-max-tokens now passes through independently and only affects packs whose thinking gate resolves on. The default is 16K; hard LiveCodeBench items may still exhaust that budget, so compare with benchlocal-cli run --reasoning --no-thinking when diagnosing budget runaway.
Why it matters: a reasoning / exploratory fine-tune (e.g. Qwopus3.6, whose author recommends temp 0.75–1) is under-represented at temp 0 or with thinking disabled — greedy, thinking-off decoding collapses the path-exploration the fine-tune was trained for. But high temp and reasoning also hurt deterministic packs (DataExtract / StructOutput want exact, repeatable output), so read per-pack deltas, not just the total — and keep canonical temp-0 thinking-off as the bar for any apples-to-apples ranking.
Each compose's Profile header (per AGENTS.md) can carry an optional Quality: line:
# Profile (at-a-glance):
# Model: Qwen3.6-27B (Lorbus AutoRound INT4 + BF16 mtp.fc preserved)
# Topology: Dual 3090 PCIe (TP=2, no NVLink)
# ...
# Status: ✅ Production
# Quality: ToolCall-15 14/15 (93%) · InstructFollow-15 13/15 (87%) · StructOutput-15 15/15 (100%) · DataExtract-15 12/15 (80%) (--medium, packs v1.0.x, 2026-05-09)
# Best for: General-purpose dual-card vision + tools + long-ctx default ⭐The line documents what the compose was tested on. Cross-rig contributors running quality-test.sh against the same compose can paste their numbers as a sibling row in BENCHMARKS.md.
Compact format (one line) so the schema header doesn't bloat. Full per-scenario detail lives in the JSON saved by quality-test.sh, which can be diffed against past runs for regression tracking.
quality-test.sh does NOT enforce a hard pass/fail threshold. The script always exits 0 if the runner completes; you decide whether the scores are acceptable.
Suggested gates (informal, not enforced):
| Pack | Suggested floor | Notes |
|---|---|---|
| ToolCall-15 | ≥80% | Below this, IDE-agent users will report regressions |
| InstructFollow-15 | ≥80% | Below this, format-constraint workflows break |
| StructOutput-15 | ≥90% | JSON shape failures are visible immediately to users |
| DataExtract-15 | ≥75% | Slightly more tolerant; field-level scoring is granular |
| ReasonMath-15 | ≥60% | Reasoning quality varies more by quant; treat as informational |
For comparing a new pin / quant / config A/B against the previous version: a >10pp drop on any pack vs the previous baseline is a signal worth investigating before promoting Status: ✅ Production.
When a harness fix changes how saved runs score (e.g. the benchlocal-cli #79/#81
fairness + reasoning-channel fixes), re-score the SAVED result JSONs with the
rescore subcommand — and write the corrected results back into the artifact:
benchlocal-cli rescore results/rebench/<tag>/quality-full-thinking.json --in-placerescore re-runs the deterministic scorers against each run's saved
raw_response (sandbox packs like hermes/cli-40 are skipped — those need a live
re-run). Printing the corrected totals to stdout and publishing them without
--in-place/--output leaves the tag artifact stale — anything that later
reads the tag (the catalog-baseline.sh induction tool, rebench-report.py,
the measurement-record corpus) silently resurrects the pre-fix numbers. This
bit the Agents-A1 gate: the published thinking-on 110/150 was rescore-corrected,
but the tag JSON read 108/150 until the rescore was materialized on 2026-07-04.
Rule: a rescore that changes a number you publish must be materialized into the tag artifact in the same session (keep a copy of the pre-rescore JSON elsewhere if you want the history; the tag carries the accepted truth).
Rather than hunt down "the previous baseline" by hand each time, the curated corpus in
results/baselines/ holds committed n≥3 aggregates per
(registry-slug, thinking-mode). scripts/quality-baseline.sh
captures and diffs against them:
# capture / refresh a baseline (n=3 aggregate; needs a live endpoint)
bash scripts/quality-baseline.sh --slug vllm/qwen-35b-a3b-dual --capture
# diff a fresh run vs the committed baseline — the regression check
bash scripts/quality-baseline.sh --slug vllm/qwen-35b-a3b-dual
# thinking-on companion baseline
bash scripts/quality-baseline.sh --slug vllm/qwen-35b-a3b-dual --mode enable-thinkingno-thinking is canonical (temp-0, reproducible — diff against this for a CI-style gate);
enable-thinking is the reasoning-on companion. It's a thin wrapper over quality-test.sh --full
(--repeat → --save-json/--previous-result); extra args pass through (e.g.
--exit-on-regression for a hard CI gate). --dry-run prints the resolved command.
bench.shmeasures throughput, not quality. They're complementary.soak-test.shmeasures stability over time. Quality + soak together catch "fast + correct + healthy."- NIAH (needle-in-haystack) tests in
verify-stress.shmeasure long-context retrieval correctness — a different axis than tool-call / instruction-follow.
- Sandboxed packs need Docker — BugFind / HermesAgent / CLI-40 run in Docker-hosted verifier sandboxes. On a host without Docker, run
--medium(or--full --no-sandboxed) for the 5 deterministic packs. - Sandboxed agentic packs need a container-reachable model URL — HermesAgent-20 runs the agent inside the sandbox, so it calls the model over the network. A
localhost/127.x/[::1]endpoint is the container's own loopback, not the host.quality-test.shauto-detects this and exportsBENCHLOCAL_HERMES_RESOLVE_LOCALHOST=1(rewrites the URL →host.docker.internal+ adds--add-host). If you bypass the wrapper and runbenchlocal-clidirectly against a localhost endpoint, set that env var yourself — otherwise the in-sandbox agent never reaches the model and hermes silently scores ~0/20. Failure signature: uniform ~timeout-length per-scenario latencies + flat GPU (notturn_count, which is0for hermes regardless of engagement). - Verifier translation is lossy in places — the upstream BenchLocal evaluators have partial-credit branches we collapsed to pass/fail. See benchlocal-cli's
docs/EXTRACTOR_NOTES.mdfor the specific surfaces. - Single-run sampling at temperature 0 — each scenario runs once, greedy, by default (see Sampling & temperature for the non-canonical override modes). For non-determinism debugging, use
benchlocal-cli run --pack <id> --repeat N.
For the full pipeline architecture + JSONL pack format, read benchlocal-cli's docs.
If quality-test.sh shows a meaningful regression (e.g., ToolCall-15 drops from 14/15 to 8/15 after a Genesis pin bump), file an issue with:
- The compose name + the change that triggered it (Genesis pin bump? new quant? cudagraph mode?)
- The full JSON output from
results/quality/ - The pre-change baseline JSON for diff
- Output of
bash scripts/report.sh --benchfor context (vLLM image SHA, Genesis commit, hardware)
The JSON blobs include enough per-scenario detail to reproduce specific failing scenarios via benchlocal-cli reproduce (post-v0.2 subcommand) for upstream debugging.