GitHub - khan5v/kalibra: Statistical regression detection and CI quality gates for AI agents

Kalibra
Regression detection and CI quality gates for AI agents.

Success rate: 80% → 80%. Duration: flat. Tokens: flat. Everything looks the same — but 2 task types that always passed started failing, and 2 that always failed started passing. The aggregate hid it. The per-task breakdown caught it.

pip install kalibra
kalibra compare baseline.jsonl current.jsonl -v
kalibra demo    # try it with sample data

Statistical rigor — bootstrap 95% CIs on continuous metrics, two-proportion z-test on rates
Quality gates — regressions <= 2 fails your CI pipeline (exit 1) when thresholds are violated
Per-task and per-span breakdown — catches regressions that cancel out in the aggregate
Two dependencies — click + pyyaml. No ML frameworks, no API keys, no LLM calls

Quality gates for CI

# kalibra.yml
baseline:
  path: ./baselines/production.jsonl
current:
  path: ./eval-output/canary.jsonl

require:
  - success_rate_delta >= -2     # max 2pp success rate drop
  - regressions <= 5             # max 5 tasks regressed
  - cost_delta_pct <= 20         # max 20% cost increase

kalibra compare        # reads kalibra.yml, exits 1 on failure

GitHub Actions

- uses: khan5v/kalibra-action@v1
  with:
    baseline: baselines/production.jsonl
    current: current.jsonl
    config: kalibra.yml

Posts a markdown report as a PR comment. Exits 1 on gate failure.

Full workflow example

name: Agent Quality Gate
on: [pull_request]

jobs:
  kalibra:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
      - run: python eval.py --output current.jsonl
      - uses: khan5v/kalibra-action@v1
        with:
          baseline: baselines/production.jsonl
          current: current.jsonl
          config: kalibra.yml

Integrations

Kalibra auto-detects trace formats. Each tutorial works without an API key.

Integration	Trace format	Demo scenario
Phoenix / OpenInference	`llm.`, `openinference.`	Multi-step agent with span tree aggregation
OTel GenAI	`gen_ai.*`	Truncation regression hidden by aggregate improvement
CrewAI	Flat JSONL	Failure redistribution and cost explosion

Filtering with where

Split a single trace file into populations using Prometheus-style matchers:

sources:
  baseline:
    path: ./traces.jsonl
    where:
      - variant == baseline
  current:
    path: ./traces.jsonl
    where:
      - variant == current

Operators: == (equal), != (not equal), =~ (regex match), !~ (regex not match). Multiple matchers are ANDed. Traces missing the field are excluded.

Field mapping

Kalibra works with any JSONL shape. Map your fields in config or on the command line:

fields:
  outcome: metadata.result
  cost: agent_cost.total_cost
  task_id: metadata.task_name

kalibra compare a.jsonl b.jsonl --outcome metadata.result --cost usage.total_cost

Override fields per source for different schemas:

baseline:
  path: ./langfuse.jsonl
  fields: { outcome: metadata.result, cost: usage.total_cost }
current:
  path: ./braintrust.jsonl
  fields: { outcome: scores.correctness, cost: metrics.cost }

Python API

from kalibra.loader import load_traces
from kalibra.engine import compare
from kalibra.renderers import render

baseline = load_traces("baseline.jsonl")
current = load_traces("current.jsonl")

result = compare(baseline, current, require=["success_rate_delta >= -5"])
print(render(result, "terminal", verbose=True))
print("passed:", result.passed)

Development

git clone https://github.com/khan5v/kalibra.git
cd kalibra
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
src/kalibra		src/kalibra
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Quality gates for CI

GitHub Actions

Integrations

Development

About

Uh oh!

Releases 5

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Quality gates for CI

GitHub Actions

Integrations

Development

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages