Measure how similar N agent outputs are. Score exact-match rate, Jaccard token overlap, divergence point, and a composite 0–1 convergence score over any list of agent runs.
If you run the same prompt through N agents and want a number for "are they producing N distinct outputs or have they collapsed to one idea?" — this is that number.
- You just ran a fan-out of N agents and eyeballing whether they converged is slow and subjective.
- Your eval harness reports accuracy but not reproducibility; same prompt, two runs, two answers, no metric.
- Multi-agent hackathon or swarm setup; half the agents picked the same target. You want evidence, not vibes.
- LLM temperature study where "temp=0.3 vs temp=0.7" needs a downstream consistency number.
- You caught agents rephrasing each other but there is no column in your CSV for it.
```bash
pip install agent-convergence-scorer
```

Python 3.9+. Zero runtime dependencies (stdlib only).
```bash
echo '{"runs": ["The capital is Paris.", "The capital is Paris.", "The capital is Lyon."]}' \
  | agent-convergence-scorer -
```

Output:

```json
{
  "num_runs": 3,
  "exact_match_rate": 0.667,
  "token_metrics": {
    "avg_overlap": 0.733,
    "jaccard": 1.0
  },
  "convergence_score": 0.703,
  "divergence_point": {
    "diverges_at_token": "paris.",
    "token_position": 3,
    "num_tokens_to_divergence": 3
  }
}
```

Interpret:
- `convergence_score = 0.703` — high but not perfect consistency.
- `exact_match_rate = 0.667` — 2 of 3 runs identical to run 0.
- Divergence at token 3 — they agreed on the prefix "The capital is" then split.
```python
from agent_convergence_scorer import score_runs

runs = [
    "The answer is A",
    "The answer is B",
    "The answer is C",
]

print(score_runs(runs))
# {'num_runs': 3, 'exact_match_rate': 0.333,
#  'token_metrics': {'avg_overlap': 0.6, 'jaccard': 0.6},
#  'convergence_score': 0.497,
#  'divergence_point': {'diverges_at_token': 'a', 'token_position': 3, 'num_tokens_to_divergence': 3}}
```

Individual metrics are importable too: `exact_match_rate`, `token_overlap`, `divergence_point`, `convergence_score`, `tokenize`.
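For example, the divergence point can be turned back into the shared prefix using only the `score_runs` output. The whitespace split below approximates the package's tokenizer (which also lowercases, judging by the output above); it is not the library's own `tokenize`:

```python
from agent_convergence_scorer import score_runs

runs = ["The answer is A", "The answer is B", "The answer is C"]
result = score_runs(runs)

# Tokens before the divergence point are shared by all runs.
n = result["divergence_point"]["num_tokens_to_divergence"]
shared_prefix = " ".join(runs[0].split()[:n])  # whitespace split approximates tokenize
print(shared_prefix)  # The answer is
```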
| Metric | Range | What it measures |
|---|---|---|
| `exact_match_rate` | [0, 1] | Fraction of runs byte-identical to `runs[0]`. Crude reproducibility floor. |
| `token_metrics.jaccard` | [0, 1] | Token-set Jaccard of the first two runs (quick eyeball). |
| `token_metrics.avg_overlap` | [0, 1] | Mean Jaccard over all C(N,2) pairs. Robust to N. |
| `divergence_point.num_tokens_to_divergence` | [0, min_len] | First position where runs disagree. Late divergence = strong shared prefix. |
| `convergence_score` | [0, 1] | Composite: 0.5 * exact_match + 0.3 * avg_overlap + 0.2 * div_distance_norm. |
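As a quick sanity check, the composite from the CLI example above can be reproduced by hand. The normalization of the divergence term (dividing `num_tokens_to_divergence` by the shortest run's token count) is inferred from the reported numbers, not documented behavior:

```python
# Recompute the 0.703 from the CLI example above.
exact_match = 2 / 3    # two of three runs byte-identical to runs[0]
avg_overlap = 0.733    # mean pairwise Jaccard reported in token_metrics
div_norm = 3 / 4       # diverged at token 3 of 4 ("The capital is Paris."), assumed normalization

score = 0.5 * exact_match + 0.3 * avg_overlap + 0.2 * div_norm
print(round(score, 3))  # 0.703
```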
- Quick single-number consistency check for multi-agent fan-outs.
- CI gate: fail if N reruns of a prompt drop below a convergence threshold (see the sketch after this list).
- Measuring the effect of a temperature, prompt, or framing change on output stability.
- Quantifying ideation collapse in multi-agent hackathons (N agents → how many distinct ideas?).
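A minimal sketch of the CI-gate idea from the list above. `call_agent` is a hypothetical stand-in for however you invoke the agent under test, and the 0.85 threshold is illustrative:

```python
import sys

from agent_convergence_scorer import score_runs

THRESHOLD = 0.85  # illustrative; tune per prompt and temperature

def call_agent(prompt: str) -> str:
    """Hypothetical placeholder: swap in your real agent or model call."""
    return "stub output for " + prompt

def convergence_gate(prompt: str, n: int = 5) -> int:
    runs = [call_agent(prompt) for _ in range(n)]  # N reruns of the same prompt
    score = score_runs(runs)["convergence_score"]
    print(f"convergence_score={score:.3f} (threshold {THRESHOLD})")
    return 0 if score >= THRESHOLD else 1          # nonzero exit fails the CI job

if __name__ == "__main__":
    sys.exit(convergence_gate("Summarize the incident report in one sentence."))
```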
- Semantic similarity. Tokenization is whitespace-only; "Paris, France" and "paris, france," are different token sets. If you need meaning-level comparison, pair these metrics with a sentence-embedding similarity (or a reranker) externally; a sketch follows this list.
- Subword tokenization studies. This is not a BPE/WordPiece tokenizer.
- Multilingual corpora where whitespace isn't the word boundary (Chinese, Japanese, Thai, etc.) — tokenize upstream, pass the tokenized-then-joined form.
- Ranking quality (nDCG, MRR, etc.) — use `ir-measures` or `ranx` instead.
- Concurrency-safe incremental scoring over streams — this is a batch tool.
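A sketch of that external pairing, assuming `sentence-transformers` is installed (it is not a dependency of this package) and using an illustrative model name:

```python
from itertools import combinations

from sentence_transformers import SentenceTransformer, util  # external, assumed installed
from agent_convergence_scorer import score_runs

runs = ["The capital is Paris, France.", "the capital is paris, france"]

# Surface-level view: whitespace tokens, so casing/punctuation differences count.
token_view = score_runs(runs)["convergence_score"]

# Meaning-level view: mean pairwise cosine similarity of sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
embeddings = model.encode(runs, convert_to_tensor=True)
pairs = list(combinations(range(len(runs)), 2))
semantic_view = sum(float(util.cos_sim(embeddings[i], embeddings[j])) for i, j in pairs) / len(pairs)

print(f"token-level convergence: {token_view:.3f}")
print(f"semantic similarity:     {semantic_view:.3f}")
```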
The composite weights (50/30/20) are heuristic; override by calling the individual functions and combining yourself.
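One way to re-weight without depending on the individual functions' signatures is to recombine the components `score_runs` already returns. The divergence normalization here repeats the assumption from the earlier sanity check:

```python
from agent_convergence_scorer import score_runs

def custom_convergence(runs, w_exact=0.2, w_overlap=0.6, w_div=0.2):
    """Recombine score_runs components with caller-chosen weights."""
    result = score_runs(runs)
    min_len = min(len(r.split()) for r in runs)  # whitespace tokens, mirroring the package's tokenizer
    div_norm = result["divergence_point"]["num_tokens_to_divergence"] / max(min_len, 1)
    return (w_exact * result["exact_match_rate"]
            + w_overlap * result["token_metrics"]["avg_overlap"]
            + w_div * div_norm)

print(custom_convergence(["The answer is A", "The answer is B", "The answer is C"]))  # ~0.58
```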
```python
from agent_convergence_scorer import score_runs

# 4 agents, same prompt, different (or identical) outputs
runs = [agent.run(prompt) for agent in agents]

result = score_runs(runs)
if result["convergence_score"] > 0.8:
    print(f"⚠️ collapse: {result['convergence_score']:.2f} — agents are rephrasing each other")
else:
    print(f"✓ diverse: {result['convergence_score']:.2f}")
```

Built during the Hermes Labs Cascade Hackathon on 2026-04-22, as part of a controlled experiment measuring whether prompt framing affects ideation diversity across N concurrent agents. In the prior-day baseline, 12 agents sharing context collapsed to 2 dominant idea clusters; in the cascade experiment, agents under distinct-persona or distinct-constraint framing produced 4 distinct clusters per arm of 4. This scorer is the mechanism by which the collapse was measured.
- Tamper evidence: the repository carries a staged `hermes-seal` v1 manifest at `.hermes-seal.yaml`. The signature is granted out-of-band with a root-owned key and verified by the Hermes Labs internal sealing toolchain.
- SBOM: `sbom.cdx.json` (CycloneDX 1.5) at repo root.
- Security policy: see SECURITY.md.
See CONTRIBUTING.md. Issues and PRs welcome. For agent-driven contributors, see AGENTS.md.
MIT — see LICENSE.
Hermes Labs builds AI audit infrastructure for enterprise AI systems — EU AI Act readiness, ISO 42001 evidence bundles, continuous compliance monitoring, agent-level risk testing. We work with teams shipping AI into regulated environments.
Our OSS philosophy — read this if you're deciding whether to depend on us:
- Everything we release is MIT, fully free, forever. No "open core," no SaaS tier upsell, no paid version with the features you actually need. You can run everything in this repo standalone, commercially, without talking to us.
- We open-source our own infrastructure. This package, and the ones below, are the tools Hermes Labs uses internally to audit its own agents and to produce audit deliverables for customers. We don't publish demo code — we publish production code.
- We sell audit work, not licenses. If you want an ANNEX-IV pack, an ISO 42001 evidence bundle, gap analysis against the EU AI Act, or agent-level red-teaming delivered as a report, that's at hermes-labs.ai. If you just want the code to run it yourself, it's right here.
The Hermes Labs OSS stack (public, MIT, open-source):
| Tool | What it does |
|---|---|
| lintlang | Static linter for AI agent configs, tool descriptions, system prompts. Zero-LLM CI gate. pip install lintlang |
| little-canary | Prompt injection detection for LLM apps using sacrificial canary-model probes + structural preflight |
| hermes-jailbench | Jailbreak regression benchmark for LLM endpoints — repeatable known-pattern attacks, deterministic scoring. pip install hermes-jailbench |
| claude-router | Router that picks the right Claude model tier + scaffold using local embeddings. pip install claude-router |
| zer0dex | Local dual-layer memory for AI agents — compressed index + vector retrieval |
| colony-probe | Defensive prompt confidentiality audit — detects system-prompt reconstruction via multi-turn probing |
| suy-sideguy | Runtime policy guard for autonomous agents — user-space enforcement + forensic reporting |
| agent-gorgon | Stops agents from fabricating tool output when a registered tool exists — 3-layer Claude Code hook defense |
| rule-audit | Static prompt audit — contradictions, coverage gaps, priority ambiguities, edge cases |
| intent-verify | Repo intent verification + spec-drift checks against markdown specs and handoffs |
| quick-gate-python / quick-gate-js | Quality-gate CLI with bounded auto-repair + escalation artifacts |
| repo-audit | 15-second launch-readiness punch-list for any public GitHub repo |
Pair this scorer with lintlang and hermes-jailbench for a defensible "is my agent behaving consistently" gate in CI.
If this saved you the five minutes of eyeballing a fan-out's outputs, ⭐ the repo — it helps others find it.