SOTA Workload Specification — v1.0

Purpose

Define a single reproducible workload that every framework under test (ruflo, LangGraph, AutoGen, CrewAI, and future additions) runs identically so numbers are directly comparable.

Pinned Parameters

Symbol	Value	Rationale
N	10	Number of concurrent agents spawned in the orchestration-only test
K	50	Number of MCP-equivalent tools registered per agent
T	5	Conversation turns per agent in the end-to-end test
TRIALS	7	Repetitions per sub-benchmark (median taken)
WARMUP	3	Warmup iterations before timing starts
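
To make the role of TRIALS and WARMUP concrete, here is a minimal sketch of a timing loop that discards warmup iterations and reports the median of the timed ones. The `median` and `measure` helpers are hypothetical names chosen for illustration and are not taken from the benchmark repository; the actual harness may differ.

```javascript
// Illustrative sketch only: helper names are assumptions, not the repo's API.
const TRIALS = 7; // timed repetitions per sub-benchmark
const WARMUP = 3; // untimed iterations before timing starts

// Median of a numeric array; averages the two middle values for even lengths.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Runs `fn` `warmup` times untimed, then `trials` times timed,
// and returns the median wall time in milliseconds plus the raw samples.
function measure(fn, { trials = TRIALS, warmup = WARMUP } = {}) {
  for (let i = 0; i < warmup; i++) fn();
  const samples = [];
  for (let i = 0; i < trials; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  return { medianMs: median(samples), samples };
}
```

With the pinned values, `fn` runs 10 times in total and only the last 7 runs contribute to the reported median.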

Two Measurement Modes

Mode A — Orchestration-only (isolated overhead)

No real LLM call. A fake-LLM stub returns a canned tool-call JSON with a fixed ~0 ms delay. This isolates the orchestration framework's own overhead: tool registration, dispatch routing, state management, agent lifecycle, and message serialization.

Dimensions measured:

cold-start: Time from process entry to first agent ready (ms)
compose-K-tools: Time to register K MCP-equivalent tools on one agent (ms)
single-turn-dispatch: One agent, one stub-LLM call → tool dispatch → result (ms)
N-agent-parallel-dispatch: N agents all dispatching simultaneously, wall-clock (ms)
RSS-peak: Resident Set Size at peak during N-agent parallel run (MB)
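
As a rough illustration of the last two dimensions, the sketch below times N concurrent dispatches by wall clock and tracks the highest RSS observed along the way. `createRssTracker`, `timeParallelDispatch`, and the `dispatchOnce` callback are hypothetical names for this example. Sampling RSS only at dispatch boundaries is a simplification and may miss short-lived peaks.

```javascript
// Illustrative sketch only: names and sampling strategy are assumptions.
const N = 10; // concurrent agents in the orchestration-only test

// Tracks the highest resident set size seen across calls to sample().
function createRssTracker() {
  let peakBytes = 0;
  return {
    sample() {
      peakBytes = Math.max(peakBytes, process.memoryUsage().rss);
    },
    peakMB() {
      return peakBytes / (1024 * 1024);
    },
  };
}

// Starts `n` dispatches at once and measures wall-clock time until all finish.
// `dispatchOnce(agentId)` stands in for one agent's stub-LLM call and tool dispatch.
async function timeParallelDispatch(dispatchOnce, n = N) {
  const rss = createRssTracker();
  rss.sample();
  const start = performance.now();
  await Promise.all(
    Array.from({ length: n }, async (_, agentId) => {
      const result = await dispatchOnce(agentId);
      rss.sample(); // sample after each agent completes
      return result;
    })
  );
  const wallMs = performance.now() - start;
  return { wallMs, rssPeakMB: rss.peakMB() };
}
```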

Mode B — End-to-end (real LLM, cheap model)

Uses claude-haiku-4-5 (or equivalent cheapest Haiku variant) with:

max_tokens: 64, temperature: 0
A minimal system prompt: "You are a benchmarking agent. When asked to call a tool, call it."
T turns per agent
Tool schema: a single echo function that returns its input unchanged

Dimensions measured:

first-turn-latency: Wall time from agent.run() to first tool call completed (ms)
per-turn-p50: Median per-turn wall time across T turns (ms)
total-cost-usd: Tokens × per-token price (recorded, not compared)
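
The per-turn median and the cost figure can be derived from per-turn records roughly as follows. This is a sketch under stated assumptions: the record shape (`wallMs`, `inputTokens`, `outputTokens`) and the `summarizeRun` helper are hypothetical, and the prices are passed in by the caller. It also assumes input and output tokens are priced separately, which the spec's "Tokens × per-token price" wording does not spell out.

```javascript
// Illustrative sketch only: record shape and helper names are assumptions.

// Median (p50) of a numeric array.
function p50(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// `turns`: one record per conversation turn.
// `pricePerToken`: USD per token, supplied by the caller (no real prices here).
function summarizeRun(turns, pricePerToken) {
  const perTurnP50Ms = p50(turns.map((t) => t.wallMs));
  const totalCostUsd = turns.reduce(
    (sum, t) =>
      sum +
      t.inputTokens * pricePerToken.input +
      t.outputTokens * pricePerToken.output,
    0
  );
  return { perTurnP50Ms, totalCostUsd };
}
```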

Mode B is recorded but NOT used as the primary SOTA comparison axis, because model latency dominates and varies by network conditions. Orchestration overhead remains the fair axis.

Identical Tool Fixture

Each framework registers K tools with the same schema. Generated by benchmarks/shared/tool-fixture.mjs:

{
  "name": "tool_NN",
  "description": "Benchmark tool NN — echoes its input.",
  "parameters": {
    "type": "object",
    "properties": { "input": { "type": "string" } },
    "required": ["input"]
  }
}

Tools are named tool_00 … tool_49 (for K=50).
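
A generator producing this fixture could look like the sketch below. The function name `makeToolFixture` is an assumption for illustration; the actual contents of benchmarks/shared/tool-fixture.mjs are not shown in this spec and may differ.

```javascript
// Illustrative sketch only: not the actual contents of tool-fixture.mjs.

// Builds k tool definitions named tool_00, tool_01, ... sharing one schema.
function makeToolFixture(k = 50) {
  return Array.from({ length: k }, (_, i) => {
    const nn = String(i).padStart(2, "0"); // zero-padded index, e.g. "07"
    return {
      name: `tool_${nn}`,
      // \u2014 is the dash character used in the fixture's description string.
      description: `Benchmark tool ${nn} \u2014 echoes its input.`,
      parameters: {
        type: "object",
        properties: { input: { type: "string" } },
        required: ["input"],
      },
    };
  });
}
```

Calling `makeToolFixture()` yields 50 entries, the first named `tool_00` and the last `tool_49`.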

Fake-LLM Stub Protocol

Frameworks receive a stub "LLM" that:

Always returns a tool call to tool_00 with input: "bench".
Returns immediately, with no I/O and 0 ms artificial delay, to remove network variance.
Terminates after T calls so the conversation ends naturally.

For Python frameworks the stub is a FakeListChatModel/FakeLLM subclass. For Node.js frameworks the stub is an async function returning { tool: "tool_00", input: "bench" }.
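
For the Node.js side, the stub protocol above might be sketched as follows. `createFakeLLM` and `runConversation` are hypothetical names, and signalling termination by returning `null` after T calls is an assumption of this sketch; each framework adapter may need a different end-of-conversation signal.

```javascript
// Illustrative sketch only: names and the null termination signal are assumptions.
const T = 5; // calls before the stub ends the conversation

// Returns an async stub "LLM" that yields the canned tool call for the
// first `maxCalls` invocations and null afterwards.
function createFakeLLM(maxCalls = T) {
  let calls = 0;
  return async function fakeLLM(_messages) {
    if (calls >= maxCalls) return null;
    calls += 1;
    return { tool: "tool_00", input: "bench" };
  };
}

// Minimal driver loop: dispatches each tool call until the stub returns null.
// `tools` maps tool names to plain functions.
async function runConversation(llm, tools) {
  const results = [];
  for (let call = await llm(results); call !== null; call = await llm(results)) {
    results.push(tools[call.tool](call.input));
  }
  return results;
}
```

Run against an echo implementation of `tool_00`, the driver loop collects one `"bench"` result per call and stops once the stub returns `null`.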

Single-Command Repro

# Full matrix (all frameworks, both modes)
node benchmarks/run-sota-matrix.mjs

# Single framework
python benchmarks/comparators/langgraph/run.py --mode A --trials 7
python benchmarks/comparators/autogen/run.py   --mode A --trials 7
python benchmarks/comparators/crewai/run.py    --mode A --trials 7
node benchmarks/comparators/ruflo/run.mjs    --mode A --trials 7

Output: docs/benchmarks/sota-matrix.json

Platform & Environment

Primary: darwin-arm64 (Apple M-series)
Secondary: linux-x64 (CI runner, ubuntu-latest)
Node.js ≥ 22 (pinned to the exact version that node --version reports)
Python ≥ 3.12
Package versions pinned in benchmarks/comparators/<framework>/requirements.txt or package.json respectively.

Comparator Versions (pinned 2026-05-24)

Framework	Version	Language	Notes
ruflo / @claude-flow/cli	3.8.0 (current dist)	Node.js
LangGraph	1.2.1 (installed)	Python	langgraph-prebuilt 1.1.0
AutoGen	autogen-agentchat 0.4.9	Python	autogen-core 0.4.9
CrewAI	0.80.0	Python	Requires setuptools>=70 for pkg_resources
Anthropic Agent SDK	latest	Python	Mode B only (future)

What "SOTA" Means Here

A framework is SOTA on a dimension if it achieves the lowest median value across all frameworks on that dimension, reproducibly, across both platforms. Claims are qualified with the dimension name: "SOTA on compose latency" ≠ "SOTA overall".
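
One way to read this definition is sketched below: a framework is SOTA on a dimension only if it has the strictly lowest median on every platform. The `sotaOn` function and the `results[platform][framework][dimension]` layout are assumptions for illustration and do not describe the actual shape of sota-matrix.json. Treating ties and split wins as "no SOTA" is likewise a choice made for this sketch.

```javascript
// Illustrative sketch only: data layout and tie handling are assumptions.

// results[platform][framework][dimension] = median value (lower is better).
// Returns the winning framework's name, or null if no single framework
// is strictly best on every platform.
function sotaOn(dimension, results) {
  let winner = null;
  for (const byFramework of Object.values(results)) {
    const ranked = Object.entries(byFramework)
      .filter(([, dims]) => Number.isFinite(dims[dimension]))
      .sort(([, a], [, b]) => a[dimension] - b[dimension]);
    if (ranked.length === 0) return null; // no data on this platform
    // A tie for first place on any platform means no unique winner.
    if (ranked.length > 1 && ranked[0][1][dimension] === ranked[1][1][dimension]) {
      return null;
    }
    const best = ranked[0][0];
    if (winner !== null && winner !== best) return null; // platforms disagree
    winner = best;
  }
  return winner;
}
```

Because the check is per dimension, a framework can be SOTA on compose-K-tools while another wins single-turn-dispatch, which matches the requirement to qualify every claim with its dimension name.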

Exclusions & Honest Notes

Mastra (Node.js): planned for future work; requires a TypeScript compilation step that is not yet wired in.
Atomic Agents: Python-only alpha; evaluated but not in primary matrix yet.
Mode B numbers are network-sensitive and labeled "(indicative)".
RSS numbers on darwin-arm64 reflect unified memory architecture; may differ on Linux.
