Define a single reproducible workload that every framework under test (ruflo, LangGraph, AutoGen, CrewAI, and future additions) runs identically so numbers are directly comparable.
| Symbol | Value | Description |
|---|---|---|
| N | 10 | Number of concurrent agents spawned in the orchestration-only test |
| K | 50 | Number of MCP-equivalent tools registered per agent |
| T | 5 | Conversation turns per agent in the end-to-end test |
| TRIALS | 7 | Repetitions per sub-benchmark (median taken) |
| WARMUP | 3 | Warmup iterations before timing starts |
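
The way WARMUP and TRIALS combine can be sketched as a small timing helper. This is an illustrative Python sketch, not the harness's actual code: `median_ms` and its `workload` argument are hypothetical names, and it assumes the workload can be repeated inside one process (which would not hold for cold-start, where each trial presumably needs a fresh process).

```python
import statistics
import time
from typing import Callable

TRIALS = 7  # repetitions per sub-benchmark, as in the table above
WARMUP = 3  # untimed warmup iterations, as in the table above


def median_ms(
    workload: Callable[[], object],
    trials: int = TRIALS,
    warmup: int = WARMUP,
) -> float:
    """Run `workload` untimed `warmup` times, then return the median of `trials` timed runs in ms."""
    for _ in range(warmup):
        workload()  # warm caches and lazy imports; result discarded

    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        workload()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Taking the median rather than the mean keeps a single slow trial (for example, a GC pause) from shifting the reported number.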
Mode A (orchestration-only) makes no real LLM call. A fake-LLM stub returns a canned tool-call JSON after a fixed ~0 ms delay. This isolates the orchestration framework's own overhead: tool registration, dispatch routing, state management, agent lifecycle, and message serialization.
Dimensions measured:
- cold-start: Time from process entry to first agent ready (ms)
- compose-K-tools: Time to register K MCP-equivalent tools on one agent (ms)
- single-turn-dispatch: One agent, one stub-LLM call → tool dispatch → result (ms)
- N-agent-parallel-dispatch: N agents all dispatching simultaneously, wall-clock (ms)
- RSS-peak: Resident Set Size at peak during N-agent parallel run (MB)
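
As a rough illustration of the last two dimensions, the sketch below times N concurrent dispatches by wall clock and reads the process's peak RSS. It assumes an asyncio-based framework; `dispatch_once` is a hypothetical stand-in for one agent's stub-LLM call and tool dispatch, not any framework's real API. Note that `ru_maxrss` is reported in bytes on macOS and in kilobytes on Linux, so the conversion has to be platform-aware.

```python
import asyncio
import resource
import sys
import time

N = 10  # concurrent agents, as in the parameter table


async def dispatch_once(agent_id: int) -> dict:
    """Hypothetical stand-in for one agent's stub-LLM call -> tool dispatch -> result."""
    await asyncio.sleep(0)  # yield to the event loop once; no artificial delay
    return {"agent": agent_id, "tool": "tool_00", "output": "bench"}


async def parallel_dispatch_ms(n: int = N) -> tuple[float, list[dict]]:
    """Start n dispatches together and return (wall-clock ms, results)."""
    start = time.perf_counter()
    results = await asyncio.gather(*(dispatch_once(i) for i in range(n)))
    return (time.perf_counter() - start) * 1000.0, list(results)


def peak_rss_mb() -> float:
    """Peak resident set size of the current process in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # macOS reports bytes; Linux reports kilobytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


if __name__ == "__main__":
    elapsed_ms, results = asyncio.run(parallel_dispatch_ms())
    print(f"{len(results)} agents, {elapsed_ms:.3f} ms, peak RSS {peak_rss_mb():.1f} MB")
```

`RUSAGE_SELF` covers only the calling process, so a framework that spawns worker processes would need its children measured separately.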
Mode B (end-to-end) uses claude-haiku-4-5 (or the equivalent cheapest Haiku variant) with:
- `max_tokens: 64`, `temperature: 0`
- A minimal system prompt: "You are a benchmarking agent. When asked to call a tool, call it."
- T turns per agent
- Tool schema: a single `echo` function that returns its input unchanged
Dimensions measured:
- first-turn-latency: Wall time from agent.run() to first tool call completed (ms)
- per-turn-p50: Median per-turn wall time across T turns (ms)
- total-cost-usd: Tokens × per-token price (recorded, not compared)
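
The three end-to-end dimensions could be derived from per-turn records roughly as follows. This is a sketch under stated assumptions: `TurnRecord` and its fields are hypothetical, and pricing input and output tokens separately is an assumption layered on the "tokens × per-token price" definition. The prices are parameters because no price is specified here.

```python
import statistics
from dataclasses import dataclass


@dataclass
class TurnRecord:
    """One conversation turn (hypothetical record shape)."""
    wall_ms: float      # wall time for this turn
    input_tokens: int   # prompt tokens billed for this turn
    output_tokens: int  # completion tokens billed for this turn


def first_turn_latency(turns: list[TurnRecord]) -> float:
    """Wall time of the first turn, in ms."""
    return turns[0].wall_ms


def per_turn_p50(turns: list[TurnRecord]) -> float:
    """Median per-turn wall time across all recorded turns, in ms."""
    return statistics.median(t.wall_ms for t in turns)


def total_cost_usd(
    turns: list[TurnRecord],
    usd_per_input_token: float,
    usd_per_output_token: float,
) -> float:
    """Sum of tokens times per-token price over all turns."""
    return sum(
        t.input_tokens * usd_per_input_token + t.output_tokens * usd_per_output_token
        for t in turns
    )
```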
Mode B is recorded but NOT used as the primary SOTA comparison axis, because model latency dominates and varies by network conditions. Orchestration overhead remains the fair axis.
Each framework registers K tools with the same schema, generated by `benchmarks/shared/tool-fixture.mjs`:

```json
{
  "name": "tool_NN",
  "description": "Benchmark tool NN — echoes its input.",
  "parameters": {
    "type": "object",
    "properties": { "input": { "type": "string" } },
    "required": ["input"]
  }
}
```

Tools are named `tool_00` … `tool_49` (for K=50).
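
For the Python comparators, an equivalent fixture could be produced along these lines. This is a sketch that mirrors the schema above, not a port of `tool-fixture.mjs`; `make_tool` and `make_tools` are hypothetical names.

```python
K = 50  # tools registered per agent, as in the parameter table


def make_tool(index: int) -> dict:
    """Build one tool definition with a zero-padded two-digit suffix."""
    nn = f"{index:02d}"
    return {
        "name": f"tool_{nn}",
        # \u2014 is the dash character used in the fixture's description string.
        "description": f"Benchmark tool {nn} \u2014 echoes its input.",
        "parameters": {
            "type": "object",
            "properties": {"input": {"type": "string"}},
            "required": ["input"],
        },
    }


def make_tools(k: int = K) -> list[dict]:
    """Return k tool definitions named tool_00 through tool_(k-1)."""
    return [make_tool(i) for i in range(k)]
```

Generating the list from a single function keeps every framework's tool set identical apart from the name suffix.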
Frameworks receive a stub "LLM" that:
- Always returns a tool call to `tool_00` with `input: "bench"`.
- Returns synchronously (0 ms artificial delay) to remove network variance.
- Terminates after T calls so the conversation ends naturally.
For Python frameworks the stub is a FakeListChatModel/FakeLLM subclass.
For Node.js frameworks the stub is an async function returning { tool: "tool_00", input: "bench" }.
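
The stub's contract can be illustrated without any framework dependency. The sketch below is a plain-Python stand-in, not the `FakeListChatModel`/`FakeLLM` subclass itself; `StubLLM`, `echo`, and `run_conversation` are hypothetical names, and returning `None` to signal the end of the conversation is an assumption.

```python
from typing import Optional

T = 5  # conversation turns per agent, as in the parameter table


class StubLLM:
    """Returns the same canned tool call until max_calls is reached."""

    def __init__(self, max_calls: int = T) -> None:
        self.max_calls = max_calls
        self.calls = 0

    def __call__(self, messages: list) -> Optional[dict]:
        if self.calls >= self.max_calls:
            return None  # no further tool call: the conversation ends
        self.calls += 1
        return {"tool": "tool_00", "input": "bench"}


def echo(value: str) -> str:
    """The benchmark tool: returns its input unchanged."""
    return value


def run_conversation(llm: StubLLM) -> list[str]:
    """Drive the stub until it stops requesting tool calls; return the tool outputs."""
    messages: list = []
    outputs: list[str] = []
    while (call := llm(messages)) is not None:
        result = echo(call["input"])
        messages.append({"role": "tool", "content": result})
        outputs.append(result)
    return outputs
```

Because the stub does no I/O and never sleeps, any time measured around `run_conversation` is attributable to the dispatch loop rather than to a model.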
```bash
# Full matrix (all frameworks, both modes)
node benchmarks/run-sota-matrix.mjs

# Single framework (Python comparators run under python3, not node)
python3 benchmarks/comparators/langgraph/run.py --mode A --trials 7
python3 benchmarks/comparators/autogen/run.py --mode A --trials 7
python3 benchmarks/comparators/crewai/run.py --mode A --trials 7
node benchmarks/comparators/ruflo/run.mjs --mode A --trials 7
```

Output: `docs/benchmarks/sota-matrix.json`
- Primary: darwin-arm64 (Apple M-series)
- Secondary: linux-x64 (CI runner, ubuntu-latest)
- Node.js ≥ 22 (pinned to whatever `node --version` reports)
- Python ≥ 3.12
- Package versions pinned in `benchmarks/comparators/<framework>/requirements.txt` or `package.json` respectively.
| Framework | Version | Language | Notes |
|---|---|---|---|
| ruflo / @claude-flow/cli | 3.8.0 (current dist) | Node.js | |
| LangGraph | 1.2.1 (installed) | Python | langgraph-prebuilt 1.1.0 |
| AutoGen | autogen-agentchat 0.4.9 | Python | autogen-core 0.4.9 |
| CrewAI | 0.80.0 | Python | Requires setuptools>=70 for pkg_resources |
| Anthropic Agent SDK | latest | Python | Mode B only (future) |
A framework is SOTA on a dimension if it achieves the lowest median value across all frameworks on that dimension, reproducibly, across both platforms. Claims are qualified with the dimension name: "SOTA on compose latency" ≠ "SOTA overall".
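
One way to make this definition mechanical is sketched below. It reads "across both platforms" as "lowest median on every platform", which is an interpretation rather than something stated above, and it does not model the reproducibility requirement. `sota_framework` and the `Results` shape are hypothetical.

```python
from typing import Optional

# results[platform][framework] = median value for ONE dimension (lower is better)
Results = dict[str, dict[str, float]]


def sota_framework(results: Results) -> Optional[str]:
    """Return the framework with the lowest median on every platform, else None."""
    # Winner per platform; ties resolve to the first framework listed.
    winners = {min(by_framework, key=by_framework.get) for by_framework in results.values()}
    # SOTA only if the same framework wins on every platform.
    return winners.pop() if len(winners) == 1 else None
```

The function is applied once per dimension, which matches the rule that a claim is always qualified by the dimension name.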
- Mastra (Node.js): included in future work; requires TypeScript compilation step not yet wired.
- Atomic Agents: Python-only alpha; evaluated but not in primary matrix yet.
- Mode B numbers are network-sensitive and labeled "(indicative)".
- RSS numbers on darwin-arm64 reflect unified memory architecture; may differ on Linux.