SOTA Workload Specification — v1.0

Purpose

Define a single reproducible workload that every framework under test (ruflo, LangGraph, AutoGen, CrewAI, and future additions) runs identically so numbers are directly comparable.

Pinned Parameters

| Symbol | Value | Rationale |
|--------|-------|-----------|
| N | 10 | Number of concurrent agents spawned in the orchestration-only test |
| K | 50 | Number of MCP-equivalent tools registered per agent |
| T | 5 | Conversation turns per agent in the end-to-end test |
| TRIALS | 7 | Repetitions per sub-benchmark (median taken) |
| WARMUP | 3 | Warmup iterations before timing starts |
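As a rough illustration of how WARMUP and TRIALS interact, the sketch below runs a sub-benchmark untimed WARMUP times, then times it TRIALS times and reports the median. The names `PARAMS`, `median`, and `timeSubBenchmark` are hypothetical and are not taken from the benchmark harness; only the parameter values come from the table above.

```javascript
// Pinned parameters, copied from the table above.
const PARAMS = Object.freeze({ N: 10, K: 50, T: 5, TRIALS: 7, WARMUP: 3 });

// Median of a numeric array (mean of the two middle values for even length).
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical helper: run `fn` untimed `warmup` times, then time it `trials` times.
async function timeSubBenchmark(
  fn,
  { warmup = PARAMS.WARMUP, trials = PARAMS.TRIALS } = {},
) {
  for (let i = 0; i < warmup; i++) await fn();
  const samplesMs = [];
  for (let i = 0; i < trials; i++) {
    const start = performance.now();
    await fn();
    samplesMs.push(performance.now() - start);
  }
  return { samplesMs, medianMs: median(samplesMs) };
}
```

With an odd TRIALS value such as 7, the median is always one of the observed samples, so a single outlier trial does not shift the reported number.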

Two Measurement Modes

Mode A — Orchestration-only (isolated overhead)

No real LLM call is made. A fake-LLM stub returns a canned tool-call JSON after a fixed ~0 ms delay, which isolates the orchestration framework's own overhead: tool registration, dispatch routing, state management, agent lifecycle, and message serialization.

Dimensions measured:

  1. cold-start: Time from process entry to first agent ready (ms)
  2. compose-K-tools: Time to register K MCP-equivalent tools on one agent (ms)
  3. single-turn-dispatch: One agent, one stub-LLM call → tool dispatch → result (ms)
  4. N-agent-parallel-dispatch: N agents all dispatching simultaneously, wall-clock (ms)
  5. RSS-peak: Resident Set Size at peak during N-agent parallel run (MB)
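The N-agent-parallel-dispatch and RSS-peak dimensions could be measured roughly as follows. This is a minimal sketch, not the harness's actual code: `dispatchOnce` is a hypothetical stand-in for one agent's stubbed turn, RSS is polled on a timer (so short spikes can be missed), and MB is taken to mean 2^20 bytes.

```javascript
// Hypothetical stand-in for one agent's single-turn dispatch in Mode A.
async function dispatchOnce(agentId) {
  return { agentId, tool: "tool_00", input: "bench" };
}

// Launch n dispatches together; report wall-clock time and the peak RSS seen.
async function measureParallelDispatch(n, dispatch = dispatchOnce) {
  let peakRssBytes = process.memoryUsage().rss;
  const sample = () => {
    peakRssBytes = Math.max(peakRssBytes, process.memoryUsage().rss);
  };
  const sampler = setInterval(sample, 1); // coarse polling, an assumption
  const start = performance.now();
  try {
    const results = await Promise.all(
      Array.from({ length: n }, (_, i) => dispatch(i)),
    );
    const wallMs = performance.now() - start;
    sample(); // take one last reading after all agents have finished
    return { wallMs, peakRssMb: peakRssBytes / (1024 * 1024), results };
  } finally {
    clearInterval(sampler);
  }
}
```

Timing the whole `Promise.all` rather than summing per-agent times is what makes this a wall-clock figure: it captures contention between agents, not just individual dispatch cost.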

Mode B — End-to-end (real LLM, cheap model)

Uses claude-haiku-4-5 (or equivalent cheapest Haiku variant) with:

  • max_tokens: 64, temperature: 0
  • A minimal system prompt: "You are a benchmarking agent. When asked to call a tool, call it."
  • T turns per agent
  • Tool schema: a single echo function that returns its input unchanged

Dimensions measured:

  1. first-turn-latency: Wall time from agent.run() to first tool call completed (ms)
  2. per-turn-p50: Median per-turn wall time across T turns (ms)
  3. total-cost-usd: Tokens × per-token price (recorded, not compared)

Mode B is recorded but NOT used as the primary SOTA comparison axis, because model latency dominates and varies by network conditions. Orchestration overhead remains the fair axis.
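The per-turn-p50 and total-cost-usd dimensions reduce to simple arithmetic, sketched below. The function names, the per-million-token price shape, and every number in the example are placeholders for illustration; they are not measured results or actual provider prices.

```javascript
// Median (p50) of per-turn wall times in milliseconds.
function p50(turnMs) {
  const sorted = [...turnMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Cost in USD: tokens × per-token price, with prices quoted per million tokens.
function totalCostUsd(usage, pricePerMTok) {
  const input = usage.inputTokens * pricePerMTok.input;
  const output = usage.outputTokens * pricePerMTok.output;
  return (input + output) / 1e6;
}

// Placeholder values only; substitute real timings, usage, and published prices.
const exampleTurnMs = [820, 640, 700, 655, 690]; // T = 5 turns
const exampleUsage = { inputTokens: 2000, outputTokens: 320 }; // 5 turns × 64 max_tokens
const examplePrice = { input: 1.0, output: 5.0 }; // USD per million tokens

console.log(p50(exampleTurnMs));
console.log(totalCostUsd(exampleUsage, examplePrice));
```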

Identical Tool Fixture

Each framework registers K tools with the same schema. Generated by benchmarks/shared/tool-fixture.mjs:

{
  "name": "tool_NN",
  "description": "Benchmark tool NN — echoes its input.",
  "parameters": {
    "type": "object",
    "properties": { "input": { "type": "string" } },
    "required": ["input"]
  }
}

Tools are named tool_00 through tool_49 (for K=50).
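A generator for this fixture might look like the sketch below. It is an assumption about the shape of such a helper, not the contents of benchmarks/shared/tool-fixture.mjs; `makeToolFixture` and `echo` are hypothetical names.

```javascript
// Build k tool definitions matching the shared schema, with zero-padded names.
function makeToolFixture(k = 50) {
  // Two digits suffice for K = 50; widen only if k ever exceeds 100.
  const width = Math.max(2, String(k - 1).length);
  return Array.from({ length: k }, (_, i) => {
    const nn = String(i).padStart(width, "0");
    return {
      name: `tool_${nn}`,
      description: `Benchmark tool ${nn} — echoes its input.`,
      parameters: {
        type: "object",
        properties: { input: { type: "string" } },
        required: ["input"],
      },
    };
  });
}

// Every tool shares the same handler: return the input unchanged.
const echo = ({ input }) => input;

const tools = makeToolFixture(50);
console.log(tools[0].name, tools[tools.length - 1].name);
```

Generating the definitions from one function, rather than hand-writing them per framework, keeps the schema byte-identical across comparators.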

Fake-LLM Stub Protocol

Frameworks receive a stub "LLM" that:

  1. Always returns a tool call to tool_00 with input: "bench".
  2. Returns immediately (0 ms artificial delay) to remove network variance.
  3. Terminates after T calls so the conversation ends naturally.

For Python frameworks the stub is a FakeListChatModel/FakeLLM subclass. For Node.js frameworks the stub is an async function returning { tool: "tool_00", input: "bench" }.
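For the Node.js side, the three rules above could be satisfied by a stub along these lines. This is a sketch under stated assumptions: `makeFakeLlm` and `runConversation` are hypothetical names, and returning `null` to signal the end of the conversation is an invented convention, since the protocol does not specify how termination is communicated.

```javascript
// Factory for a fake LLM: returns the canned tool call maxCalls times, then null.
function makeFakeLlm(maxCalls = 5) {
  let calls = 0;
  return async function fakeLlm(_messages) {
    if (calls >= maxCalls) return null; // hypothetical end-of-conversation signal
    calls += 1;
    return { tool: "tool_00", input: "bench" };
  };
}

// Minimal driver loop: ask the stub, dispatch the tool, repeat until it stops.
async function runConversation(llm, toolHandlers) {
  const transcript = [];
  for (
    let call = await llm(transcript);
    call !== null;
    call = await llm(transcript)
  ) {
    const result = toolHandlers[call.tool](call.input);
    transcript.push({ call, result });
  }
  return transcript;
}
```

Calling `runConversation(makeFakeLlm(5), { tool_00: (input) => input })` resolves to a five-entry transcript, one per turn for T=5.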

Single-Command Repro

# Full matrix (all frameworks, both modes)
node benchmarks/run-sota-matrix.mjs

# Single framework
python3 benchmarks/comparators/langgraph/run.py --mode A --trials 7
python3 benchmarks/comparators/autogen/run.py   --mode A --trials 7
python3 benchmarks/comparators/crewai/run.py    --mode A --trials 7
node benchmarks/comparators/ruflo/run.mjs    --mode A --trials 7

Output: docs/benchmarks/sota-matrix.json

Platform & Environment

  • Primary: darwin-arm64 (Apple M-series)
  • Secondary: linux-x64 (CI runner, ubuntu-latest)
  • Node.js ≥ 22 (the exact version is whatever node --version reports on the benchmark machine)
  • Python ≥ 3.12
  • Package versions are pinned in benchmarks/comparators/<framework>/requirements.txt (Python) or package.json (Node.js).

Comparator Versions (pinned 2026-05-24)

| Framework | Version | Language | Notes |
|-----------|---------|----------|-------|
| ruflo / @claude-flow/cli | 3.8.0 (current dist) | Node.js | |
| LangGraph | 1.2.1 (installed) | Python | langgraph-prebuilt 1.1.0 |
| AutoGen | autogen-agentchat 0.4.9 | Python | autogen-core 0.4.9 |
| CrewAI | 0.80.0 | Python | Requires setuptools>=70 for pkg_resources |
| Anthropic Agent SDK | latest | Python | Mode B only (future) |

What "SOTA" Means Here

A framework is SOTA on a dimension if it achieves the lowest median value across all frameworks on that dimension, reproducibly, across both platforms. Claims are qualified with the dimension name: "SOTA on compose latency" ≠ "SOTA overall".
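The definition above can be expressed as a small check over per-platform medians. The sketch below is one reading of it: `sotaOnDimension` is a hypothetical name, the input shape is assumed, and treating a tie for lowest as "no SOTA" is an assumption the definition does not settle.

```javascript
// medians[platform][framework] = median value for one dimension (lower is better).
// Returns the framework that is sole lowest on every platform, or null otherwise.
function sotaOnDimension(medians) {
  const winnersPerPlatform = Object.values(medians).map((byFramework) => {
    const entries = Object.entries(byFramework);
    const best = Math.min(...entries.map(([, value]) => value));
    return entries.filter(([, value]) => value === best).map(([name]) => name);
  });
  const first = winnersPerPlatform[0];
  if (!first || first.length !== 1) return null; // no data, or a tie
  const sameEverywhere = winnersPerPlatform.every(
    (winners) => winners.length === 1 && winners[0] === first[0],
  );
  return sameEverywhere ? first[0] : null;
}

// Placeholder framework names and values, for illustration only.
const exampleMedians = {
  "darwin-arm64": { frameworkA: 1.2, frameworkB: 3.4 },
  "linux-x64": { frameworkA: 2.0, frameworkB: 2.9 },
};
console.log(sotaOnDimension(exampleMedians));
```

Because the check runs per dimension, a framework can win compose latency and lose cold-start, which is why claims are always qualified with the dimension name.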

Exclusions & Honest Notes

  • Mastra (Node.js): planned for future work; requires a TypeScript compilation step that is not yet wired in.
  • Atomic Agents: Python-only alpha; evaluated but not in primary matrix yet.
  • Mode B numbers are network-sensitive and labeled "(indicative)".
  • RSS numbers on darwin-arm64 reflect unified memory architecture; may differ on Linux.