---
title: TRIAGE-X Benchmark
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
"The first high-fidelity, deterministic SRE simulation environment for Autonomous AI Agents."
TRIAGE-X is a production-grade incident response simulator designed for the OpenEnv Reinforcement Learning Benchmark. Unlike generic "toy" environments (games, math, or simple web navigation), TRIAGE-X models the high-stakes reality of Site Reliability Engineering (SRE).
AI agents are placed in a live "NOC" where they must diagnose cascading microservice failures, manage a limited infrastructure budget, and restore system health without crashing healthy components.
```mermaid
graph TD
    User((User Traffic)) --> AGW[API Gateway]
    AGW --> AS[Auth Service]
    AGW --> PW[Payment Worker]
    PW --> DB[(Database Replica)]
    AGW --> CC[Cache Cluster]
    NQ[Notification Queue] -.-> PW
```
> [!IMPORTANT]
> Real-World Utility (30% Weight): Models authentic microservice failure patterns including Backpressure Bottlenecks, Hidden Cascading Dependencies, and Alert Fatigue.

- ⚡ Deterministic Logic: Every task variant (v1, v2, v3) is 100% deterministic, ensuring reproducible agent benchmarks across multiple runs.
- ⚖️ Multi-Dimensional Grading: Graders don't just check "Did it work?" They analyze SLA Stability, Budget Utilization, and Action Efficiency.
- 📽️ Observability Dashboard: Includes a minimalist React/Vite "War Room" dashboard for human verification of agent trajectories.
- 📦 OpenEnv Spec Native: Strictly implements `POST /reset`, `POST /step`, `GET /state`, `GET /tasks`, `GET /score`, and `GET /health`.
- 🐍 Multi-Mode Compliance: Fully compliant with Python-based OpenEnv validators using `pyproject.toml`, `uv.lock`, and a standard `server` entry point.
The environment ships with 3 distinct grading difficulties producing a normalized final episode score between 0.0 and 1.0:
| Task | Difficulty | Pattern Modeled |
|---|---|---|
| `easy_signal_noise` | 🟢 Easy | Queue bottlenecks & horizontal scaling |
| `medium_hidden_dependency` | 🟡 Medium | Cascading failures & latency propagation |
| `hard_multi_incident` | 🔴 Hard | Concurrent cluster anomalies & budget management |
The `rewardEngine.js` grader provides granular trajectory signals (reward shaping) in the range [-1.0, 1.0]. The final score is calculated using the following Meta-aligned weights:
| Dimension | Weight | Metric |
|---|---|---|
| System Stability | 30% | Final Avg. Health vs. SLA Target |
| Harm Reduction | 25% | Cumulative Customer Impact saved |
| Root Cause Resolution | 25% | Binary check for core issue fix |
| Action Efficiency | 10% | Steps used vs. Max allowed |
| Budget Utilization | 5% | Infrastructure cost management |
| Safety Violation | 5% | Penalty for "Reckless Rebooting" |
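Under these weights, the aggregation can be sketched as a simple weighted sum clamped to [0, 1]. This is an illustrative Python reconstruction, not the actual `rewardEngine.js` source; the dimension key names are assumptions:

```python
# Illustrative reconstruction of the final-score aggregation; the real
# logic lives in rewardEngine.js, and the dimension keys here are assumed.
WEIGHTS = {
    "system_stability": 0.30,       # Final avg. health vs. SLA target
    "harm_reduction": 0.25,         # Cumulative customer impact saved
    "root_cause_resolution": 0.25,  # Binary check for core issue fix
    "action_efficiency": 0.10,      # Steps used vs. max allowed
    "budget_utilization": 0.05,     # Infrastructure cost management
    "safety_violation": 0.05,       # Penalty for "Reckless Rebooting"
}

def final_score(dimensions: dict) -> float:
    """Weighted sum of per-dimension scores (each in [0, 1]), clamped to [0, 1]."""
    raw = sum(w * dimensions.get(name, 0.0) for name, w in WEIGHTS.items())
    return max(0.0, min(1.0, raw))
```

For example, an agent that only fixes the root cause but scores zero everywhere else would land at 0.25.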
```jsonc
{
  "system_health": 0.85,    // Normalized stability score
  "customer_impact": 12.5,  // Downstream severity (lower is better)
  "remaining_budget": 1200, // $ cost of cloud resources
  "services": [...],        // Component telemetry (latency, errors, health)
  "active_alerts": [...]    // CloudWatch / Datadog simulants
}
```

Agents specify an action and a target:
- `inspect_service`: Reveal hidden internal telemetry.
- `inspect_dependency`: Trace downstream routing maps.
- `restart_service`: Hard-cycle a component (high cost).
- `throttle_queue`: Drop traffic to clear backpressure.
- `rollback_deploy`: Revert a faulty deployment signature.
- `scale_service`: Add horizontal instances to handle load.
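An agent's step request can then be assembled from one action and one target. The helper below is a hypothetical sketch: the action names come from the list above, but the exact JSON field names accepted by `POST /step` are assumptions:

```python
# Hypothetical helper for building a POST /step body.
# The "action"/"target" field names are assumed, not confirmed by the spec.
VALID_ACTIONS = {
    "inspect_service", "inspect_dependency", "restart_service",
    "throttle_queue", "rollback_deploy", "scale_service",
}

def build_step_payload(action: str, target: str) -> dict:
    """Validate the action name and return a JSON-serializable step body."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"Unknown action: {action!r}")
    return {"action": action, "target": target}
```

The resulting dict would be sent as the JSON body of a `POST /step` request; rejecting unknown actions client-side avoids wasting a budgeted step on a malformed call.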
The environment is compatible with `uv` and standard Python entry points:

```bash
# Install dependencies
uv sync

# Start the environment server (wraps the Node.js backend)
uv run server
```

If you prefer running the core server directly:
```bash
cd server
npm install
npm start
```

To run the dashboard client:

```bash
cd client
npm install
npm run dev
```

To run the benchmark against an LLM (default: `gpt-4o-mini`):
```bash
# Set OPENAI_API_KEY in .env
python3 inference.py
```

The inference script emits strictly formatted `[START]`, `[STEP]`, and `[END]` logs for evaluator scaling.
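Because the log format is strict, downstream evaluators can recover trajectories with a few lines of Python. This is a hypothetical parser assuming one `[TAG] message` entry per line; the payload format after each tag is an assumption:

```python
import re

# Matches one "[TAG] message" log entry per line; TAG is START, STEP, or END.
LOG_LINE = re.compile(r"^\[(START|STEP|END)\]\s*(.*)$")

def parse_run_log(text: str) -> list:
    """Extract (tag, message) pairs from an inference.py log dump,
    ignoring any lines that do not match the strict format."""
    events = []
    for line in text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            events.append((m.group(1), m.group(2)))
    return events
```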
The repository root contains a compliant `Dockerfile`. The Space is configured as a `docker` SDK Space but supports multi-mode interaction via the `server` entry point.
```bash
docker build -t triage-x .
docker run -p 7860:7860 triage-x
```

Created for the Meta x Hugging Face Hackathon - TRIAGE-X Benchmark Environment.