PRD: Gollem Autonomous Improvement Harness
Author: Trevor Prater / Fugue Labs
Status: Draft v2
Date: 2026-03-07
Inspiration: karpathy/autoresearch
Problem Statement
Gollem scored 93% (14/15 tasks) on Terminal Bench 2. Improving beyond this requires iterating on the agent framework itself — modifying tool implementations, prompt strategies, middleware chains, reasoning patterns, and orchestration logic. Currently, this iteration is manual and expensive.
Karpathy’s autoresearch demonstrates that this loop can be fully autonomous. We adapt his pattern with a critical twist: the harness itself is a Gollem agent, written in Go, using the Gollem framework. Gollem improves Gollem. The framework is both the subject and the instrument of optimization.
Design Philosophy
- Gollem all the way down. The harness is a Go program using github.com/fugue-labs/gollem. The researcher agent that modifies code, runs evals, and decides keep/discard is a gollem.Agent[ExperimentResult] with typed tools.
- Single metric. Terminal Bench task pass rate.
- Fixed eval budget. Each eval run uses the same task set, same model, same time limits.
- Git-based versioning. Every experiment is a commit. Keep = advance. Discard = reset.
- Never stop. The agent runs indefinitely until manually interrupted.
- Eat your own dog food. Every Gollem primitive — middleware, guardrails, cost tracking, tracing, structured output — is exercised by the harness itself. If the framework has a gap, the harness exposes it.
Architecture
Karpathy’s Pattern
Human → "read program.md and start" → Claude Code
Claude Code modifies train.py
Claude Code runs `uv run train.py`
Claude Code parses results
Claude Code keeps/discards
LOOP
The agent is Claude Code. The instructions are in program.md. The loop is implicit in the conversation.
Gollem AutoEval Pattern
Human → `go run cmd/autoeval/main.go`
Gollem Agent[ExperimentResult] runs autonomously
├── Tool: ReadFile (read agent config, traces, results)
├── Tool: WriteFile (modify agent config, prompts, tools)
├── Tool: GitCommit (snapshot experiment)
├── Tool: RunEval (execute Terminal Bench subset)
├── Tool: ParseResults (extract pass_rate from results.json)
├── Tool: GitReset (discard failed experiment)
├── Tool: ReadTraces (analyze why tasks failed)
└── Tool: AnalyzeHistory (review results.tsv for patterns)
LOOP via an explicit Go for-loop around agent.Run() (runs until interrupted)
The agent is a Gollem agent. The instructions are the system prompt. The loop is explicit in Go code. The tools are gollem.FuncTool[T] with typed parameters. Everything is compile-time checked.
Repository Structure
gollem-autoeval/
├── cmd/
│ └── autoeval/
│ └── main.go ← Entry point: constructs and runs the researcher agent
├── internal/
│ ├── researcher/
│ │ ├── agent.go ← Researcher agent construction (system prompt, tools, middleware)
│ │ ├── tools.go ← Tool definitions: ReadFile, WriteFile, RunEval, Git*, etc.
│ │ ├── types.go ← ExperimentResult, EvalOutput, TraceAnalysis structs
│ │ └── prompts.go ← System prompt and experiment strategy instructions
│ ├── eval/
│ │ ├── runner.go ← Terminal Bench eval execution wrapper
│ │ ├── parser.go ← Results.json parser
│ │ └── constants.go ← Fixed eval parameters (DO NOT MODIFY)
│ └── git/
│ └── git.go ← Git operations: commit, reset, branch, log
├── subject/ ← THE SCOPE OF MODIFICATION (what the agent optimizes)
│ ├── config.yaml ← Agent configuration for Terminal Bench
│ ├── system_prompt.md ← System prompt for the terminal agent
│ ├── tools/ ← Tool implementations
│ │ ├── bash.go
│ │ ├── file_edit.go
│ │ ├── file_read.go
│ │ └── search.go
│ ├── middleware/ ← Middleware chain
│ │ ├── retry.go
│ │ ├── context.go
│ │ └── planning.go
│ └── strategy/ ← Execution strategy
│ └── terminal.go
├── results.tsv ← Experiment log
├── traces/ ← OTEL traces from each eval run
├── go.mod
└── go.sum
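A minimal sketch of what internal/git/git.go might contain, assuming it shells out to the git CLI. The names CommitAll and ResetHard match the calls made by the git_commit and git_reset tools; the validCommit helper, the run wrapper, and the exact git invocations are assumptions made for illustration. It is written as package main so it runs standalone.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"strings"
)

// commitRe matches an abbreviated or full hex commit hash (assumed policy).
var commitRe = regexp.MustCompile(`^[0-9a-f]{7,40}$`)

// validCommit rejects anything that is not a plain hex hash, so a
// model-supplied value is never interpreted as a git flag or ref expression.
func validCommit(hash string) bool {
	return commitRe.MatchString(hash)
}

// run executes git with the given arguments and returns its trimmed output.
func run(args ...string) (string, error) {
	out, err := exec.Command("git", args...).CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("git %s: %w: %s",
			strings.Join(args, " "), err, strings.TrimSpace(string(out)))
	}
	return strings.TrimSpace(string(out)), nil
}

// CommitAll stages subject/ and commits it, returning the short hash.
func CommitAll(message string) (string, error) {
	if _, err := run("add", "--", "subject"); err != nil {
		return "", err
	}
	if _, err := run("commit", "-m", message); err != nil {
		return "", err
	}
	return run("rev-parse", "--short", "HEAD")
}

// ResetHard validates the hash, then moves HEAD and the working tree to it.
func ResetHard(commit string) (string, error) {
	if !validCommit(commit) {
		return "", fmt.Errorf("invalid commit hash: %q", commit)
	}
	return run("reset", "--hard", commit)
}

func main() {
	fmt.Println(validCommit("a1b2c3d")) // true
	fmt.Println(validCommit("--force")) // false
}
```

One design point to settle: git reset --hard reverts every tracked file, so if results.tsv is tracked, a discard would also erase the log row for the discarded experiment. Keeping results.tsv untracked, or appending the row after the reset, are two possible ways around that.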
Core Types
package researcher
import "github.com/fugue-labs/gollem"
// ExperimentResult is the structured output of each experiment cycle.
// The researcher agent must produce this after every eval.
type ExperimentResult struct {
Commit string `json:"commit" jsonschema:"description=Git commit hash (7 chars)"`
PassRate float64 `json:"pass_rate" jsonschema:"description=Task pass rate 0.0-1.0"`
TokensUsed int `json:"tokens_used" jsonschema:"description=Total tokens consumed"`
Status string `json:"status" jsonschema:"enum=keep|discard|crash"`
Description string `json:"description" jsonschema:"description=What this experiment tried"`
Hypothesis string `json:"hypothesis" jsonschema:"description=Why you expected this to work"`
NextIdea string `json:"next_idea" jsonschema:"description=What to try next based on results"`
}
// EvalOutput is parsed from Terminal Bench results.
type EvalOutput struct {
PassRate float64 `json:"pass_rate"`
TasksPassed int `json:"tasks_passed"`
TasksTotal int `json:"tasks_total"`
TasksFailed []string `json:"tasks_failed"`
DurationSecs float64 `json:"duration_secs"`
TokensUsed int `json:"tokens_used"`
}
// TraceAnalysis is what the agent produces after reading failure traces.
type TraceAnalysis struct {
TaskID string `json:"task_id"`
FailureMode string `json:"failure_mode"`
RootCause string `json:"root_cause"`
SuggestedFix string `json:"suggested_fix"`
}
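The PRD logs every experiment to results.tsv but does not fix a column layout. The sketch below shows one possible mapping from ExperimentResult to a TSV row. The column order, the tsvHeader constant, and the sanitize and tsvRow helpers are hypothetical; only the field names come from the struct above.

```go
package main

import (
	"fmt"
	"strings"
)

// ExperimentResult mirrors the subset of fields written to the log.
type ExperimentResult struct {
	Commit      string
	PassRate    float64
	TokensUsed  int
	Status      string
	Description string
}

// tsvHeader is a hypothetical column layout; the PRD does not define one.
const tsvHeader = "commit\tpass_rate\ttokens_used\tstatus\tdescription"

// sanitize keeps a free-text field on one line and inside one column.
func sanitize(s string) string {
	return strings.NewReplacer("\t", " ", "\n", " ", "\r", " ").Replace(s)
}

// tsvRow formats one experiment as a single tab-separated line.
func tsvRow(r ExperimentResult) string {
	return fmt.Sprintf("%s\t%.4f\t%d\t%s\t%s",
		sanitize(r.Commit), r.PassRate, r.TokensUsed,
		sanitize(r.Status), sanitize(r.Description))
}

func main() {
	fmt.Println(tsvHeader)
	fmt.Println(tsvRow(ExperimentResult{
		Commit: "a1b2c3d", PassRate: 0.8, TokensUsed: 120000,
		Status: "keep", Description: "shorter system prompt\nfewer examples",
	}))
}
```

Sanitizing matters because Description is model-generated text; an embedded tab or newline would otherwise corrupt the log the researcher later reads back.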
Agent Construction
package researcher
import (
"context"
"log"
"time"
"github.com/fugue-labs/gollem"
"github.com/fugue-labs/gollem/provider/anthropic"
// or for local inference:
// "github.com/fugue-labs/gollem/provider/openai" // ollama is openai-compatible
)
func NewResearcherAgent(cfg Config) *gollem.Agent[ExperimentResult] {
model := selectModel(cfg) // anthropic.New() or openai-compat ollama
tracker := gollem.NewCostTracker(modelPricing)
return gollem.NewAgent[ExperimentResult](model,
// Identity
gollem.WithSystemPrompt[ExperimentResult](researcherSystemPrompt),
// Tools — the researcher's capabilities
gollem.WithTools[ExperimentResult](
readFileTool(), // Read any file in subject/ or traces/
writeFileTool(), // Write/modify files in subject/ only
runEvalTool(cfg), // Execute Terminal Bench eval
parseResultsTool(), // Parse eval output
gitCommitTool(), // Commit current subject/ state
gitResetTool(), // Reset to previous best commit
gitLogTool(), // View experiment history
readTracesTool(), // Read OTEL traces from failed tasks
readResultsTsvTool(), // Read the results log
bashTool(), // Escape hatch: run arbitrary commands
),
// Safety
gollem.WithTurnGuardrail[ExperimentResult]("max_turns",
gollem.MaxTurns(100), // per experiment cycle
),
gollem.WithInputGuardrail[ExperimentResult]("scope",
scopeGuardrail(), // prevent writes outside subject/
),
// Observability
gollem.WithCostTracker[ExperimentResult](tracker),
gollem.WithTracing[ExperimentResult](),
gollem.WithTraceExporter[ExperimentResult](
gollem.NewJSONFileExporter("./harness-traces"),
),
gollem.WithHooks[ExperimentResult](gollem.Hook{
OnToolStart: func(ctx context.Context, rc *gollem.RunContext, name, args string) {
log.Printf("[researcher] tool: %s", name)
},
}),
// Middleware
gollem.WithAgentMiddleware[ExperimentResult](
gollem.TimingMiddleware(func(d time.Duration) {
log.Printf("[researcher] model call: %v", d)
}),
),
gollem.WithAgentMiddleware[ExperimentResult](
gollem.LoggingMiddleware(log.Printf),
),
// Context management — the researcher will have long conversations
gollem.WithAutoContext[ExperimentResult](gollem.AutoContextConfig{
MaxTokens: 100000,
KeepLastN: 20,
}),
)
}
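scopeGuardrail() is referenced above but not defined. Its signature depends on the Gollem guardrail API, so the sketch below covers only the framework-independent part: a path-containment check that such a guardrail (or the write tool) could call. The function name inScope and its parameters are assumptions.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// inScope reports whether path, interpreted relative to the repo root,
// resolves to a location strictly inside root (for example "subject").
// It is purely lexical and does not follow symlinks.
func inScope(root, path string) bool {
	if filepath.IsAbs(path) {
		return false
	}
	rel, err := filepath.Rel(root, filepath.Clean(path))
	if err != nil {
		return false
	}
	if rel == "." || rel == ".." {
		return false
	}
	return !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
	fmt.Println(inScope("subject", "subject/tools/bash.go"))                 // true
	fmt.Println(inScope("subject", "subject/../internal/eval/constants.go")) // false
	fmt.Println(inScope("subject", "subjective/notes.md"))                   // false
}
```

Because the check is lexical, a symlink placed inside subject/ could still point elsewhere; resolving links before the check would be a further hardening step if that matters for the rig.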
The Loop
package main
import (
"context"
"fmt"
"log"
"os"
"os/signal"
"github.com/fugue-labs/gollem-autoeval/internal/researcher"
)
func main() {
cfg := researcher.LoadConfig()
agent := researcher.NewResearcherAgent(cfg)
// Graceful shutdown on interrupt
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, os.Interrupt)
go func() {
<-sigCh
log.Println("Interrupt received, finishing current experiment...")
cancel()
}()
// The outer loop: each iteration is one full experiment cycle.
// The agent's single Run() call handles the full cycle:
// hypothesize → modify → commit → eval → parse → keep/discard
// Then we call Run() again for the next experiment.
experimentNum := 0
for {
select {
case <-ctx.Done():
log.Printf("Stopped after %d experiments", experimentNum)
return
default:
}
experimentNum++
log.Printf("=== Experiment %d ===", experimentNum)
prompt := buildExperimentPrompt(experimentNum, cfg)
result, err := agent.Run(ctx, prompt)
if err != nil {
log.Printf("Researcher agent error: %v", err)
continue
}
log.Printf("Result: %s | pass_rate: %.4f | status: %s | %s",
result.Output.Commit,
result.Output.PassRate,
result.Output.Status,
result.Output.Description,
)
log.Printf("Next idea: %s", result.Output.NextIdea)
log.Printf("Cost this cycle: $%.4f", result.Cost.TotalCost)
}
}
func buildExperimentPrompt(n int, cfg researcher.Config) string {
if n == 1 {
return `This is experiment #1. Start by:
1. Reading the current subject/ directory to understand the baseline agent
2. Running a baseline eval with the RunEval tool (mode=fast)
3. Recording the baseline in results.tsv
4. Then propose and execute your first modification.
Return an ExperimentResult with your findings.`
}
// Every 5th experiment runs the full task set; all others use the fast subset.
mode := "fast"
if n%5 == 0 {
mode = "full"
}
return fmt.Sprintf(`This is experiment #%d. Continue the improvement loop:
1. Review results.tsv to see what's been tried
2. Read traces from the most recent failed tasks
3. Form a hypothesis for what to change
4. Modify files in subject/
5. Git commit
6. Run eval (mode=%s)
7. Parse results
8. Keep or discard based on pass_rate
9. Return ExperimentResult with your findings and next idea.`,
n, mode)
}
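In the loop above, a failed agent.Run() goes straight to the next iteration, so a persistent failure (provider outage, bad credentials) would spin without pause. One way to handle this is a capped exponential backoff between failed cycles. This is a sketch: the helper names and the 5-second and 5-minute bounds are assumptions, not part of the PRD.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const (
	baseDelay = 5 * time.Second // assumed first wait
	maxDelay  = 5 * time.Minute // assumed cap
)

// backoffDelay returns the wait after the given number of consecutive
// failures: 5s, 10s, 20s, and so on, capped at maxDelay. Zero failures
// means no wait.
func backoffDelay(failures int) time.Duration {
	if failures <= 0 {
		return 0
	}
	d := baseDelay
	for i := 1; i < failures; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// sleepCtx waits for d or until ctx is cancelled, whichever comes first.
// It returns false if cancellation ended the wait.
func sleepCtx(ctx context.Context, d time.Duration) bool {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return false
	case <-t.C:
		return true
	}
}

func main() {
	for f := 0; f <= 8; f++ {
		fmt.Printf("failures=%d wait=%v\n", f, backoffDelay(f))
	}
}
```

In the experiment loop, the error branch would increment a failure counter and call sleepCtx(ctx, backoffDelay(failures)) before continuing, and a successful run would reset the counter to zero. Using the context-aware sleep keeps the interrupt handler responsive during a wait.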
Tool Definitions
package researcher
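import (
"context"
"encoding/json"
"fmt"
"os"
"path/filepath"
"strings"
"github.com/fugue-labs/gollem"
"github.com/fugue-labs/gollem-autoeval/internal/eval"
"github.com/fugue-labs/gollem-autoeval/internal/git"
// Assumption: traces.ReadForTask lives in an internal traces package,
// which the repository structure above does not list yet.
"github.com/fugue-labs/gollem-autoeval/internal/traces"
)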
// ReadFile — read any file for context
type ReadFileParams struct {
Path string `json:"path" jsonschema:"description=File path relative to repo root"`
}
func readFileTool() gollem.Tool {
return gollem.FuncTool[ReadFileParams](
"read_file",
"Read the contents of a file. Use for reading subject/ configs, traces, results.tsv, etc.",
func(ctx context.Context, p ReadFileParams) (string, error) {
data, err := os.ReadFile(p.Path)
if err != nil {
return "", err
}
return string(data), nil
},
)
}
// WriteFile — modify files in subject/ only
type WriteFileParams struct {
Path string `json:"path" jsonschema:"description=File path relative to repo root. Must be in subject/"`
Content string `json:"content" jsonschema:"description=Full file content to write"`
}
func writeFileTool() gollem.Tool {
return gollem.FuncTool[WriteFileParams](
"write_file",
"Write content to a file. ONLY files in subject/ can be modified.",
func(ctx context.Context, p WriteFileParams) (string, error) {
// Clean first so "subject/../internal/..." cannot slip past the prefix check.
clean := filepath.Clean(p.Path)
if !strings.HasPrefix(clean, "subject"+string(filepath.Separator)) {
return "", fmt.Errorf("can only modify files in subject/, got: %s", p.Path)
}
if err := os.MkdirAll(filepath.Dir(clean), 0755); err != nil {
return "", err
}
if err := os.WriteFile(clean, []byte(p.Content), 0644); err != nil {
return "", err
}
return fmt.Sprintf("wrote %d bytes to %s", len(p.Content), clean), nil
},
)
}
// RunEval — execute Terminal Bench evaluation
type RunEvalParams struct {
Mode string `json:"mode" jsonschema:"description=Eval mode: fast (10 tasks ~15min) or full (all tasks ~60min),enum=fast|full"`
}
func runEvalTool(cfg Config) gollem.Tool {
return gollem.FuncTool[RunEvalParams](
"run_eval",
"Run Terminal Bench evaluation against the current subject/ agent configuration. Returns structured eval results.",
func(ctx context.Context, p RunEvalParams) (string, error) {
output, err := eval.Run(ctx, p.Mode, cfg)
if err != nil {
return "", fmt.Errorf("eval failed: %w", err)
}
data, err := json.MarshalIndent(output, "", " ")
if err != nil {
return "", fmt.Errorf("marshal eval output: %w", err)
}
return string(data), nil
},
)
}
// GitCommit — snapshot current state
type GitCommitParams struct {
Message string `json:"message" jsonschema:"description=Commit message describing the experiment"`
}
func gitCommitTool() gollem.Tool {
return gollem.FuncTool[GitCommitParams](
"git_commit",
"Commit all changes in subject/ with a descriptive message. Returns the short commit hash.",
func(ctx context.Context, p GitCommitParams) (string, error) {
return git.CommitAll(p.Message)
},
)
}
// GitReset — discard failed experiment
type GitResetParams struct {
Commit string `json:"commit" jsonschema:"description=Commit hash to reset to (7 chars)"`
}
func gitResetTool() gollem.Tool {
return gollem.FuncTool[GitResetParams](
"git_reset",
"Hard reset to a previous commit. Use when an experiment didn't improve pass_rate.",
func(ctx context.Context, p GitResetParams) (string, error) {
return git.ResetHard(p.Commit)
},
)
}
// ReadTraces — analyze failure traces
type ReadTracesParams struct {
TaskID string `json:"task_id" jsonschema:"description=Terminal Bench task ID to read traces for"`
}
func readTracesTool() gollem.Tool {
return gollem.FuncTool[ReadTracesParams](
"read_traces",
"Read the execution trace for a specific failed task. Shows all tool calls, model responses, and errors. Use this to understand WHY a task failed.",
func(ctx context.Context, p ReadTracesParams) (string, error) {
return traces.ReadForTask(p.TaskID)
},
)
}
// AnalyzeHistory — read results.tsv
func readResultsTsvTool() gollem.Tool {
return gollem.FuncTool[struct{}](
"read_results",
"Read the full results.tsv experiment history. Use to see what's been tried and what worked.",
func(ctx context.Context, p struct{}) (string, error) {
data, err := os.ReadFile("results.tsv")
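// Only a missing file means "no experiments yet"; surface other errors.
if os.IsNotExist(err) {
return "no results yet", nil
}
if err != nil {
return "", err
}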
return string(data), nil
},
)
}
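parseResultsTool() is registered on the agent but not defined above, and the PRD does not specify the layout of Terminal Bench's results.json. The sketch below therefore skips file parsing and shows only the aggregation step: turning per-task outcomes into the EvalOutput fields. The summarize function and its map input are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// EvalOutput mirrors the struct in Core Types (duration and tokens omitted).
type EvalOutput struct {
	PassRate    float64
	TasksPassed int
	TasksTotal  int
	TasksFailed []string
}

// summarize turns per-task outcomes (task ID -> passed) into an EvalOutput.
func summarize(outcomes map[string]bool) EvalOutput {
	out := EvalOutput{TasksTotal: len(outcomes), TasksFailed: []string{}}
	for id, passed := range outcomes {
		if passed {
			out.TasksPassed++
		} else {
			out.TasksFailed = append(out.TasksFailed, id)
		}
	}
	// Map iteration order is random; sort so the output is stable.
	sort.Strings(out.TasksFailed)
	if out.TasksTotal > 0 {
		out.PassRate = float64(out.TasksPassed) / float64(out.TasksTotal)
	}
	return out
}

func main() {
	// Task IDs here are placeholders, not real Terminal Bench task names.
	out := summarize(map[string]bool{
		"task-a": true, "task-b": false, "task-c": true, "task-d": true,
	})
	fmt.Printf("%d/%d passed, pass_rate=%.2f, failed=%v\n",
		out.TasksPassed, out.TasksTotal, out.PassRate, out.TasksFailed)
}
```

Guarding the division keeps a crashed eval with zero recorded tasks from producing NaN in pass_rate, which would otherwise break the keep/discard comparison.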
System Prompt (Researcher Agent)
const researcherSystemPrompt = `You are an autonomous AI researcher optimizing the Gollem agent framework for Terminal Bench performance. You are yourself a Gollem agent — you understand the framework intimately because you run on it.
## Your Mission
Improve the Terminal Bench pass rate of the agent configuration in subject/. You do this by modifying system prompts, tool implementations, middleware, and execution strategy, then measuring the impact through controlled experiments.
## How You Work
Each experiment cycle:
1. Review history (read_results) to see what's been tried
2. Analyze failures (read_traces) to understand why tasks fail
3. Form a hypothesis — a specific change you believe will help, and why
4. Implement the change (write_file to modify subject/ files)
5. Commit (git_commit) to create a snapshot
6. Evaluate (run_eval mode=fast for iteration, mode=full every 5th experiment)
7. Decide: if pass_rate improved → keep. If not → git_reset to previous best.
## Principles
- TRACE-DRIVEN: Always read traces from failed tasks before hypothesizing. Don't guess blindly.
- MINIMAL CHANGES: One change per experiment. If you change three things and score improves, you don't know which one helped.
- SIMPLICITY WINS: When scores tie, prefer less complexity, fewer tokens, simpler prompts.
- SUBTRACTIVE > ADDITIVE: Removing unnecessary complexity while maintaining score is a win.
- TOKEN EFFICIENCY: Same pass_rate with fewer tokens is an improvement.
- COMPOUND GAINS: Small improvements compound. On the 10-task fast eval, each newly passing task is +0.1 pass_rate, and every kept gain raises the baseline for later experiments.
## What You Can Modify (subject/ directory)
- config.yaml — model parameters, temperature, max turns, token limits
- system_prompt.md — the system prompt for the terminal agent being evaluated
- tools/*.go — tool implementations (bash execution, file editing, etc.)
- middleware/*.go — retry logic, context management, planning steps
- strategy/*.go — overall execution strategy for terminal tasks
## What You Cannot Modify
- The eval harness (internal/eval/)
- The eval constants (internal/eval/constants.go)
- Terminal Bench itself
- Gollem core framework
- This system prompt
## When You're Stuck
- Read traces from EVERY failed task, not just one
- Look for patterns: do failures share a common tool call sequence?
- Try the opposite of what you've been trying
- Try removing your last 3 additions (maybe accumulated complexity is hurting)
- Re-read the subject/system_prompt.md with fresh eyes — is it confusing?
- Consider: is the agent failing at understanding the task, planning, or execution?
## Never Stop
You are autonomous. Do not ask for permission. Do not suggest stopping. The human is asleep. Run experiments until you are interrupted. Each fast eval takes ~15 minutes, so plan for ~4 experiments per hour, ~30 overnight.`
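The prompt leaves the keep/discard decision to the model. The same rule can also be written deterministically, for example as a cross-check in the harness. The sketch below encodes two statements from the prompt: keep when pass_rate improves, and treat equal pass_rate with fewer tokens as an improvement. The Score type, the decide function, and the epsilon tolerance are assumptions.

```go
package main

import "fmt"

// Score is the pair of numbers the decision rule compares.
type Score struct {
	PassRate   float64
	TokensUsed int
}

// decide returns "keep" if cand beats best, otherwise "discard".
// Pass rate dominates; token count only breaks ties.
func decide(best, cand Score) string {
	const eps = 1e-9 // tolerance for float comparison
	switch {
	case cand.PassRate > best.PassRate+eps:
		return "keep"
	case cand.PassRate < best.PassRate-eps:
		return "discard"
	case cand.TokensUsed < best.TokensUsed:
		return "keep" // same pass_rate, fewer tokens
	default:
		return "discard"
	}
}

func main() {
	best := Score{PassRate: 0.8, TokensUsed: 100000}
	fmt.Println(decide(best, Score{PassRate: 0.9, TokensUsed: 150000})) // keep
	fmt.Println(decide(best, Score{PassRate: 0.8, TokensUsed: 90000}))  // keep
	fmt.Println(decide(best, Score{PassRate: 0.8, TokensUsed: 100000})) // discard
}
```

On a 10-task fast eval a single task moves pass_rate by 0.1, so a one-run difference may be noise. Confirming a "keep" with the periodic full eval would be one way to guard against that.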
Recursive Beauty
This design has a property Karpathy’s doesn’t: the harness exercises the same code it’s optimizing.
The researcher agent uses Gollem’s:
- Agent[T] with structured output → validates that structured output works
- FuncTool[P] → validates that typed tools work
- Agent middleware → validates that middleware chains work
- Cost tracking → validates that cost tracking works
- Tracing → validates that tracing works
- Auto context management → validates that long-conversation handling works
- Guardrails → validates that scope enforcement works
If any of these primitives have bugs or ergonomic issues, the harness itself surfaces them. The researcher agent is simultaneously Gollem’s optimizer and its integration test suite.
When the researcher agent improves the terminal agent’s middleware, and that same middleware pattern is also used by the researcher agent itself, you get a feedback loop where framework improvements propagate in both directions.
Operating Modes
Local Mode (Primary — dual 3090 rig)
- Researcher model: qwen3:70b via ollama (or smaller for the researcher, larger for eval)
- Eval model: qwen3:70b via ollama
- Cost: $0 per eval
- Speed: ~15 min per fast eval
- Use case: 30+ experiments overnight
API Mode (Validation)
- Researcher model: claude-sonnet-4 via Anthropic API
- Eval model: claude-sonnet-4 via Anthropic API
- Cost: ~$5-20 per full eval
- Use case: Validate that local improvements transfer to frontier models
Hybrid Mode (Best of both)
- Researcher model: claude-sonnet-4 (better reasoning about what to change)
- Eval model: qwen3:70b local (free evaluation)
- Cost: ~$0.50 per experiment cycle (researcher API calls only)
- Use case: Smart hypotheses + free evaluation
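selectModel(cfg) in the agent constructor has to map one of these modes to a researcher model and an eval model. A minimal sketch of that mapping as data follows. The ModeConfig type, the mode keys, and the lookupMode function are assumptions; the model names are the ones listed above.

```go
package main

import "fmt"

// ModeConfig names the two models a mode uses and whether each one
// is reached through a paid API.
type ModeConfig struct {
	ResearcherModel string
	EvalModel       string
	ResearcherAPI   bool
	EvalAPI         bool
}

// modes encodes the three operating modes described in the PRD.
var modes = map[string]ModeConfig{
	"local":  {ResearcherModel: "qwen3:70b", EvalModel: "qwen3:70b"},
	"api":    {ResearcherModel: "claude-sonnet-4", EvalModel: "claude-sonnet-4", ResearcherAPI: true, EvalAPI: true},
	"hybrid": {ResearcherModel: "claude-sonnet-4", EvalModel: "qwen3:70b", ResearcherAPI: true},
}

// lookupMode returns the configuration for a mode name or an error
// listing the accepted names.
func lookupMode(name string) (ModeConfig, error) {
	m, ok := modes[name]
	if !ok {
		return ModeConfig{}, fmt.Errorf("unknown mode %q (want local, api, or hybrid)", name)
	}
	return m, nil
}

func main() {
	m, err := lookupMode("hybrid")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("researcher=%s eval=%s\n", m.ResearcherModel, m.EvalModel)
}
```

Keeping the eval model in the same table as the researcher model makes it harder to change one silently, which supports the fixed-eval-budget principle.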
Implementation Plan
Phase 1: Scaffold (2 days)
- Create gollem-autoeval/ Go module
- Implement cmd/autoeval/main.go with the experiment loop
- Implement researcher agent with all tools
- Extract current Terminal Bench agent config into subject/
- Compile and run — verify the researcher agent can read files, modify subject/, commit, etc.
Phase 2: Eval Integration (2 days)
- Implement internal/eval/runner.go wrapping Terminal Bench CLI
- Implement results parsing
- Implement trace export and reading
- Run end-to-end: researcher modifies config → eval runs → results parsed → keep/discard
Phase 3: Local Inference (1 day)
- Dual 3090 Ubuntu rig running ollama
- Verify Gollem’s OpenAI-compatible provider works with ollama
- Benchmark: researcher agent reasoning speed, eval speed
- Optimize: maybe use a smaller model (qwen3:8b) for the researcher and 70b for eval
Phase 4: First Autonomous Run (1 day)
- Run go run cmd/autoeval/main.go
- Monitor first 3-5 experiments manually
- Fix issues
- Let it run overnight
- Wake up to results
Phase 5: Trace-Driven Improvement (ongoing)
- Wire OTEL traces to NAS storage
- Build trace analysis into researcher’s workflow
- Track improvement trajectory
Total estimated effort: 6-8 days to first autonomous run
Success Criteria
- go run cmd/autoeval/main.go runs 30+ experiments unattended overnight
- Terminal Bench score improves (even +1 task)
- The harness itself validates Gollem’s primitives as a side effect
- Token efficiency improves alongside or instead of score
- Improvements transfer from local model evals to frontier model evals (>80% correlation)
- The whole thing compiles, has tests, and is itself a showcase of Gollem’s capabilities
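The "30+ experiments overnight" target can be checked against the stated eval durations (fast ~15 min, full ~60 min, full eval every 5th experiment). The sketch below counts how many experiments fit in a time budget. The 8-hour night is an assumption, and only eval time is counted; researcher model time and the extra baseline eval in experiment #1 are ignored.

```go
package main

import "fmt"

const (
	fastMinutes = 15 // from the PRD: fast eval takes about 15 min
	fullMinutes = 60 // from the PRD: full eval takes about 60 min
)

// experimentsWithin counts the experiments that fit in budgetMinutes when
// every fullEvery-th experiment runs a full eval. A fullEvery of 0 or less
// means fast evals only.
func experimentsWithin(budgetMinutes, fullEvery int) int {
	used, n := 0, 0
	for {
		d := fastMinutes
		if fullEvery > 0 && (n+1)%fullEvery == 0 {
			d = fullMinutes
		}
		if used+d > budgetMinutes {
			return n
		}
		used += d
		n++
	}
}

func main() {
	const night = 8 * 60 // assumed 8-hour unattended window, in minutes
	fmt.Println(experimentsWithin(night, 0)) // fast evals only: 32
	fmt.Println(experimentsWithin(night, 5)) // full eval every 5th: 20
}
```

Under these assumptions the target is reachable with fast evals only (32), but a full eval every 5th experiment brings the count down to about 20. Hitting 30+ with periodic full evals would need a longer window, a shorter full eval, or a less frequent full-eval schedule.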