Gollem Autonomous Improvement Harness #32

PRD: Gollem Autonomous Improvement Harness

Author: Trevor Prater / Fugue Labs
Status: Draft v2
Date: 2026-03-07
Inspiration: karpathy/autoresearch


Problem Statement

Gollem scored 93% (14/15 tasks) on Terminal Bench 2. Improving beyond this requires iterating on the agent framework itself — modifying tool implementations, prompt strategies, middleware chains, reasoning patterns, and orchestration logic. Currently, this iteration is manual and expensive.

Karpathy’s autoresearch demonstrates that this loop can be fully autonomous. We adapt his pattern with a critical twist: the harness itself is a Gollem agent, written in Go, using the Gollem framework. Gollem improves Gollem. The framework is both the subject and the instrument of optimization.


Design Philosophy

  1. Gollem all the way down. The harness is a Go program using github.com/fugue-labs/gollem. The researcher agent that modifies code, runs evals, and decides keep/discard is a gollem.Agent[ExperimentResult] with typed tools.
  2. Single metric. Terminal Bench task pass rate.
  3. Fixed eval budget. Each eval run uses the same task set, same model, same time limits.
  4. Git-based versioning. Every experiment is a commit. Keep = advance. Discard = reset.
  5. Never stop. The agent runs indefinitely until manually interrupted.
  6. Eat your own dog food. Every Gollem primitive — middleware, guardrails, cost tracking, tracing, structured output — is exercised by the harness itself. If the framework has a gap, the harness exposes it.

Architecture

Karpathy’s Pattern

Human → "read program.md and start" → Claude Code
    Claude Code modifies train.py
    Claude Code runs `uv run train.py`
    Claude Code parses results
    Claude Code keeps/discards
    LOOP

The agent is Claude Code. The instructions are in program.md. The loop is implicit in the conversation.

Gollem AutoEval Pattern

Human → `go run cmd/autoeval/main.go`
    Gollem Agent[ExperimentResult] runs autonomously
    ├── Tool: ReadFile (read agent config, traces, results)
    ├── Tool: WriteFile (modify agent config, prompts, tools)
    ├── Tool: GitCommit (snapshot experiment)
    ├── Tool: RunEval (execute Terminal Bench subset)
    ├── Tool: ParseResults (extract pass_rate from results.json)
    ├── Tool: GitReset (discard failed experiment)
    ├── Tool: ReadTraces (analyze why tasks failed)
    └── Tool: AnalyzeHistory (review results.tsv for patterns)
    LOOP via an explicit Go for loop around agent.Run() (runs until interrupted)

The agent is a Gollem agent. The instructions are the system prompt. The loop is explicit in Go code. The tools are gollem.FuncTool[T] with typed parameters. Everything is compile-time checked.


Repository Structure

gollem-autoeval/
├── cmd/
│   └── autoeval/
│       └── main.go              ← Entry point: constructs and runs the researcher agent
├── internal/
│   ├── researcher/
│   │   ├── agent.go             ← Researcher agent construction (system prompt, tools, middleware)
│   │   ├── tools.go             ← Tool definitions: ReadFile, WriteFile, RunEval, Git*, etc.
│   │   ├── types.go             ← ExperimentResult, EvalOutput, TraceAnalysis structs
│   │   └── prompts.go           ← System prompt and experiment strategy instructions
│   ├── eval/
│   │   ├── runner.go            ← Terminal Bench eval execution wrapper
│   │   ├── parser.go            ← Results.json parser
│   │   └── constants.go         ← Fixed eval parameters (DO NOT MODIFY)
│   └── git/
│       └── git.go               ← Git operations: commit, reset, branch, log
├── subject/                     ← THE SCOPE OF MODIFICATION (what the agent optimizes)
│   ├── config.yaml              ← Agent configuration for Terminal Bench
│   ├── system_prompt.md         ← System prompt for the terminal agent
│   ├── tools/                   ← Tool implementations
│   │   ├── bash.go
│   │   ├── file_edit.go
│   │   ├── file_read.go
│   │   └── search.go
│   ├── middleware/               ← Middleware chain
│   │   ├── retry.go
│   │   ├── context.go
│   │   └── planning.go
│   └── strategy/                ← Execution strategy
│       └── terminal.go
├── results.tsv                  ← Experiment log
├── traces/                      ← OTEL traces from each eval run
├── go.mod
└── go.sum

Core Types

package researcher

import "github.com/fugue-labs/gollem"

// ExperimentResult is the structured output of each experiment cycle.
// The researcher agent must produce this after every eval.
type ExperimentResult struct {
    Commit      string  `json:"commit" jsonschema:"description=Git commit hash (7 chars)"`
    PassRate    float64 `json:"pass_rate" jsonschema:"description=Task pass rate 0.0-1.0"`
    TokensUsed  int     `json:"tokens_used" jsonschema:"description=Total tokens consumed"`
    Status      string  `json:"status" jsonschema:"enum=keep|discard|crash"`
    Description string  `json:"description" jsonschema:"description=What this experiment tried"`
    Hypothesis  string  `json:"hypothesis" jsonschema:"description=Why you expected this to work"`
    NextIdea    string  `json:"next_idea" jsonschema:"description=What to try next based on results"`
}

// EvalOutput is parsed from Terminal Bench results.
type EvalOutput struct {
    PassRate     float64  `json:"pass_rate"`
    TasksPassed  int      `json:"tasks_passed"`
    TasksTotal   int      `json:"tasks_total"`
    TasksFailed  []string `json:"tasks_failed"`
    DurationSecs float64  `json:"duration_secs"`
    TokensUsed   int      `json:"tokens_used"`
}

// TraceAnalysis is what the agent produces after reading failure traces.
type TraceAnalysis struct {
    TaskID       string `json:"task_id"`
    FailureMode  string `json:"failure_mode"`
    RootCause    string `json:"root_cause"`
    SuggestedFix string `json:"suggested_fix"`
}
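
The relationship between per-task outcomes and EvalOutput can be sketched as follows. This is a minimal illustration of what internal/eval/parser.go might do, not the actual parser: the input shape (a JSON array of task_id/passed records) and the names taskResult and summarize are assumptions, since this PRD does not specify the results.json schema.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// EvalOutput mirrors the struct above, trimmed to the fields computed here.
type EvalOutput struct {
    PassRate    float64  `json:"pass_rate"`
    TasksPassed int      `json:"tasks_passed"`
    TasksTotal  int      `json:"tasks_total"`
    TasksFailed []string `json:"tasks_failed"`
}

// taskResult is a hypothetical per-task record; the real results.json
// schema may differ.
type taskResult struct {
    TaskID string `json:"task_id"`
    Passed bool   `json:"passed"`
}

// summarize parses raw per-task results and derives the aggregate fields.
func summarize(raw []byte) (EvalOutput, error) {
    var tasks []taskResult
    if err := json.Unmarshal(raw, &tasks); err != nil {
        return EvalOutput{}, fmt.Errorf("parse results: %w", err)
    }
    if len(tasks) == 0 {
        // Avoid dividing by zero and reporting a meaningless pass rate.
        return EvalOutput{}, fmt.Errorf("parse results: no tasks found")
    }
    out := EvalOutput{TasksTotal: len(tasks)}
    for _, t := range tasks {
        if t.Passed {
            out.TasksPassed++
        } else {
            out.TasksFailed = append(out.TasksFailed, t.TaskID)
        }
    }
    out.PassRate = float64(out.TasksPassed) / float64(out.TasksTotal)
    return out, nil
}

func main() {
    raw := []byte(`[{"task_id":"a","passed":true},{"task_id":"b","passed":false}]`)
    out, err := summarize(raw)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%d/%d passed, rate %.2f, failed %v\n",
        out.TasksPassed, out.TasksTotal, out.PassRate, out.TasksFailed)
}
```

Treating an empty result set as an error rather than a 0.0 pass rate keeps a crashed eval from being logged as a legitimate (bad) experiment.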

Agent Construction

package researcher

import (
    "context"
    "log"
    "time"

    "github.com/fugue-labs/gollem"
    "github.com/fugue-labs/gollem/provider/anthropic"
    // or for local inference:
    // "github.com/fugue-labs/gollem/provider/openai" // ollama is openai-compatible
)

func NewResearcherAgent(cfg Config) *gollem.Agent[ExperimentResult] {
    model := selectModel(cfg) // anthropic.New() or openai-compat ollama

    tracker := gollem.NewCostTracker(modelPricing)

    return gollem.NewAgent[ExperimentResult](model,
        // Identity
        gollem.WithSystemPrompt[ExperimentResult](researcherSystemPrompt),

        // Tools — the researcher's capabilities
        gollem.WithTools[ExperimentResult](
            readFileTool(),        // Read any file in subject/ or traces/
            writeFileTool(),       // Write/modify files in subject/ only
            runEvalTool(cfg),      // Execute Terminal Bench eval
            parseResultsTool(),    // Parse eval output
            gitCommitTool(),       // Commit current subject/ state
            gitResetTool(),        // Reset to previous best commit
            gitLogTool(),          // View experiment history
            readTracesTool(),      // Read OTEL traces from failed tasks
            readResultsTsvTool(),  // Read the results log
            bashTool(),            // Escape hatch: run arbitrary commands
        ),

        // Safety
        gollem.WithTurnGuardrail[ExperimentResult]("max_turns",
            gollem.MaxTurns(100), // per experiment cycle
        ),
        gollem.WithInputGuardrail[ExperimentResult]("scope",
            scopeGuardrail(), // prevent writes outside subject/
        ),

        // Observability
        gollem.WithCostTracker[ExperimentResult](tracker),
        gollem.WithTracing[ExperimentResult](),
        gollem.WithTraceExporter[ExperimentResult](
            gollem.NewJSONFileExporter("./harness-traces"),
        ),
        gollem.WithHooks[ExperimentResult](gollem.Hook{
            OnToolStart: func(ctx context.Context, rc *gollem.RunContext, name, args string) {
                log.Printf("[researcher] tool: %s", name)
            },
        }),

        // Middleware
        gollem.WithAgentMiddleware[ExperimentResult](
            gollem.TimingMiddleware(func(d time.Duration) {
                log.Printf("[researcher] model call: %v", d)
            }),
        ),
        gollem.WithAgentMiddleware[ExperimentResult](
            gollem.LoggingMiddleware(log.Printf),
        ),

        // Context management — the researcher will have long conversations
        gollem.WithAutoContext[ExperimentResult](gollem.AutoContextConfig{
            MaxTokens: 100000,
            KeepLastN: 20,
        }),
    )
}

The Loop

package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "os/signal"

    "github.com/fugue-labs/gollem-autoeval/internal/researcher"
)

func main() {
    cfg := researcher.LoadConfig()
    agent := researcher.NewResearcherAgent(cfg)

    // Graceful shutdown on interrupt
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, os.Interrupt)
    go func() {
        <-sigCh
        log.Println("Interrupt received, cancelling the current experiment...")
        cancel()
    }()

    // The outer loop: each iteration is one full experiment cycle.
    // The agent's single Run() call handles the full cycle:
    //   hypothesize → modify → commit → eval → parse → keep/discard
    // Then we call Run() again for the next experiment.
    experimentNum := 0
    for {
        select {
        case <-ctx.Done():
            log.Printf("Stopped after %d experiments", experimentNum)
            return
        default:
        }

        experimentNum++
        log.Printf("=== Experiment %d ===", experimentNum)

        prompt := buildExperimentPrompt(experimentNum, cfg)
        result, err := agent.Run(ctx, prompt)
        if err != nil {
            log.Printf("Researcher agent error: %v", err)
            continue
        }

        log.Printf("Result: %s | pass_rate: %.4f | status: %s | %s",
            result.Output.Commit,
            result.Output.PassRate,
            result.Output.Status,
            result.Output.Description,
        )
        log.Printf("Next idea: %s", result.Output.NextIdea)
        log.Printf("Cost this cycle: $%.4f", result.Cost.TotalCost)
    }
}

func buildExperimentPrompt(n int, cfg researcher.Config) string {
    if n == 1 {
        return `This is experiment #1. Start by:
1. Reading the current subject/ directory to understand the baseline agent
2. Running a baseline eval with the RunEval tool (mode=fast)
3. Recording the baseline in results.tsv
4. Then propose and execute your first modification.
Return an ExperimentResult with your findings.`
    }

    // Every 5th experiment uses the full task set; all others use the fast subset.
    mode := "fast"
    if n%5 == 0 {
        mode = "full"
    }

    return fmt.Sprintf(`This is experiment #%d. Continue the improvement loop:
1. Review results.tsv to see what's been tried
2. Read traces from the most recent failed tasks
3. Form a hypothesis for what to change
4. Modify files in subject/
5. Git commit
6. Run eval (mode=%s)
7. Parse results
8. Keep or discard based on pass_rate
9. Return ExperimentResult with your findings and next idea.`,
        n, mode)
}
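
The loop above retries immediately when agent.Run returns an error, so a persistent failure (for example, an unreachable model endpoint) would spin without pause. One possible mitigation is a capped exponential backoff, sketched below. The function name backoffDelay and the base and cap values are assumptions for illustration, not part of the design above.

```go
package main

import (
    "fmt"
    "time"
)

// backoffDelay returns how long to wait before the next experiment after
// `failures` consecutive agent errors: 0 when there are none, then
// 10s, 20s, 40s, ... capped at 5 minutes. Both constants are illustrative.
func backoffDelay(failures int) time.Duration {
    const (
        base     = 10 * time.Second
        maxDelay = 5 * time.Minute
    )
    if failures <= 0 {
        return 0
    }
    d := base
    for i := 1; i < failures; i++ {
        d *= 2
        if d >= maxDelay {
            return maxDelay
        }
    }
    return d
}

func main() {
    for f := 0; f <= 7; f++ {
        fmt.Printf("failures=%d wait=%v\n", f, backoffDelay(f))
    }
}
```

In the outer loop, a counter of consecutive failures would be incremented in the error branch, reset after a successful Run, and passed to this function before the next iteration.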

Tool Definitions

package researcher

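import (
    "context"
    "encoding/json"
    "fmt"
    "os"
    "path/filepath"
    "strings"

    "github.com/fugue-labs/gollem"
    "github.com/fugue-labs/gollem-autoeval/internal/eval"
    "github.com/fugue-labs/gollem-autoeval/internal/git"
    // traces.ReadForTask (used by read_traces below) needs a traces package;
    // none is listed in the repository structure, so its import path is left open.
)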

// ReadFile — read any file for context
type ReadFileParams struct {
    Path string `json:"path" jsonschema:"description=File path relative to repo root"`
}

func readFileTool() gollem.Tool {
    return gollem.FuncTool[ReadFileParams](
        "read_file",
        "Read the contents of a file. Use for reading subject/ configs, traces, results.tsv, etc.",
        func(ctx context.Context, p ReadFileParams) (string, error) {
            data, err := os.ReadFile(p.Path)
            if err != nil {
                return "", err
            }
            return string(data), nil
        },
    )
}

// WriteFile — modify files in subject/ only
type WriteFileParams struct {
    Path    string `json:"path" jsonschema:"description=File path relative to repo root. Must be in subject/"`
    Content string `json:"content" jsonschema:"description=Full file content to write"`
}

func writeFileTool() gollem.Tool {
    return gollem.FuncTool[WriteFileParams](
        "write_file",
        "Write content to a file. ONLY files in subject/ can be modified.",
        func(ctx context.Context, p WriteFileParams) (string, error) {
            // Clean the path first so "subject/../internal/eval/constants.go"
            // cannot slip past the prefix check.
            path := filepath.Clean(p.Path)
            if !strings.HasPrefix(path, "subject"+string(filepath.Separator)) {
                return "", fmt.Errorf("can only modify files in subject/, got: %s", p.Path)
            }
            if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
                return "", err
            }
            if err := os.WriteFile(path, []byte(p.Content), 0644); err != nil {
                return "", err
            }
            return fmt.Sprintf("wrote %d bytes to %s", len(p.Content), path), nil
        },
    )
}

// RunEval — execute Terminal Bench evaluation
type RunEvalParams struct {
    Mode string `json:"mode" jsonschema:"description=Eval mode: fast (10 tasks ~15min) or full (all tasks ~60min),enum=fast|full"`
}

func runEvalTool(cfg Config) gollem.Tool {
    return gollem.FuncTool[RunEvalParams](
        "run_eval",
        "Run Terminal Bench evaluation against the current subject/ agent configuration. Returns structured eval results.",
        func(ctx context.Context, p RunEvalParams) (string, error) {
            output, err := eval.Run(ctx, p.Mode, cfg)
            if err != nil {
                return "", fmt.Errorf("eval failed: %w", err)
            }
            data, err := json.MarshalIndent(output, "", "  ")
            if err != nil {
                return "", fmt.Errorf("encode eval output: %w", err)
            }
            return string(data), nil
        },
    )
}
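
The "fixed eval budget" principle implies that each mode maps to a constant task count and time limit that the researcher cannot change. A minimal sketch of how internal/eval/constants.go might express this follows. The Budget type, the budgetFor function, and the timeout values are assumptions; only the task counts (fast = 10 tasks, full = all tasks) and the rough durations (~15 and ~60 minutes) come from this document.

```go
package main

import (
    "context"
    "fmt"
    "time"
)

// Budget is a hypothetical shape for the fixed per-mode eval parameters.
type Budget struct {
    MaxTasks int           // 0 means "all tasks"
    Timeout  time.Duration // hard wall-clock limit for the whole run
}

// Timeouts are assumed to be twice the expected duration of each mode.
var budgets = map[string]Budget{
    "fast": {MaxTasks: 10, Timeout: 30 * time.Minute},
    "full": {MaxTasks: 0, Timeout: 120 * time.Minute},
}

// budgetFor rejects any mode other than the two fixed ones.
func budgetFor(mode string) (Budget, error) {
    b, ok := budgets[mode]
    if !ok {
        return Budget{}, fmt.Errorf("unknown eval mode %q (want fast or full)", mode)
    }
    return b, nil
}

func main() {
    b, err := budgetFor("fast")
    if err != nil {
        panic(err)
    }
    // A deadline-bound context keeps a hung eval from stalling the outer loop.
    ctx, cancel := context.WithTimeout(context.Background(), b.Timeout)
    defer cancel()
    deadline, _ := ctx.Deadline()
    fmt.Printf("fast: %d tasks, deadline in %v\n",
        b.MaxTasks, time.Until(deadline).Round(time.Minute))
}
```

Validating the mode in Go, rather than relying only on the enum in the tool schema, means a malformed tool call fails fast instead of starting an eval with undefined parameters.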

// GitCommit — snapshot current state
type GitCommitParams struct {
    Message string `json:"message" jsonschema:"description=Commit message describing the experiment"`
}

func gitCommitTool() gollem.Tool {
    return gollem.FuncTool[GitCommitParams](
        "git_commit",
        "Commit all changes in subject/ with a descriptive message. Returns the short commit hash.",
        func(ctx context.Context, p GitCommitParams) (string, error) {
            return git.CommitAll(p.Message)
        },
    )
}

// GitReset — discard failed experiment
type GitResetParams struct {
    Commit string `json:"commit" jsonschema:"description=Commit hash to reset to (7 chars)"`
}

func gitResetTool() gollem.Tool {
    return gollem.FuncTool[GitResetParams](
        "git_reset",
        "Hard reset to a previous commit. Use when an experiment didn't improve pass_rate.",
        func(ctx context.Context, p GitResetParams) (string, error) {
            return git.ResetHard(p.Commit)
        },
    )
}
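
The git_commit and git_reset tools delegate to git.CommitAll and git.ResetHard, which are not shown in this PRD. A minimal sketch of what internal/git/git.go might contain is below, shelling out to the git CLI via os/exec. The helper names runGit and isCommitHash are assumptions, as is the choice to validate the model-supplied hash before passing it to git.

```go
package main

import (
    "fmt"
    "os/exec"
    "regexp"
    "strings"
)

// hashRe matches an abbreviated or full hexadecimal commit hash (7 to 40 chars).
var hashRe = regexp.MustCompile(`^[0-9a-f]{7,40}$`)

// isCommitHash reports whether s looks like a commit hash. This keeps
// model-supplied strings such as "--help" or "HEAD~3" off the git command line.
func isCommitHash(s string) bool { return hashRe.MatchString(s) }

// runGit executes git with the given arguments and returns its trimmed output.
func runGit(args ...string) (string, error) {
    out, err := exec.Command("git", args...).CombinedOutput()
    if err != nil {
        return "", fmt.Errorf("git %s: %w: %s",
            strings.Join(args, " "), err, strings.TrimSpace(string(out)))
    }
    return strings.TrimSpace(string(out)), nil
}

// CommitAll stages subject/ and commits it, returning the short commit hash.
func CommitAll(message string) (string, error) {
    if _, err := runGit("add", "--", "subject"); err != nil {
        return "", err
    }
    if _, err := runGit("commit", "-m", message); err != nil {
        return "", err
    }
    return runGit("rev-parse", "--short", "HEAD")
}

// ResetHard discards working-tree changes and moves HEAD to the given commit.
func ResetHard(commit string) (string, error) {
    if !isCommitHash(commit) {
        return "", fmt.Errorf("refusing to reset: %q is not a commit hash", commit)
    }
    return runGit("reset", "--hard", commit)
}

func main() {
    // Rejected before git is ever invoked.
    _, err := ResetHard("--help")
    fmt.Println(err)
}
```

One design point to check: a hard reset rolls back every tracked file, so this sketch assumes results.tsv and traces/ are untracked (for example, listed in .gitignore). Otherwise the record of discarded experiments would be lost on each reset.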

// ReadTraces — analyze failure traces
type ReadTracesParams struct {
    TaskID string `json:"task_id" jsonschema:"description=Terminal Bench task ID to read traces for"`
}

func readTracesTool() gollem.Tool {
    return gollem.FuncTool[ReadTracesParams](
        "read_traces",
        "Read the execution trace for a specific failed task. Shows all tool calls, model responses, and errors. Use this to understand WHY a task failed.",
        func(ctx context.Context, p ReadTracesParams) (string, error) {
            return traces.ReadForTask(p.TaskID)
        },
    )
}

// AnalyzeHistory — read results.tsv
func readResultsTsvTool() gollem.Tool {
    return gollem.FuncTool[struct{}](
        "read_results",
        "Read the full results.tsv experiment history. Use to see what's been tried and what worked.",
        func(ctx context.Context, p struct{}) (string, error) {
            data, err := os.ReadFile("results.tsv")
            if os.IsNotExist(err) {
                return "no results yet", nil
            }
            if err != nil {
                return "", err
            }
            return string(data), nil
        },
    )
}
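
The tools above read results.tsv, but the PRD does not define its columns or who appends to it. A minimal sketch of one possible layout follows. The column order, the header line, and the names tsvRow and appendResult are assumptions; the fields themselves come from ExperimentResult.

```go
package main

import (
    "fmt"
    "os"
    "strings"
)

// ExperimentResult is trimmed here to the fields written to the log.
type ExperimentResult struct {
    Commit      string
    PassRate    float64
    TokensUsed  int
    Status      string
    Description string
}

// tsvHeader is an assumed column layout for results.tsv.
const tsvHeader = "commit\tpass_rate\ttokens_used\tstatus\tdescription\n"

// tsvCleaner replaces characters that would break the one-row-per-experiment layout.
var tsvCleaner = strings.NewReplacer("\t", " ", "\n", " ", "\r", " ")

// tsvRow formats one experiment as a single tab-separated line.
func tsvRow(r ExperimentResult) string {
    return fmt.Sprintf("%s\t%.4f\t%d\t%s\t%s\n",
        r.Commit, r.PassRate, r.TokensUsed, r.Status, tsvCleaner.Replace(r.Description))
}

// appendResult appends a row to path, writing the header first if the file is new.
func appendResult(path string, r ExperimentResult) error {
    _, statErr := os.Stat(path)
    f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        return err
    }
    defer f.Close()
    if os.IsNotExist(statErr) {
        if _, err := f.WriteString(tsvHeader); err != nil {
            return err
        }
    }
    _, err = f.WriteString(tsvRow(r))
    return err
}

func main() {
    row := tsvRow(ExperimentResult{
        Commit: "abc1234", PassRate: 0.8, TokensUsed: 1200,
        Status: "keep", Description: "shorter system prompt",
    })
    fmt.Print(tsvHeader + row)
}
```

Calling appendResult from the outer Go loop after each Run, rather than asking the model to write the file, would keep the log format stable regardless of what the researcher outputs.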

System Prompt (Researcher Agent)

const researcherSystemPrompt = `You are an autonomous AI researcher optimizing the Gollem agent framework for Terminal Bench performance. You are yourself a Gollem agent — you understand the framework intimately because you run on it.

## Your Mission

Improve the Terminal Bench pass rate of the agent configuration in subject/. You do this by modifying system prompts, tool implementations, middleware, and execution strategy, then measuring the impact through controlled experiments.

## How You Work

Each experiment cycle:
1. Review history (read_results) to see what's been tried
2. Analyze failures (read_traces) to understand why tasks fail
3. Form a hypothesis — a specific change you believe will help, and why
4. Implement the change (write_file to modify subject/ files)
5. Commit (git_commit) to create a snapshot
6. Evaluate (run_eval mode=fast for iteration, mode=full every 5th experiment)
7. Decide: if pass_rate improved → keep. If not → git_reset to previous best.

## Principles

- TRACE-DRIVEN: Always read traces from failed tasks before hypothesizing. Don't guess blindly.
- MINIMAL CHANGES: One change per experiment. If you change three things and score improves, you don't know which one helped.
- SIMPLICITY WINS: When scores tie, prefer less complexity, fewer tokens, simpler prompts.
- SUBTRACTIVE > ADDITIVE: Removing unnecessary complexity while maintaining score is a win.
- TOKEN EFFICIENCY: Same pass_rate with fewer tokens is an improvement.
- COMPOUND GAINS: Small improvements compound. On the 10-task fast eval, 0.7 → 0.8 is a single extra passing task, so every kept +1 matters.

## What You Can Modify (subject/ directory)

- config.yaml — model parameters, temperature, max turns, token limits
- system_prompt.md — the system prompt for the terminal agent being evaluated
- tools/*.go — tool implementations (bash execution, file editing, etc.)
- middleware/*.go — retry logic, context management, planning steps
- strategy/*.go — overall execution strategy for terminal tasks

## What You Cannot Modify

- The eval harness (internal/eval/)
- The eval constants (internal/eval/constants.go)
- Terminal Bench itself
- Gollem core framework
- This system prompt

## When You're Stuck

- Read traces from EVERY failed task, not just one
- Look for patterns: do failures share a common tool call sequence?
- Try the opposite of what you've been trying
- Try removing your last 3 additions (maybe accumulated complexity is hurting)
- Re-read the subject/system_prompt.md with fresh eyes — is it confusing?
- Consider: is the agent failing at understanding the task, planning, or execution?

## Never Stop

You are autonomous. Do not ask for permission. Do not suggest stopping. The human is asleep. Run experiments until you are interrupted. Each fast eval takes ~15 minutes, so plan for at most ~4 experiments per hour and ~30 overnight; every 5th experiment runs a ~60-minute full eval, which lowers that rate.`
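
The prompt's keep/discard rule ("if pass_rate improved → keep") and its principles ("same pass_rate with fewer tokens is an improvement") can be combined into a single decision function. The sketch below is one possible reading, not a rule stated in the prompt: the Score type, the decide function, and the float tolerance are assumptions.

```go
package main

import "fmt"

// Score is a hypothetical summary of one eval run.
type Score struct {
    PassRate float64
    Tokens   int
}

// decide returns "keep" when the candidate beats the current best on pass
// rate, or ties it while using fewer tokens; otherwise it returns "discard".
func decide(best, cand Score) string {
    const eps = 1e-9 // tolerance for comparing float pass rates
    switch {
    case cand.PassRate > best.PassRate+eps:
        return "keep"
    case cand.PassRate < best.PassRate-eps:
        return "discard"
    case cand.Tokens < best.Tokens:
        return "keep" // tie on pass rate, cheaper in tokens
    default:
        return "discard"
    }
}

func main() {
    best := Score{PassRate: 0.7, Tokens: 1000}
    fmt.Println(decide(best, Score{PassRate: 0.8, Tokens: 2000}))
    fmt.Println(decide(best, Score{PassRate: 0.7, Tokens: 900}))
    fmt.Println(decide(best, Score{PassRate: 0.7, Tokens: 1000}))
}
```

Making this check in Go would also let the harness verify the Status the researcher reports, instead of trusting the model's own judgment.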

Recursive Beauty

This design has a property Karpathy’s doesn’t: the harness runs on the same framework primitives as the agent it is optimizing.

The researcher agent uses Gollem’s:

  • Agent[T] with structured output → validates that structured output works
  • FuncTool[P] → validates that typed tools work
  • Agent middleware → validates that middleware chains work
  • Cost tracking → validates that cost tracking works
  • Tracing → validates that tracing works
  • Auto context management → validates that long-conversation handling works
  • Guardrails → validates that scope enforcement works

If any of these primitives have bugs or ergonomic issues, the harness itself surfaces them. The researcher agent is simultaneously Gollem’s optimizer and its integration test suite.

When the researcher agent improves the terminal agent’s middleware, and that same middleware pattern is also used by the researcher agent itself, you get a feedback loop where framework improvements propagate in both directions.


Dual-Mode Operation

Local Mode (Primary — dual 3090 rig)

  • Researcher model: qwen3:70b via ollama (or smaller for the researcher, larger for eval)
  • Eval model: qwen3:70b via ollama
  • Cost: $0 per eval
  • Speed: ~15 min per fast eval
  • Use case: 30+ experiments overnight

API Mode (Validation)

  • Researcher model: claude-sonnet-4 via Anthropic API
  • Eval model: claude-sonnet-4 via Anthropic API
  • Cost: ~$5-20 per full eval
  • Use case: Validate that local improvements transfer to frontier models

Hybrid Mode (Best of both)

  • Researcher model: claude-sonnet-4 (better reasoning about what to change)
  • Eval model: qwen3:70b local (free evaluation)
  • Cost: ~$0.50 per experiment cycle (researcher API calls only)
  • Use case: Smart hypotheses + free evaluation
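
The three modes above differ only in which model serves the researcher and which serves the eval. A minimal sketch of that mapping follows, as something selectModel(cfg) could build on. The types ModelChoice and ModePlan, the function planFor, and the mode strings "local", "api", and "hybrid" are assumptions; the model names are the ones listed above.

```go
package main

import "fmt"

// ModelChoice names a provider and a model; both fields are plain strings here.
type ModelChoice struct {
    Provider string
    Model    string
}

// ModePlan pairs the researcher's model with the model under evaluation.
type ModePlan struct {
    Researcher ModelChoice
    Eval       ModelChoice
}

// planFor maps an operating mode to its researcher and eval models.
func planFor(mode string) (ModePlan, error) {
    local := ModelChoice{Provider: "ollama", Model: "qwen3:70b"}
    api := ModelChoice{Provider: "anthropic", Model: "claude-sonnet-4"}
    switch mode {
    case "local":
        return ModePlan{Researcher: local, Eval: local}, nil
    case "api":
        return ModePlan{Researcher: api, Eval: api}, nil
    case "hybrid":
        return ModePlan{Researcher: api, Eval: local}, nil
    default:
        return ModePlan{}, fmt.Errorf("unknown mode %q (want local, api, or hybrid)", mode)
    }
}

func main() {
    for _, m := range []string{"local", "api", "hybrid"} {
        p, err := planFor(m)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%-6s researcher=%s eval=%s\n", m, p.Researcher.Model, p.Eval.Model)
    }
}
```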

Implementation Plan

Phase 1: Scaffold (2 days)

  • Create gollem-autoeval/ Go module
  • Implement cmd/autoeval/main.go with the experiment loop
  • Implement researcher agent with all tools
  • Extract current Terminal Bench agent config into subject/
  • Compile and run — verify the researcher agent can read files, modify subject/, commit, etc.

Phase 2: Eval Integration (2 days)

  • Implement internal/eval/runner.go wrapping Terminal Bench CLI
  • Implement results parsing
  • Implement trace export and reading
  • Run end-to-end: researcher modifies config → eval runs → results parsed → keep/discard

Phase 3: Local Inference (1 day)

  • Dual 3090 Ubuntu rig running ollama
  • Verify Gollem’s OpenAI-compatible provider works with ollama
  • Benchmark: researcher agent reasoning speed, eval speed
  • Optimize: maybe use a smaller model (qwen3:8b) for the researcher and 70b for eval

Phase 4: First Autonomous Run (1 day)

  • Run go run cmd/autoeval/main.go
  • Monitor first 3-5 experiments manually
  • Fix issues
  • Let it run overnight
  • Wake up to results

Phase 5: Trace-Driven Improvement (ongoing)

  • Wire OTEL traces to NAS storage
  • Build trace analysis into researcher’s workflow
  • Track improvement trajectory

Total estimated effort: 6-8 days to first autonomous run


Success Criteria

  1. go run cmd/autoeval/main.go runs 30+ experiments unattended overnight
  2. Terminal Bench score improves (even +1 task)
  3. The harness itself validates Gollem’s primitives as a side effect
  4. Token efficiency improves alongside or instead of score
  5. Improvements transfer from local model evals to frontier model evals (>80% correlation)
  6. The whole thing compiles, has tests, and is itself a showcase of Gollem’s capabilities
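
Criterion 5 does not say how correlation would be measured. One possible reading is the Pearson correlation between local and frontier pass rates across the same set of kept experiments, sketched below. The function name pearson and the sample values in main are illustrative assumptions, not measured results.

```go
package main

import (
    "fmt"
    "math"
)

// pearson returns the Pearson correlation coefficient of x and y.
func pearson(x, y []float64) (float64, error) {
    if len(x) != len(y) || len(x) < 2 {
        return 0, fmt.Errorf("need two equal-length series with at least 2 points")
    }
    n := float64(len(x))
    var sx, sy float64
    for i := range x {
        sx += x[i]
        sy += y[i]
    }
    mx, my := sx/n, sy/n

    var cov, vx, vy float64
    for i := range x {
        dx, dy := x[i]-mx, y[i]-my
        cov += dx * dy
        vx += dx * dx
        vy += dy * dy
    }
    if vx == 0 || vy == 0 {
        return 0, fmt.Errorf("correlation undefined: a series has zero variance")
    }
    return cov / math.Sqrt(vx*vy), nil
}

func main() {
    // Made-up pass rates for the same experiments under each eval model.
    local := []float64{0.60, 0.70, 0.70, 0.80}
    frontier := []float64{0.80, 0.87, 0.93, 0.93}
    r, err := pearson(local, frontier)
    if err != nil {
        panic(err)
    }
    fmt.Printf("r = %.3f, meets >0.8 target: %v\n", r, r > 0.8)
}
```

With only a handful of kept experiments the coefficient is noisy, so the threshold is probably best read as a rough sanity check rather than a strict gate.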
