This document assumes you are either using the upstream repo, or you have already forked BC-Bench and completed the setup (runners, secrets, dataset).
An experiment compares agent performance under different configurations against the same dataset and the same category. Typical examples:
- Toggling custom instructions / skills / a custom agent
- Adding an MCP server (e.g. the AL MCP) and measuring impact
- Comparing models under the same setup
The dataset, category, evaluation pipeline, and result format stay constant. Only src/bcbench/agent/shared/config.yaml (and the files it references) change between experiments.
If you want to evaluate a different kind of output (e.g. code review instead of bug fix), that's a new category, not an experiment — see CATEGORIES.md.
All configurations live in config.yaml:
| Setting | Default | What it does |
|---|---|---|
instructions.enabled |
false |
Copy the entire instructions/<owner>-<repo>/ folder (instructions + skills + agents) into the target repo before running the agent |
skills.enabled |
false |
Copy only instructions/<owner>-<repo>/skills/ |
agents.enabled and agents.name |
false |
Copy only instructions/<owner>-<repo>/agents/ and pass --agent=<name> to the CLI |
mcp.servers |
(none) | List of MCP servers to register |
Note: instructions.enabled: true is a superset — you don't also need to enable skills or agents to get them. Use skills/agents when you want to isolate the effect of just that piece.
Files live under src/bcbench/agent/shared/instructions/<owner>-<repo>/. The folder name mirrors the dataset's repo path with / replaced by - (e.g. microsoft/BCApps -> microsoft-BCApps).
The files checked in today are placeholders. Replace them with whatever you want to test — your own AGENTS.md, your own skills, your own agent definitions — then toggle the corresponding flag in config.yaml.
instructions/
└── microsoft-BCApps/
├── AGENTS.md # renamed at runtime per agent
├── agents/
│ └── ALTest.agent.md
├── skills/
│ └── al-test-generation/
│ └── SKILL.md
└── instructions/
├── codeunits.instructions.md
└── ...At runtime we copy this folder into the target repo:
- Copilot:
<repo>/.github/(AGENTS.md->copilot-instructions.md) - Claude:
<repo>/.claude/(AGENTS.md->CLAUDE.md)
Articulate what you expect to see before triggering anything. A short hypothesis — "enabling custom instructions should improve resolution rate by ~X% because…" — makes it much easier to interpret results and decide whether a follow-up run is worth the cost.
Edit config.yaml, add any instruction/agent/skill files, and open a draft PR using the template below. The PR will not be merged, only serve as an entry point so people can see what exactly is being evaluated.
Before burning CI minutes, run one entry on your machine to confirm the config loads and the agent picks up your instructions/skills/agents:
uv run bcbench run copilot microsoft__BCApps-5633 --category bug-fix --repo-path /path/to/BCAppsThis only generates a patch (no build/test) and finishes in a couple of minutes.
Trigger the evaluation workflow from the Actions tab:
- Workflow:
Evaluation with GitHub CopilotorEvaluation with Claude Code test-run:true(default — runs 4 entries, ~10 min)model,category,al-mcp: as needed
This catches configuration mistakes cheaply. Do not skip it.
Once the test run passes, do one full-dataset run before committing to repeated runs:
test-run:falserepeat:1
Review the summary in the workflow log. If anything looks off (unexpected errors, scores far from prior baselines), investigate before spending more compute.
Agent runs are noisy, so a single number isn't trustworthy. For results you intend to publish or compare:
test-run:falserepeat:5(runs the full dataset 5 times sequentially)
Each run uploads artifacts and updates a leaderboard/<category>/<run_id> branch. Merge that branch to publish to the leaderboard.
- The
summarize-resultsjob prints per-run scores in the Actions log. - Download artifacts locally.
- For deeper analysis, see
notebooks/bug-fix/andnotebooks/test-generation/.
## Experiment Description
### Configuration Changes
- [ ] Custom instructions (`instructions.enabled: true`)
- [ ] Skills (`skills.enabled: true`)
- [ ] Custom agents (`agents.enabled: true`, name: ___)
- [ ] MCP servers (list below)
- [ ] Other (describe)
### Agent & Model
- **Agent:**
- **Model:**
- **Category:** <!-- bug-fix | test-generation | ... -->
### Hypothesis / Expected Outcome
## Notes