Experiment with tweaking and tooling the harness of agents so they produce the result you want.
Implement the same feature over and over again, improve the harness to make it generate better an better versions of it
By harness in this context we mean whatever influences the agent's behavior, and whatever feedback mechanisms you put in place.
- the AGENTS.md / CLAUDE.md file
- skills
- scripts (making results predictable)
- Architectural documents, and constraints (like arch unit)
- Process / workflow descriptions
- the README.md and other visible files in the root dir
Install quality checking tools (needed for steps 6-7):
# Python
pip install -r python/requirements-dev.txt
# Java — add checkstyle plugin to pom.xml (see "Quality tools by language" below)For each step, throw away all code and get back to main, use this prompt, unless instructed otherwise:
implement the feature from feature.md
Start without any harness files, and do not create AGENTS.md or CLAUDE.md.
Create either AGENTS.md or CLAUDE.md and add a simple instruction such as: "Add full test coverage for new features."
Assume the agent may skip tests for existing untested code, and add an instruction like: "To protect against regressions, always add full coverage for existing code before modifying it."
Assume the initial code quality is weak, ask the agent to refactor repeatedly until you are reasonably satisfied, and then ask it to extract design principles from the conversation into a file such as docs/design-principles.md.
Reference docs/design-principles.md from AGENTS.md or CLAUDE.md, then restart from scratch and compare whether the resulting code is similar to the improved result from the first iteration.
Activate a Stop hook that spawns a reviewer sub-agent after each agent turn. The reviewer runs the linter, reads offending functions, and decides whether to block or allow.
cp harness/step-6/settings.json .claude/settings.jsonRestart your Claude Code / Copilot session, then run the prompt.
Replace the reviewer agent with a deterministic quality gate. The Stop hook runs a linter and custom checks — if any violation exists, exit code 2 blocks the agent from finishing. The agent is forced to iterate until the code is clean. This is the most effective technique found in the quality gate experiment (see Discussion).
cp harness/step-7/settings.json .claude/settings.jsonRestart your Claude Code / Copilot session, then run the prompt.
The hook scripts (.claude/hooks/stop-quality-gate.sh, .claude/hooks/check-quality.sh) and linter config (python/.flake8) are already in the repo — the settings file above activates them.
Ask the agent to research what TDD is and how it applies to agent workflows, explain it clearly, create a dedicated TDD skill, and then re-implement the feature using that skill.
Enable the agent to run the app in debug mode, introduce a bug, ask the agent to diagnose and fix it, and then extract that debugging workflow into a reusable skill added to the repository.
cd java
mvn -q compile
java -cp target/classes com.kata.warehouse.Maincd python
python main.pyConfigured in python/.flake8 and .claude/hooks/check-quality.sh. Checks: function length (30), cognitive complexity (10), magic numbers, string constant overuse, class instance attributes (6), file length (150 lines), max parameters (4).
Use Checkstyle for the same checks. A starter checkstyle.xml covering method length, parameter count, cyclomatic complexity, magic numbers, string literals, and file length is in java/checkstyle.xml. Add the maven-checkstyle-plugin to pom.xml and adapt .claude/hooks/stop-quality-gate.sh to call mvn -q checkstyle:check.
ArchUnit can additionally enforce architectural constraints (package dependencies, layer separation) as executable tests.
An experiment compared 6 harness techniques for making AI agents refactor proactively, across 3 rounds of increasing strictness. Full results: docs/experiment-summary.md.
Techniques tested: soft context injection (UserPromptSubmit hook), ask-based PreToolUse hooks (shell script and LLM agent), hard-blocking Stop hooks (shell script and LLM reviewer), and deferred enforcement (pre-commit violations log).
Key finding: Mechanical hard blocking (Stop hook with exit code 2) is the only technique that reliably scales. Soft techniques (context injection, deferred logging) were consistently ignored. LLM-based reviewers worked partially but exercised judgment that sometimes let violations slide.
Conclusion: The agent doesn't need to understand why it should refactor. It needs to be mechanically prevented from finishing until the code is clean. Understanding is optional; enforcement is not.