Co-evolutionary test escalation: harder tests when harness games easy subset

## Problem
A harness can score 100% on the eval subset yet only 60% on the full benchmark. Because the proposer sees only the subset results, it concludes the harness is perfect and has no signal that the harness is gaming the easy subset.

## Proposed Solution
After each evaluation, check whether the harness scores well on the subset while the surrogate verifier still identifies edge cases. If both hold, auto-escalate: select harder problems for the next iteration's eval set, and have the verifier generate more diverse and challenging test assertions.

From EvoSkills: "when the surrogate tests pass but the ground-truth oracle fails, the verifier escalates its tests."
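
The escalation rule could be sketched roughly as follows. This is a minimal illustration, not an existing implementation: the `Problem` type, its `difficulty` estimate, the `flagged_ids` set (IDs of problems where the verifier found edge cases), and the `pass_threshold` default are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    id: str
    difficulty: float  # hypothetical difficulty estimate in [0, 1]


def should_escalate(subset_pass_rate, flagged_ids, pass_threshold=0.95):
    """Escalate only when the subset looks solved AND the verifier found edge cases."""
    return subset_pass_rate >= pass_threshold and len(flagged_ids) > 0


def next_eval_set(pool, current, subset_pass_rate, flagged_ids, size):
    """Pick the next iteration's eval set (assumed policy, for illustration).

    Without escalation the current set is kept. With escalation, problems the
    verifier flagged come first, then the hardest remaining problems in the pool.
    """
    if not should_escalate(subset_pass_rate, flagged_ids):
        return list(current)
    by_difficulty = sorted(pool, key=lambda p: p.difficulty, reverse=True)
    flagged = [p for p in by_difficulty if p.id in flagged_ids]
    rest = [p for p in by_difficulty if p.id not in flagged_ids]
    return (flagged + rest)[:size]


# Usage: the harness passes every problem in an easy subset,
# but the verifier flagged an edge case on p3.
pool = [
    Problem("p1", 0.1),
    Problem("p2", 0.2),
    Problem("p3", 0.5),
    Problem("p4", 0.9),
    Problem("p5", 0.7),
]
current = pool[:2]
escalated = next_eval_set(pool, current, 1.0, {"p3"}, size=2)
print([p.id for p in escalated])
```

Ordering flagged problems first keeps the verifier's findings in the eval set even when `size` is small; a real policy might instead sample across difficulty bands to avoid overfitting to the hardest cases.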

Reported by AI agent

Co-evolutionary test escalation: harder tests when harness games easy subset #32
