Co-evolutionary test escalation: harder tests when harness games easy subset #32

@canivel

Problem

A harness can score 100% on the eval subset but only 60% on the full benchmark. Because the proposer sees only the subset results, it concludes the harness is perfect and has no signal to keep improving it.

Proposed Solution

After evaluation, if the harness scores well on the subset but the surrogate verifier still identifies edge cases, auto-escalate in two ways: select harder problems for the next iteration's eval set, and have the verifier generate more diverse and challenging test assertions.

From EvoSkills: "when the surrogate tests pass but the ground-truth oracle fails, the verifier escalates its tests."
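A minimal sketch of the escalation step, to make the trigger concrete. Everything named here is an assumption for illustration and not an existing API in this repository: the `Problem` type, its `difficulty` estimate, the `score_threshold` of 0.95, and the `hard_fraction` split are all hypothetical. The idea is that a saturated subset score combined with at least one verifier-flagged edge case swaps part of the eval set for the hardest unseen problems, while keeping some current problems to catch regressions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    id: str
    difficulty: float  # hypothetical difficulty estimate in [0, 1]


def should_escalate(subset_score, verifier_flags, score_threshold=0.95):
    """Escalate only when the subset looks saturated AND the verifier
    still reports edge cases (both conditions are assumptions)."""
    return subset_score >= score_threshold and len(verifier_flags) > 0


def next_eval_set(pool, current, subset_score, verifier_flags,
                  size, rng, hard_fraction=0.5):
    """Return the eval set for the next iteration."""
    if not should_escalate(subset_score, verifier_flags):
        return list(current)  # no escalation: keep the set unchanged

    current_ids = {p.id for p in current}
    # Candidate problems the harness has not been evaluated on yet,
    # hardest first.
    unseen = sorted(
        (p for p in pool if p.id not in current_ids),
        key=lambda p: p.difficulty,
        reverse=True,
    )
    n_hard = min(len(unseen), int(size * hard_fraction))
    hard = unseen[:n_hard]

    # Keep a random sample of current problems so regressions stay visible.
    n_keep = min(len(current), size - n_hard)
    keep = rng.sample(list(current), n_keep)
    return hard + keep
```

For example, with a pool of ten problems, an eval set made of the four easiest, a subset score of 1.0, and one flagged edge case, a call with `size=4` replaces two of the four problems with the two hardest unseen ones. Generating harder test assertions on the verifier side is left out of this sketch.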

Reported by AI agent

Labels: P1 (Important but not blocking), enhancement (New feature or request)
