Problem
A harness can score 100% on the eval subset yet only 60% on the full benchmark. Because the proposer sees only the subset results, it treats the harness as perfect.
Proposed Solution
After each evaluation, if the harness scores well on the subset but the surrogate verifier still identifies edge cases, auto-escalate: select harder problems for the next iteration's eval set, and have the verifier generate more diverse and challenging test assertions.
From EvoSkills: "when the surrogate tests pass but the ground-truth oracle fails, the verifier escalates its tests."
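A minimal sketch of how the escalation step could look, under stated assumptions: the `Problem` record, its `difficulty` score, the `edge_cases_found` count reported by the surrogate verifier, and the `score_threshold` and `swap_fraction` parameters are all hypothetical names chosen for illustration, not part of any existing harness or of EvoSkills. It covers only the problem-selection half of the proposal; generating harder test assertions is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; `difficulty` is an assumed score in [0, 1].
    problem_id: str
    difficulty: float


def should_escalate(subset_score: float, edge_cases_found: int,
                    score_threshold: float = 0.95) -> bool:
    """Escalate when the subset looks solved but the verifier still flags edge cases."""
    return subset_score >= score_threshold and edge_cases_found > 0


def next_eval_set(current: list[Problem], pool: list[Problem],
                  subset_score: float, edge_cases_found: int,
                  swap_fraction: float = 0.5) -> list[Problem]:
    """Return the eval set for the next iteration, the same size as `current`."""
    if not should_escalate(subset_score, edge_cases_found):
        return list(current)

    current_ids = {p.problem_id for p in current}
    # Candidates: problems not yet in the eval set, hardest first.
    unseen = sorted((p for p in pool if p.problem_id not in current_ids),
                    key=lambda p: p.difficulty, reverse=True)
    # Swap at most `swap_fraction` of the set, and at least one problem if possible.
    n_swap = min(len(unseen), len(current),
                 max(1, int(len(current) * swap_fraction)))
    # Keep the hardest current problems; the easiest ones are replaced.
    kept = sorted(current, key=lambda p: p.difficulty,
                  reverse=True)[:len(current) - n_swap]
    return kept + unseen[:n_swap]
```

Keeping part of the current set (rather than replacing all of it) is a design choice made here so that scores stay roughly comparable between iterations; a real implementation might prefer a different policy.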
Reported by AI agent