
Evals for Spotlight #986


Description

@BYK

We want Spotlight to be an indispensable tool for AI-assisted development. For this we need real-world scenarios with which we can test (and also improve and fine-tune) Spotlight's usefulness. These evals will exercise not only our MCP server but also our CLI, since that is another way agents can use Spotlight.

Here's the framework we have in mind:

  1. Define some real-world development tasks, such as implementing a new feature using Claude Code (or any other tool, such as cursor-agent).
  2. Invoke the AI assistant with the pre-defined prompt and expect it to use the Spotlight CLI or MCP (this itself needs to be verified).
  3. When the assistant is finished, check its work and pass the test if the feature is implemented correctly.
  4. Crucially: do not include or reveal this final check in the prompt, as it may guide the AI assistant, which we don't want (unless the scenario itself is a version of TDD).

We should be able to run these evals locally and on CI, either continuously or on a schedule. A rough harness sketch follows.
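Here is a minimal sketch of what such a harness could look like. It assumes a headless agent invocation (`claude -p` as one example; any agent CLI would work) and a hypothetical `Scenario` shape with a hidden `verify` check; none of these names come from an existing Spotlight API.

```ts
// Hypothetical eval harness sketch; runScenario and Scenario are placeholders, not an existing API.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

interface Scenario {
  name: string;
  prompt: string;                 // pre-defined task prompt handed to the agent
  cwd: string;                    // checkout of the fixture repo the agent works in
  verify: () => Promise<boolean>; // hidden final check, never revealed to the agent
}

async function runScenario(scenario: Scenario) {
  // 1. Invoke the AI assistant headlessly with the pre-defined prompt.
  //    "claude -p" is one possible invocation; swap in cursor-agent etc. as needed.
  const { stdout } = await exec("claude", ["-p", scenario.prompt], {
    cwd: scenario.cwd,
    timeout: 15 * 60 * 1000,
  });

  // 2. Check whether the agent actually reached for Spotlight (CLI or MCP).
  //    Grepping the transcript is a crude stand-in; a real harness would inspect tool-call logs.
  const usedSpotlight = /spotlight/i.test(stdout);

  // 3. Run the hidden check against the resulting working tree.
  const featureWorks = await scenario.verify();

  return { name: scenario.name, usedSpotlight, featureWorks };
}
```

The same entry point can be driven by a local script or a scheduled CI job, so the scenarios stay identical across both environments.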

Available tools:
