
Evals for Spotlight #986


Description

@BYK

We want Spotlight to be an indispensable tool for AI-assisted development. For this we need real-world scenarios with which we can test (and also improve and fine-tune) Spotlight's usefulness. These evals will exercise not only our MCP server but also our CLI, since that is another way agents can use Spotlight.

Here's the framework we have in mind:

  1. Define some real-world development tasks, such as implementing a new feature using Claude Code (or any other tool, such as cursor-agent).
  2. Invoke the AI assistant with the pre-defined prompt and expect it to use the Spotlight CLI or MCP (this itself needs to be verified).
  3. When the assistant is finished, check its work and pass the test if the feature is implemented correctly.
  4. Crucially: do not include or reveal this final check in the prompt, as it may guide the AI assistant, which we don't want (unless the scenario itself is a version of TDD).

We should be able to run these evals locally and on CI, either continuously or on a schedule. A rough harness sketch follows.
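Here is a minimal sketch of what such a harness could look like. It assumes a headless agent invocation (`claude -p` as one example; any agent CLI would work) and a hypothetical `Scenario` shape with a hidden `verify` check; none of these names come from an existing Spotlight API.

```ts
// Hypothetical eval harness sketch; runScenario and Scenario are placeholders, not an existing API.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

interface Scenario {
  name: string;
  prompt: string;                 // pre-defined task prompt handed to the agent
  cwd: string;                    // checkout of the fixture repo the agent works in
  verify: () => Promise<boolean>; // hidden final check, never revealed to the agent
}

async function runScenario(scenario: Scenario) {
  // 1. Invoke the AI assistant headlessly with the pre-defined prompt.
  //    "claude -p" is one possible invocation; swap in cursor-agent etc. as needed.
  const { stdout } = await exec("claude", ["-p", scenario.prompt], {
    cwd: scenario.cwd,
    timeout: 15 * 60 * 1000,
  });

  // 2. Check whether the agent actually reached for Spotlight (CLI or MCP).
  //    Grepping the transcript is a crude stand-in; a real harness would inspect tool-call logs.
  const usedSpotlight = /spotlight/i.test(stdout);

  // 3. Run the hidden check against the resulting working tree.
  const featureWorks = await scenario.verify();

  return { name: scenario.name, usedSpotlight, featureWorks };
}
```

The same entry point can be driven by a local script or a scheduled CI job, so the scenarios stay identical across both environments.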

Available tools:
