We want Spotlight to be an indispensable tool for AI-assisted development. For this we need real-world scenarios against which we can test (and also improve/fine-tune) Spotlight's usefulness. These evals will exercise not only our MCPs but also our CLI, since that is another way agents can use Spotlight.
Here's the framework we have in mind:
- Have some real-world development tasks, such as implementing a new feature using Claude Code (or any other tool such as `cursor-agent`, etc.).
- Invoke the AI assistant with the pre-defined prompt and expect it to use the Spotlight CLI or MCP (this needs to be tested).
- When the assistant is finished, check its work and pass the test if the feature is implemented correctly.
- Crucially: do not include or reveal this final check, as it may guide the AI assistant, which we don't want in the scenario (unless the scenario itself is a version of TDD).
We should be able to run these locally and on CI, either continuously or on some schedule. A minimal sketch of such a harness is below.
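
To make the framework concrete, here is a rough TypeScript sketch of what one eval run could look like. The scenario shape, the fixture path, the `claude -p` (print-mode) invocation, and the `npm test` acceptance check are all illustrative assumptions, not decided details.

```ts
// Sketch of an eval harness: run an agent on a pre-defined prompt inside a
// fixture workspace, then verify its work with a hidden check it never sees.
import { execFileSync } from "node:child_process";

interface Scenario {
  name: string;
  workspace: string; // path to a fixture repo the agent works in
  prompt: string;    // the only text the agent is given
  // Hidden acceptance check: kept out of the prompt so it cannot guide the agent.
  verify: { cmd: string; args: string[] };
}

// Illustrative scenario; the fixture path, prompt, and commands are placeholders.
const scenarios: Scenario[] = [
  {
    name: "add-health-endpoint",
    workspace: "./fixtures/express-app",
    prompt:
      "Add a GET /health endpoint that returns 200, and use Spotlight to debug any errors you run into.",
    verify: { cmd: "npm", args: ["test", "--silent"] },
  },
];

function runScenario(s: Scenario): boolean {
  try {
    // 1. Invoke the coding agent non-interactively with only the pre-defined prompt.
    //    `claude -p` is Claude Code's print mode; swap in `cursor-agent` or another tool.
    execFileSync("claude", ["-p", s.prompt], {
      cwd: s.workspace,
      stdio: "inherit",
      timeout: 15 * 60 * 1000, // give the agent up to 15 minutes
    });

    // 2. Only after the agent is done, run the hidden check against its work.
    execFileSync(s.verify.cmd, s.verify.args, { cwd: s.workspace, stdio: "inherit" });
    return true;
  } catch {
    return false; // agent error, timeout, or failing check all count as a failed eval
  }
}

let failures = 0;
for (const s of scenarios) {
  const ok = runScenario(s);
  console.log(`${ok ? "PASS" : "FAIL"} ${s.name}`);
  if (!ok) failures++;
}
// Non-zero exit code lets CI (per-PR or scheduled) flag the run.
process.exit(failures > 0 ? 1 : 0);
```

Checking whether the assistant actually reached for the Spotlight CLI or MCP would be a separate step (e.g. inspecting the agent's transcript), which still needs to be figured out as noted above.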
Available tools: