Coming soon: unit, behavioural and regression tests for your prompts and skills :P
You’ll have:
* Claude model version
* Claude Code prompts and tools
* Your own prompts and skills and whatnot
* Your repository’s source code (= the input)
All of those change constantly; it’s not like it’s some kind of fixed SWE benchmark.