by bisonbear
71 days ago
a bit heavier-weight, but seems worthwhile if you're working in an org where many people consume the skill:
- find N tasks from your repo that are a good representation of what you want the agent to do with the skill
- run agent with old skill/new skill against those tasks
- measure test pass rate / other quality metrics that you care about with the skill
  - token usage, speed, alignment, ...
  - tests alone aren't a great measure - I've found them to be almost bimodal (most models either pass or fail) and not a good differentiator
- use this to make decisions about what to do with the skill - keep skill A, promote skill B, or keep tweaking

I've also had success with an "autoresearch" variant of this, where I have my agent run these tests in a loop and optimize for the scores I'm grading o
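
For illustration, the old-skill/new-skill comparison described above could be sketched roughly like this. Everything named here is hypothetical: `RunResult`, `compare_skills`, `decide`, the `fake_run_agent` stub (which stands in for however you actually invoke your agent), and the 1.2x token budget are assumptions for the sketch, not part of any real tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class RunResult:
    """Outcome of one agent run on one task (hypothetical shape)."""
    passed: bool     # did the task's tests pass?
    tokens: int      # tokens consumed by the run
    seconds: float   # wall-clock time of the run


def summarize(results: List[RunResult]) -> Dict[str, float]:
    """Aggregate per-task results into the metrics being compared."""
    return {
        "pass_rate": mean([1.0 if r.passed else 0.0 for r in results]),
        "avg_tokens": mean([r.tokens for r in results]),
        "avg_seconds": mean([r.seconds for r in results]),
    }


def compare_skills(
    tasks: List[str],
    run_agent: Callable[[str, str], RunResult],
    old_skill: str,
    new_skill: str,
) -> Dict[str, Dict[str, float]]:
    """Run every task once per skill and summarize each skill's results."""
    return {
        skill: summarize([run_agent(task, skill) for task in tasks])
        for skill in (old_skill, new_skill)
    }


def decide(
    report: Dict[str, Dict[str, float]],
    old_skill: str,
    new_skill: str,
    max_token_ratio: float = 1.2,  # assumed budget: new skill may cost up to 20% more tokens
) -> str:
    """Turn the report into one of: keep old, promote new, keep tweaking."""
    old, new = report[old_skill], report[new_skill]
    if new["pass_rate"] < old["pass_rate"]:
        return f"keep {old_skill}"
    within_budget = new["avg_tokens"] <= old["avg_tokens"] * max_token_ratio
    if new["pass_rate"] > old["pass_rate"] and within_budget:
        return f"promote {new_skill}"
    return "keep tweaking"


def fake_run_agent(task: str, skill: str) -> RunResult:
    """Deterministic stand-in for a real agent run; replace with your own runner."""
    if skill == "skill_b":
        return RunResult(passed=True, tokens=900, seconds=30.0)
    return RunResult(passed=(task != "task-3"), tokens=1000, seconds=40.0)


if __name__ == "__main__":
    tasks = ["task-1", "task-2", "task-3", "task-4"]
    report = compare_skills(tasks, fake_run_agent, "skill_a", "skill_b")
    for skill, metrics in report.items():
        print(skill, metrics)
    print(decide(report, "skill_a", "skill_b"))
```

Since pass rate tends to be close to bimodal, the sketch tracks token usage and time alongside it; in practice you'd probably also want several runs per task and whatever alignment or quality score you grade on.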