by shanev
216 days ago
This is solvable at the level of an individual developer. Write your own benchmark from code problems that you've already solved. Verify that the tests pass and that the model meets your targets for metrics like tok/s and TTFT (time to first token). Create a harness that works with API keys or with local models (if you're going that route).
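A rough sketch of what such a harness could look like, in Python. Everything here is illustrative: `run_case`, `meets_targets`, `fake_model`, and `check_add` are made-up names, and the model is stood in for by any callable that takes a prompt and yields text chunks, so you would wrap your API client or local runtime in a function of that shape. It also assumes each streamed chunk is roughly one token, which is only an approximation.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class RunResult:
    passed: bool      # did the output satisfy the test for this problem?
    ttft_s: float     # seconds from request start to first streamed chunk
    tok_per_s: float  # chunks per second over the whole request
    output: str


def run_case(
    stream_fn: Callable[[str], Iterable[str]],
    prompt: str,
    check: Callable[[str], bool],
    clock: Callable[[], float] = time.perf_counter,
) -> RunResult:
    """Stream one completion, time it, and run the caller's check on it."""
    start = clock()
    first = None
    chunks = []
    for chunk in stream_fn(prompt):
        if first is None:
            first = clock()  # timestamp of the first chunk, for TTFT
        chunks.append(chunk)
    end = clock()

    total = end - start
    ttft = (first - start) if first is not None else total
    # Assumption: one chunk is about one token. Swap in a real tokenizer if needed.
    rate = len(chunks) / total if total > 0 else 0.0
    output = "".join(chunks)
    return RunResult(check(output), ttft, rate, output)


def meets_targets(r: RunResult, max_ttft_s: float, min_tok_per_s: float) -> bool:
    """Pass only if the test passed and both latency targets are met."""
    return r.passed and r.ttft_s <= max_ttft_s and r.tok_per_s >= min_tok_per_s


def fake_model(prompt: str) -> Iterable[str]:
    """Hypothetical stand-in for a streaming model call."""
    yield from ["def add(a, b):", "\n", "    return", " a + b\n"]


def check_add(output: str) -> bool:
    """Example test: run the generated code and call the function it defines.
    Running model output with exec is unsafe outside a sandbox."""
    namespace: dict = {}
    try:
        exec(output, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


if __name__ == "__main__":
    result = run_case(fake_model, "Write add(a, b).", check_add)
    print(result.passed, meets_targets(result, max_ttft_s=2.0, min_tok_per_s=1.0))
```

The `clock` parameter is there only so timing can be faked in tests; in normal use the default `time.perf_counter` is fine.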
|
Configure Aider or Claude Code to use the new model and try to do some work. The benchmark is pass/fail: if after a little while I feel the performance is better than the last model I was using, it's a pass; otherwise it's a fail and I go back.
Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.