by hamdingers
219 days ago
At the developer level all my LLM use is in the context of agentic wrappers, so my benchmark is fairly trivial: configure aider or Claude Code to use the new model and try to do some work. The benchmark is pass/fail: if, after a little while, I feel the performance is better than the last model I was using, it's a pass; otherwise it's a fail and I go back. Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.

I'd really caution against this approach, mainly because humans suck at setting aside emotions and other "human" factors when judging how well something works, but also because comparing across models gets a lot easier when you can see a score like 77/100 vs 91/100, measured on the tasks you actually use LLMs for. Just don't share this benchmark publicly once you're using it for measurements.
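
To make the "77/100 vs 91/100" idea concrete, here is a rough sketch of what a tiny private scoring harness could look like. Everything in it is an assumption for illustration: `Task`, `score`, `old_model`, `new_model`, and the two toy tasks are hypothetical names, and the "models" are stub functions standing in for whatever client or agent wrapper you actually call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """One private benchmark item: a prompt plus an automatic pass/fail check."""
    prompt: str
    check: Callable[[str], bool]  # True if the model's output is acceptable


def score(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Return the percentage (0-100) of tasks whose output passes its check."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if task.check(model(task.prompt)))
    return 100.0 * passed / len(tasks)


# Hypothetical stand-ins for real models; replace with calls to your own setup.
def old_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unsure"


def new_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "Paris"


# Toy tasks; in practice these would be tasks you really use LLMs for.
TASKS = [
    Task("What is 2 + 2? Answer with a number only.",
         lambda out: out.strip() == "4"),
    Task("What is the capital of France? Answer with one word.",
         lambda out: out.strip().lower() == "paris"),
]

if __name__ == "__main__":
    for name, model in [("old", old_model), ("new", new_model)]:
        print(f"{name}: {score(model, TASKS):.0f}/100")
```

The checks here are plain functions, so the score does not depend on how you feel that day; with real tasks you would likely need more items and looser checks than exact string matches.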