


	by shanev 264 days ago
	This is solvable at the level of an individual developer. Write your own benchmark from code problems that you've already solved. Verify that the tests pass and that the model meets your targets for metrics like tokens per second (tok/s) and time to first token (TTFT). Create a harness that works with API keys or local models (if you're going that route).
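
A minimal sketch of what such a harness could look like, assuming the model is exposed as a function that streams text chunks. `Task`, `Result`, `run_task`, `fake_model`, and `check_add` are all hypothetical names for illustration, and `fake_model` is a stand-in you would replace with a call to an API or a local model. Chunk counts are used as a rough proxy for tokens here.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the model output solves the task


@dataclass
class Result:
    name: str
    passed: bool
    ttft_s: float     # time to first chunk, in seconds
    tok_per_s: float  # chunks per second (approximation of tok/s)


def run_task(task: Task, stream: Callable[[str], Iterator[str]]) -> Result:
    """Run one task against a streaming model and record pass/fail and timing."""
    start = time.perf_counter()
    first = None
    chunks = []
    for chunk in stream(task.prompt):
        if first is None:
            first = time.perf_counter()
        chunks.append(chunk)
    end = time.perf_counter()

    ttft = (first - start) if first is not None else float("inf")
    elapsed = end - start
    rate = len(chunks) / elapsed if elapsed > 0 else 0.0
    return Result(task.name, task.check("".join(chunks)), ttft, rate)


def fake_model(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a real model; yields a fixed answer in chunks."""
    for chunk in ["def add(a, b):", "\n", "    return a + b", "\n"]:
        yield chunk


def check_add(output: str) -> bool:
    """Test for the task: the output must define a working add()."""
    namespace = {}
    try:
        # Caution: exec on model output is unsafe outside a sandbox.
        exec(output, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


task = Task("add", "Write a Python function add(a, b).", check_add)
result = run_task(task, fake_model)
print(result)
```

Swapping `fake_model` for a real streaming client keeps the rest unchanged, so the same tasks can be rerun against each new model.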

5 comments

hamdingers 263 days ago

At the developer level all my LLM use is in the context of agentic wrappers, so my benchmark is fairly trivial:

Configure aider or Claude Code to use the new model, then try to do some work. The benchmark is pass/fail: if after a little while I feel the performance is better than the last model I was using, it's a pass; otherwise it's a fail and I go back.

Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.

embedding-shape 263 days ago

> Building your own evaluations makes sense if you're serving an LLM up to customers and want to know how it performs, but if you are the user... use it and see how it goes. It's all subjective anyway.

I'd really caution against this approach, mainly because humans suck at removing emotions and other "human" factors when judging how well something works, but also because comparing across models gets a lot easier when you can see 77/100 vs 91/100 as a percentage score, over your own tasks that you actually use the LLMs for. Just don't share this benchmark publicly once you're using it for measurements.
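
As a rough illustration of that kind of comparison, here is a small sketch that turns per-task pass/fail outcomes into a percentage per model and ranks them. The function names and the model names (`model_a`, `model_b`) are made up, and the outcome lists are synthetic, built only to reproduce the 77/100 vs 91/100 example.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks passed, from a list of pass/fail outcomes."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def compare(runs: dict[str, list[bool]]) -> list[tuple[str, float]]:
    """Rank models by pass rate, best first."""
    scored = [(model, pass_rate(outcomes)) for model, outcomes in runs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Synthetic outcomes for two hypothetical models over the same 100 tasks.
runs = {
    "model_a": [True] * 77 + [False] * 23,
    "model_b": [True] * 91 + [False] * 9,
}
ranking = compare(runs)
for model, score in ranking:
    print(f"{model}: {score:.0f}/100")
```

The point of keeping the task set fixed is that the numbers stay comparable from one model to the next.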

hamdingers 263 days ago

So what? I'm the one that's using it, I happen to be a human, my human factor is the only one that matters.

At this point anyone using these LLMs every day has seen those benchmark numbers go up without an appreciable improvement in the day-to-day experience.

embedding-shape 263 days ago

> So what? I'm the one that's using it, I happen to be a human, my human factor is the only one that matters.

Yeah, no, you're right: if consistency isn't important to you as a human, then it doesn't matter. Personally, I don't trust my "humanness", and correctness is the most important thing for me when working with LLMs, so that's what my benchmarks focus on.

> At this point anyone using these LLMs every day has seen those benchmark numbers go up without an appreciable improvement in the day-to-day experience.

Yes, this is exactly my point. The benchmarks published by the makers of these LLMs always seem to show better and better scores, yet the top scores in my own benchmarks have been more or less the same for the last 1.5 years, and I'm trying every LLM I can come across. These "best LLM to date!" releases are hardly ever actually the "best available LLM", and while you could make that judgement by just playing around with LLMs, being able to point to specifically why that is, is something I at least find useful. YMMV.

cactusplant7374 264 days ago

I think that's what this site is doing: https://aistupidlevel.info/

motoboi 264 days ago

Well, OpenAI's GitHub is open to contributed evaluations. Just add yours there and it's guaranteed that the next model will perform better on them.

j45 263 days ago

We have to keep in mind that "solving" might mean having the LLM recognize the pattern of solving something.

davedx 263 days ago

That’s called evals, and yes, any serious AI project uses them.
