by sparacha 472 days ago
Leaderboards are getting harder and harder to use as a decision tool. What does it mean to be 0.7% or 1.6% better? How does that help me? Is higher always better? What are the trade-offs? Evals continue to be the hardest and most important part of LLMs and the tools that use them.
1 comment

Totally agreed. They're only good as inputs to a stack decision, not the deciding factor. WebVoyager itself feels like it's approaching saturation, though, as scores are getting high and it only tests a narrow domain of use cases. We'll definitely see more challenging and interesting evals pop up in the next little bit.

IMHO, if you're building AI products, most of the time building and running your own evals is the only right way to build something good.

BTW - Arch looks super cool! Just starred and looking forward to playing around with it :)