by sparacha
472 days ago
Leaderboards are getting harder and harder to use as a decision tool. What does it mean to be 0.7% or 1.6% better? How does that help me? Is higher always better? What are the trade-offs? Evals continue to be the hardest and most important part of LLMs and the tools that use them.
IMHO, if you're building AI products, most of the time building and running your own evals is the only right way to build something good.
BTW - Arch looks super cool! Just starred and looking forward to playing around with it :)