Totally agreed. They're only good as inputs to a stack decision, not the deciding factor. WebVoyager itself feels like it's approaching saturation though, as scores are getting high and it only tests a narrow range of use cases. We'll definitely see more challenging and interesting evals pop up in the next little bit.

IMHO, if you're building AI products, most of the time building and running your own evals is the only right way to build something good.
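
To make "your own evals" a bit more concrete, here's a rough sketch of the smallest harness I'd start with: a list of cases, each with a pass/fail check, and a pass rate at the end. Everything here (`EvalCase`, `run_evals`, `toy_agent`, the example cases) is made up for illustration and not from any particular library; you'd swap `toy_agent` for a call into your actual product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One hypothetical eval case: an input plus a pass/fail check on the output."""
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the agent and return the pass rate (0.0 to 1.0)."""
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if case.check(agent(case.prompt)))
    return passed / len(cases)


# Stand-in agent for illustration only; replace with a call into your product.
def toy_agent(prompt: str) -> str:
    return prompt.upper()


# Made-up cases; real ones should come from your own users' tasks.
cases = [
    EvalCase("book a flight", lambda out: "FLIGHT" in out),
    EvalCase("find a hotel", lambda out: "HOTEL" in out),
    EvalCase("cancel order", lambda out: "refund" in out),
]

print(f"pass rate: {run_evals(toy_agent, cases):.2f}")
```

The point is less the code than the cases: they come from your own domain, so the score tells you something a public benchmark can't.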

BTW - Arch looks super cool! Just starred and looking forward to playing around with it :)