by hyperpape 11 days ago
We need a companion to "IN MICE", which is "IN EVALS".

I don't think this is bad research, but you have to understand how far it generalizes. I'm not saying that evals are useless; we need to do our best to produce good benchmarks. But benchmarks are always going to lag pretty far behind real-world applications.