by ohso4 457 days ago
Lmarena.ai is a very accurate eval (with style control enabled). Other benchmarks like AIME and the like can be trained on or optimized for, and therefore should not be trusted. Most AI companies do something fishy to boost their benchmark scores.