[author here] we interviewed the maintainer of that leaderboard if you want to hear from her directly! https://www.latent.space/p/benchmarks-201

tldr: the old benchmarks saturated, and the methodology was prone to a lot of subtle biases. as she mentions on the pod, they're already working on leaderboard v3.