by swyx
695 days ago
[author here] we interviewed the maintainer of that leaderboard, if you want to hear from her directly! https://www.latent.space/p/benchmarks-201 tldr: the old benchmarks had saturated, and the methodology was liable to a lot of subtle biases. as she mentions on the pod, they're already working on leaderboard v3.