by nabakin
73 days ago
It's easy to game, and human evaluation data has its trade-offs, but it's way easier to fake public benchmark results. I wish we had a source of high-quality private benchmark results across a vast number of models, like Lmarena. Having high-quality human evaluation data would be a plus too.
[0] https://oobabooga.github.io/benchmark.html