Hacker News
by yunyu, 1207 days ago
HellaSwag is also a deeply flawed benchmark; I wouldn't read too much into it:
https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this...