by yunyu 1207 days ago
HellaSwag is also a deeply flawed benchmark; I wouldn't read too much into it: https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this...