by nojs 1286 days ago
Here you go:

HellaSwag is a large language model (LLM) benchmark that is popular among researchers. However, it has been found to be an inaccurate and unhelpful measure of progress in LLM research. Researchers analysed the HellaSwag validation set and found errors in 36% of its rows. They also found that the "ActivityNet" rows were particularly problematic. Real-world human evaluation is important for making good launch decisions on LLMs.

(summarised by ChatGPT, naturally)