by 0x2c8 1262 days ago
> It should be generally better across a wide range of topics and has improved factuality.

Maybe I am being naive, but this does not tell us much: there is no metric attached to it, so the average user cannot (at least immediately) gauge how large the improvement is.

I think it would be beneficial to have benchmarks for assessing the factuality of large language models. Several questions remain open:

- What is the minimal (necessary) set of questions it must contain?

- Can any such set ever be sufficient?

- What is the appropriate metric to use?
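
To make the metric question concrete, here is a minimal sketch of one candidate, not a claim about what anyone actually uses: exact-match accuracy over a fixed question set, reported with a Wilson interval so that the size of the set shows up in the number. The function names and the toy questions are made up for illustration.

```python
import math


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are not counted as errors."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to the reference answer after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one question")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


def wilson_interval(accuracy, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion measured on n items."""
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical toy data: three questions, one wrong answer.
predictions = ["Paris", " canberra ", "1968"]
references = ["paris", "Canberra", "1969"]

acc = exact_match_accuracy(predictions, references)
low, high = wilson_interval(acc, len(references))
print(f"accuracy={acc:.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

With only three questions the interval is very wide, which is one way of seeing why the size of the question set matters. Exact match is also a crude choice: it penalizes correct answers that are phrased differently, so it only suits short, unambiguous answers.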

1 comment

It's hard to think of a benchmark for these LLMs that wouldn't be massively overfit (I don't mean via training, but via hyperparameter tweaking).
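
A rough simulation of what I mean, under made-up assumptions: every configuration has exactly the same true accuracy, each question is an independent coin flip, and we keep whichever configuration scores best on the public benchmark. All names and numbers here are hypothetical.

```python
import random


def simulate_selection_bias(n_configs=50, n_questions=200, true_accuracy=0.6, seed=0):
    """Pick the best of n_configs equally good configurations on a benchmark,
    then re-score the winner on fresh questions drawn the same way."""
    rng = random.Random(seed)

    def score():
        # One evaluation run: each question is answered correctly with probability true_accuracy.
        return sum(rng.random() < true_accuracy for _ in range(n_questions)) / n_questions

    benchmark_scores = [score() for _ in range(n_configs)]
    best = max(range(n_configs), key=benchmark_scores.__getitem__)

    # The winner is no better than the rest, so its fresh score is just another draw.
    fresh_score = score()
    return benchmark_scores[best], fresh_score


best_on_benchmark, on_fresh_questions = simulate_selection_bias()
print(f"best benchmark score: {best_on_benchmark:.3f}")
print(f"same config, fresh questions: {on_fresh_questions:.3f}")
```

The reported "best" score is the maximum of many noisy measurements of the same underlying accuracy, so it tends to sit above the true value even though nothing was trained on the benchmark. Real hyperparameter searches are not this clean, but the selection effect is the same in kind.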