HN Mirror



	by zhisbug 1219 days ago
	Then do you have a better way to rigorously evaluate chatbots in the presence of LLMs like ChatGPT, which are trained on almost all Internet data?

1 comments

cscurmudgeon 1219 days ago

> running it through a _single_ one of the many openly available language model benchmarks.


zhisbug 1219 days ago

But how could you guarantee that the pretrained model hasn't seen those benchmarks? And that the baselines you are comparing to (ChatGPT, Bard) haven't either? Those benchmark datasets are also collected from the Internet, right?


CGamesPlay 1219 days ago

As a baseline, don't! If it performs horribly on the test and it cheated, that's even worse than if it fails the test and didn't cheat. So the benchmark score gives you an upper bound on performance.
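
To make the upper-bound reading concrete, here is a minimal Python sketch. The function name `interpret_score` and the example threshold are hypothetical, and it assumes that having seen the benchmark can only raise a model's measured score, never lower it.

```python
def interpret_score(measured: float, threshold: float) -> str:
    """Read a benchmark score as an upper bound on uncontaminated performance."""
    if not (0.0 <= measured <= 1.0 and 0.0 <= threshold <= 1.0):
        raise ValueError("scores must be fractions in [0, 1]")
    # Assumption: contamination can only inflate the measured score,
    # so true performance <= measured.
    if measured < threshold:
        # Even the (possibly inflated) upper bound is below the bar.
        return "fail"
    # A pass might be genuine or might be inflated; the score alone cannot tell.
    return "inconclusive"


print(interpret_score(0.40, 0.70))  # fail
print(interpret_score(0.85, 0.70))  # inconclusive
```

Under that assumption a low score is informative whether or not the model saw the test data, while a high score still needs a separate contamination check.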
