| HN Mirror



	by crazyedgar 1139 days ago
	This is hugely misleading. If your bot just memorizes Shakespeare and outputs segments from memory, of course nobody can tell the difference. But as soon as you start interacting with it, the difference couldn't be more pronounced.


e63f67dd-065b 1139 days ago

The test was conducted as follows:

>With these two evaluation sets, we conducted a blind pairwise comparison by asking approximately 100 evaluators on Amazon Mechanical Turk platform to compare the quality of model outputs on these held-out sets of prompts. In the ratings interface, we present each rater with an input prompt and the output of two models. They are then asked to judge which output is better (or that they are equally good) using criteria related to response quality and correctness.

No, it's not just memorising Shakespeare; real humans interacted with the models and rated them.
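
For anyone who wants to see the shape of that protocol, here is a rough sketch of a blind pairwise tally. It is not the authors' code: the function names and the "A"/"B"/"tie" labels are made up for illustration, and shuffling the display order is my assumption about what "blind" would involve, since the quote doesn't say how it was done.

```python
import random
from collections import Counter


def blind_pair(output_a, output_b, rng):
    """Shuffle display order so a rater cannot infer the model from position.

    Returns (shown_outputs, hidden_labels). The hidden labels say which
    model produced each shown output. Hypothetical helper, not from the post.
    """
    if rng.random() < 0.5:
        return (output_a, output_b), ("A", "B")
    return (output_b, output_a), ("B", "A")


def tally_pairwise(ratings):
    """Turn a list of judgments ('A', 'B' or 'tie') into preference shares."""
    counts = Counter(ratings)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no ratings to tally")
    return {label: counts.get(label, 0) / total for label in ("A", "B", "tie")}


if __name__ == "__main__":
    rng = random.Random(0)
    shown, labels = blind_pair("answer from model A", "answer from model B", rng)
    print(shown, labels)
    print(tally_pairwise(["A", "A", "B", "tie"]))
```

Note that a tally like this only measures preference on the fixed, held-out prompts; it says nothing about prompts the raters invent themselves.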

crazyedgar 1139 days ago

That's not what I meant by interaction. The evaluators would have had to ask the models to do tasks that they came up with on their own. Otherwise there are just too many ways the information could have leaked.

OpenAI's model isn't immune to this either, so take any so-called evaluation metrics with a huge grain of salt. This also highlights the difficulty of properly evaluating LLMs: any metric, once set up, can become a memorization target for LLMs and lose its meaning.