by tshadley 1145 days ago
From the article:

"April 3, 2023 - Real Humans Can’t Tell the Difference Between a 13B Open Model and ChatGPT

Berkeley launches Koala, a dialogue model trained entirely using freely available data.

They take the crucial step of measuring real human preferences between their model and ChatGPT. While ChatGPT still holds a slight edge, more than 50% of the time users either prefer Koala or have no preference. Training Cost: $100."

2 comments

Exciting to see open source catching up, but there still seems to be a big difference in factuality and common sense between ChatGPT and Koala. E.g.:

Who built AlexNet on the ImageNet dataset?

ChatGPT: AlexNet was built by a team of researchers led by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto. They entered the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 and won by a large margin, revolutionizing the field of computer vision with their deep convolutional neural network architecture. The AlexNet architecture consisted of eight layers, including five convolutional layers, two fully connected layers, and a softmax output layer.

Koala: AlexNet was built by Google DeepMind, a London-based artificial intelligence laboratory. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual large-scale image and video dataset that was created for the ImageNet project by the ImageNet Large Scale Visual Recognition Challenge, which is an annual competition for image and video analysis.

This is hugely misleading. If your bot just memorizes Shakespeare and outputs memorized segments, of course nobody can tell the difference. But as soon as you start interacting with them, the difference couldn't be more pronounced.

The test was conducted as follows:

>With these two evaluation sets, we conducted a blind pairwise comparison by asking approximately 100 evaluators on Amazon Mechanical Turk platform to compare the quality of model outputs on these held-out sets of prompts. In the ratings interface, we present each rater with an input prompt and the output of two models. They are then asked to judge which output is better (or that they are equally good) using criteria related to response quality and correctness.

No, it's not just memorising Shakespeare; real humans interacted with the models and rated them.
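
For what it's worth, tallying that kind of blind pairwise rating is simple. A minimal sketch in Python, assuming each rating is stored as one of three hypothetical labels ("koala", "chatgpt", "tie"). The function names are mine and the sample ratings are made up for illustration; they are not the Koala study's data.

```python
from collections import Counter

LABELS = ("koala", "chatgpt", "tie")  # hypothetical label names


def tally_pairwise(ratings):
    """Return the share of each label among blind pairwise ratings."""
    if not ratings:
        raise ValueError("no ratings to tally")
    unknown = set(ratings) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(ratings)
    total = len(ratings)
    return {label: counts[label] / total for label in LABELS}


def prefer_or_tie_share(ratings, model="koala"):
    """Share of ratings where `model` was preferred or the rater had no preference."""
    shares = tally_pairwise(ratings)
    return shares[model] + shares["tie"]


# Made-up ratings, for illustration only: 3 koala, 4 chatgpt, 3 tie.
sample = ["koala"] * 3 + ["chatgpt"] * 4 + ["tie"] * 3

if __name__ == "__main__":
    print(tally_pairwise(sample))
    print(prefer_or_tie_share(sample, "koala"))
```

In this made-up sample the "prefer Koala or no preference" share is above 50% even though ChatGPT wins more head-to-head ratings, because ties count toward it. So that headline number and "ChatGPT still holds a slight edge" are not in conflict.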

That's not what I meant by interaction. The evaluators would have to ask the models to do tasks that they came up with on their own. Otherwise there are just too many ways that information could have leaked.

OpenAI's model isn't immune to this either, so take any so-called evaluation metric with a huge grain of salt. This also highlights the difficulty of properly evaluating LLMs: any metric, once set up, can become a memorization target for LLMs and lose its meaning.
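
One crude way to look for that kind of leakage is to check how many word n-grams of an evaluation prompt also appear verbatim in the training text. A rough sketch in Python, assuming you actually have the training text (often not the case for closed models). The function names are hypothetical, and this is a heuristic, not a description of how any particular lab checks for contamination.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in `text` (whitespace split, punctuation kept as-is)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(prompt, training_text, n=5):
    """Fraction of the prompt's n-grams that also occur verbatim in the training text."""
    prompt_grams = word_ngrams(prompt, n)
    if not prompt_grams:
        return 0.0  # prompt is shorter than n words
    train_grams = word_ngrams(training_text, n)
    return len(prompt_grams & train_grams) / len(prompt_grams)


if __name__ == "__main__":
    prompt = "who built alexnet on the imagenet dataset"
    leaked = "Who built AlexNet on the ImageNet dataset in 2012"
    print(overlap_fraction(prompt, leaked, n=3))
```

A high fraction only flags verbatim overlap. Paraphrased or translated leaks pass straight through a check like this, which is part of why held-out prompts are weaker evidence than tasks the evaluators invent themselves.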