| HN Mirror

	by notavalleyman 458 days ago
	No, they generally do not compete on accuracy benchmarks, afaik. I checked openai/simple-evals on GitHub, and as far as I can tell OpenAI does not compete on accuracy benchmarks. So I'd be interested in seeing what led you to think that, and also what led you to claim earlier that anyone typing in the complainant's name saw the same hallucination.

ziddoap 458 days ago

>No, they generally do not compete on accuracy benchmarks afaik.

"Get Answers" is literally at the top of ChatGPT's landing page. You think the average person interprets that to mean "Get inaccurate answers"?

Google "AI benchmark" and almost every result is an assessment of the accuracy of various models. What do you think they compete on? How do you think they measure the improvement from one model to the next?

Here's OpenAI's "Optimizing LLM Accuracy" https://platform.openai.com/docs/guides/optimizing-llm-accur...

Pop this in Google and see the pages of results about accuracy: site:openai.com "accuracy". To claim that they don't optimize for accuracy confirms to me that you are not discussing this in good faith. Perhaps you are just trying to be contrarian or something, I don't know.

>and also what led you to earlier claim that anyone typing in the complainant's name saw the same hallucination.

Well, it says right in the article that different people received the same result.

Why are the goalposts moving? Actually, never mind, I don't care to continue the conversation.

notavalleyman 458 days ago

I think if you take a few moments to read carefully, you'll see that AI companies, including OpenAI, are generally not competing on accuracy benchmarks.

For example, here are the benchmarks on which OpenAI seems to be trying to compete:

MMLU: Measuring Massive Multitask Language Understanding

MATH: Measuring Mathematical Problem Solving With the MATH Dataset

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

MGSM: Multilingual Grade School Math Benchmark, from "Language Models are Multilingual Chain-of-Thought Reasoners"

HumanEval: Evaluating Large Language Models Trained on Code

ziddoap 458 days ago

I don't know why I'm bothering. But notice how all of these explicitly mention accuracy? And how they are benchmarking the accuracy of the LLM against a known dataset? How accuracy is the primary metric they are evaluated on? Maybe it's because they are trying to improve the accuracy of the models...

First line of the abstract of MMLU: "We propose a new test to measure a text model's __multitask accuracy__."

Fourth line of the abstract of MATH: "To facilitate future research and __increase accuracy__ on MATH"

Second line of GPQA abstract: "We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach __65% accuracy__ [...] while highly skilled non-expert validators only reach __34% accuracy__"

Fifth line of the DROP abstract: "We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature on this dataset and show that the best systems only achieve 32.7% F1 on __our generalized accuracy metric__"

From the MGSM paper: "MGSM __accuracy__ with different model scales."

Models are designed to output accurate information in a reasonable amount of time. That's literally the whole goal. The entire thing. A math-specific model wants to provide accurate math answers. A general model wants to provide accurate answers to general questions. That's the whole point.
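
To make "accuracy against a known dataset" concrete, here is a minimal sketch in Python of how a score of this kind could be computed. The `accuracy` helper, the exact-match rule, and the four-question answer key are all illustrative assumptions, not the scoring code of MMLU, simple-evals, or any other real benchmark.

```python
def accuracy(predictions, references):
    """Return the fraction of predictions that match the reference answers.

    Hypothetical helper: uses case-insensitive exact match, which is an
    assumption for illustration. Real benchmarks define their own rules.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Made-up answer key and model outputs for a four-question multiple-choice set.
references = ["B", "A", "D", "C"]
predictions = ["B", "A", "C", "C"]

score = accuracy(predictions, references)
print(f"accuracy = {score:.2f}")
```

The model is scored only on whether its answers match the known-correct ones, so a higher number on a benchmark like this means more accurate answers on that dataset.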

notavalleyman 458 days ago

None of those relate to factual accuracy about a guy in Norway.

ziddoap 458 days ago

How much farther can you move the goalposts? We're already almost on another planet.

You ignored almost everything in my original comment and hyper-focused on accuracy. Then, when confronted with the fact that every single example benchmark you provided is a measure of accuracy, you now say "well, it's not a benchmark about a specific person in Norway". Obviously not!

The MATH benchmark doesn't ask "what is 2+2", either. Your argument is "well, math-focused models aren't expected to accurately answer 2+2 because it isn't in the MATH benchmark". It's ridiculous.
