

by refulgentis 1032 days ago

There weren't any serious examples of degradation.

Does only GPT-4 have to suffer a penalty for HumanEval leaking into training data/RLHF data?

Ignoring those concerns, it fails a reasonableness smell test:

We'd have to pretend it's still the original GPT-4 release from March 2023 until GPT-5 comes out, and only then could OpenAI's work be compared against LLAMA-2 through LLAMA-N.


rushingcreek 1032 days ago

There's a couple of things here:

1. I'm not saying we have to wait until GPT-5, we just need an apples-to-apples comparison where contamination is taken into account

2. GPT-4 does not seem to have improved on real-world coding tasks since March, so it's unclear where any purported HumanEval gains could've come from

3. I've personally noticed degradation anecdotally in the GPT-4 June update vs. the original March release

lhl 1031 days ago

> 2. GPT-4 does not seem to have improved on real-world coding tasks since March, so it's unclear where any purported HumanEval gains could've come from

Once Markdown formatting is accounted for, the June model's score on the Leetcode questions from the LLM Drift paper's tests improves to 70% (35/50), versus the March model's 52% (26/50).

see:

* https://github.com/lchen001/LLMDrift/blob/main/generation/

* https://twitter.com/Si_Boehm/status/1681801371656536068
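
For illustration, here is a minimal sketch of what "accounting for Markdown formatting" could look like, assuming the model's answer arrives wrapped in a fenced code block and the checker only accepts raw source. The helper names (`strip_markdown_fences`, `is_directly_executable`, `pass_rate`) and the sample response are hypothetical and are not taken from the LLM Drift repository.

```python
import re

# Matches one fenced block: opening fence, optional language tag, body, closing fence.
FENCE_RE = re.compile(r"```[a-zA-Z0-9_+-]*\n(.*?)```", re.DOTALL)


def strip_markdown_fences(response: str) -> str:
    """Return the body of the first fenced block, or the text unchanged if there is none."""
    match = FENCE_RE.search(response)
    return match.group(1) if match else response


def is_directly_executable(source: str) -> bool:
    """Rough check (an assumption, not the paper's harness): does the text compile as Python?"""
    try:
        compile(source, "<response>", "exec")
        return True
    except SyntaxError:
        return False


def pass_rate(passed: int, total: int) -> float:
    """Percentage of questions passed."""
    return 100.0 * passed / total


# Hypothetical model response wrapped in a Markdown fence.
raw = "```python\ndef add(a, b):\n    return a + b\n```"
cleaned = strip_markdown_fences(raw)

print(is_directly_executable(raw))      # the fence itself is a syntax error
print(is_directly_executable(cleaned))  # the code inside the fence compiles
print(pass_rate(35, 50), pass_rate(26, 50))
```

With a check like this, a correct answer that merely gained Markdown fences is no longer counted as a failure, which is the kind of adjustment behind the 35/50 vs. 26/50 figures quoted above.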

refulgentis 1031 days ago

1. TL;DR: OpenAI must verify HumanEval data wasn't used in training in order to compare it?

2. Link in the post you replied to.

3. Subjectivity is fine by me! There's a motte-and-bailey flavor to it if we combine your comment and this one, cf. "This is why we use the official numbers."

pclmulqdq 1031 days ago

I think you're assuming that OpenAI is incentivized to benchmark honestly. Like every other company for which a benchmark is a goal, they are not.

somenameforme 1031 days ago

Also for a topic like this, subjectivity is all there really is. Even if you create some metric, what you prioritize is going to be subjective. Because performance is going to vary against different sorts of tasks, and there are a literally infinite number of categories of tasks, so it's not like you can ever truly get a fair sampling.

Because of this, a sample of subjective opinions is probably much more valuable than any official metric, especially if that metric comes from, as you mentioned, individuals/orgs who are highly motivated to game it endlessly. Even when it comes from an external source you end up with a similar risk of it being gamed. It's like how old school Google puzzle interviews went from seeing who was most clever [in that domain], to seeing who'd booked up the most.

refulgentis 1031 days ago

Well, no, we have the HumanEval results for the June release.

somenameforme 1030 days ago

Which is both (1) a subjective selection to measure the effectiveness of various chatbots and (2) now subject to gaming from companies using opaque/closed/inaccessible/unverifiable systems, like OpenAI.
