| HN Mirror

by refulgentis 1030 days ago

1. TL;DR: OpenAI must verify HumanEval data wasn't used in training in order to compare it?

2. Link in the post you replied to.

3. Subjectivity is fine by me! There's a motte-and-bailey flavor to it if we combine your comment and this one, cf. "This is why we use the official numbers."

pclmulqdq 1030 days ago

I think you're assuming that OpenAI is incentivized to benchmark honestly. Like every other company for which a benchmark is a goal, they are not.

somenameforme 1030 days ago

Also, for a topic like this, subjectivity is all there really is. Even if you create some metric, what you prioritize is going to be subjective. Performance is going to vary across different sorts of tasks, and there are a literally infinite number of categories of tasks, so it's not like you can ever truly get a fair sampling.

Because of this, a sample of subjective opinions is probably much more valuable than any official metric, especially if that metric comes from, as you mentioned, individuals/orgs who are highly motivated to game it endlessly. Even when it comes from an external source, you end up with a similar risk of it being gamed. It's like how old-school Google puzzle interviews went from seeing who was most clever [in that domain] to seeing who'd booked up the most.

refulgentis 1029 days ago

Well, no, we have the HumanEval results for the June release.

somenameforme 1029 days ago

Which is both (1) a subjective selection to measure the effectiveness of various chatbots and (2) now subject to gaming from companies using opaque/closed/inaccessible/unverifiable systems, like OpenAI.
