| HN Mirror



	by zamadatix 542 days ago
	I don't follow how 10 random humans can beat the average STEM college grad and average humans in that tweet. I suspect it's really "a panel of 10 randomly chosen experts in the space" or something? I agree the most interesting thing to watch will be cost for a given score more than maximum possible score achieved (not that the latter won't be interesting by any means).

6 comments

bcrosby95 542 days ago

Two heads are better than one. 10 is way better, even if they aren't experts in the field. You're bound to get random people who remember random stuff from high school, college, work, and life in general, allowing them to piece together a solution.


inerte 542 days ago

Aaaah thanks for the explanation. PANEL of 10 humans, as in, they were all together. I parsed the phrase as "10 random people" > "average human" which made little sense.


modeless 542 days ago

Actually I believe that he did mean 10 random people tested individually, not a committee of 10 people. The key being that the question is considered to be answered correctly if any one of the 10 people got it right. This is similar to how LLMs are evaluated with pass@5 or pass@10 criteria (because the LLM has no memory so running it 10 times is more like asking 10 random people than asking the same person 10 times in a row).

I would expect 10 random people to do better than a committee of 10 people because 10 people have 10 chances to get it right while a committee only has one. Even if the committee gets 10 guesses (which must be made simultaneously, not iteratively) it might not do better because people might go along with a wrong consensus rather than push for the answer they would have chosen independently.
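A rough sketch of the arithmetic behind that pass@10-style comparison, assuming each of the k people (or model runs) solves a task independently with the same probability p. The name `any_of_k` and the independence assumption are illustrative, not from the tweet, and this is the simple closed form rather than the estimator benchmark papers use:

```python
def any_of_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts is correct.

    p is the (assumed identical) per-attempt success probability.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    # All k attempts fail with probability (1 - p) ** k.
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8):
        print(f"p={p}: one attempt {any_of_k(p, 1):.3f}, ten attempts {any_of_k(p, 10):.3f}")
```

Under those assumptions, a per-person success rate of 50% already gives a 10-person "any correct" score above 99.9%. Correlated mistakes would pull that number down.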


elcomet 542 days ago

He means 10 humans voting for the answer


generic92034 542 days ago

Whether it works that way at all depends on the group dynamic. It is easily possible that a not-so-bright individual takes an (unofficial) leadership position in the group and overrides the input of smarter members. Think of any meeting with various hierarchy levels in a company.


daveguy 541 days ago

The ARC AGI questions can be a little tricky, but the solutions can generally be easily explained. And you get 3 tries. So, the 3 best descriptions of the solution voted on by 10 people are going to be very effective. The problem space just isn't complicated enough for an unofficial "leader" to sway the group to 3 wrong answers.


herval 542 days ago

Depends on the task, no?

Do you have a sense of what kind of task this benchmark includes? Are they more “general”, such that random people would fare well, or more specialized (i.e., something a STEM grad studied that isn’t common knowledge)?


judge2020 542 days ago

It does, which is why I don’t really subscribe to any test like this being great for actually determining “AGI”. A true AGI would be able to continuously train and create new LLMs that enable it to become a SME in entirely new areas.


zamadatix 541 days ago

Aha, "at least 1 of a panel of 10", not "the panel of 10 averaged"! Thanks, that makes so much more sense to me now.

I have failed the real ARC AGI :)


dlkf 542 days ago

If you take a vote of 10 random people, then as long as their errors are not perfectly correlated, you’ll do better than asking one person.

https://en.m.wikipedia.org/wiki/Ensemble_learning
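To make the voting claim concrete, here is a minimal sketch that assumes a yes/no question, voters who are each right independently with probability p, and coin-flip tie-breaking. All three are simplifying assumptions added for illustration, and `majority_accuracy` is a hypothetical helper:

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority vote of n independent voters is correct.

    Each voter is right with probability p on a two-option question.
    An exact tie (possible when n is even) counts as a coin flip.
    """
    total = 0.0
    for correct in range(n + 1):
        # Binomial probability that exactly `correct` voters are right.
        prob = comb(n, correct) * p**correct * (1.0 - p) ** (n - correct)
        if 2 * correct > n:
            total += prob          # strict majority is right
        elif 2 * correct == n:
            total += 0.5 * prob    # tie, broken at random
    return total


if __name__ == "__main__":
    for p in (0.4, 0.5, 0.6, 0.7):
        print(f"p={p}: single voter {p:.3f}, vote of 10 {majority_accuracy(p, 10):.3f}")
```

In this toy model the vote only beats a single voter when p is above 0.5. Below that, voting amplifies the error instead.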


shkkmo 542 days ago

It is fairly well documented that groups of people can show cognitive abilities that exceed those of any individual member. The classic example is that if you ask a group of people to estimate the number of jellybeans in a jar, you can get a more accurate result than if you test to find the person with the highest accuracy and use their guess.

This isn't to say groups always outperform their members on all tasks, just that it isn't unusual to see a result like that.
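A small simulation of the jellybean setup, as a sketch only: it assumes guesses are unbiased with Gaussian noise, which is an assumption added here rather than a documented property of real crowds, and `crowd_vs_individuals` is a hypothetical helper. It compares the error of the averaged guess with the typical and best individual errors:

```python
import random
import statistics


def crowd_vs_individuals(true_count: float, n_people: int, noise_sd: float, seed: int = 0):
    """Return (crowd_error, mean_individual_error, best_individual_error).

    Guesses are simulated as true_count plus Gaussian noise (an assumption).
    """
    rng = random.Random(seed)
    guesses = [rng.gauss(true_count, noise_sd) for _ in range(n_people)]
    # Error of the averaged guess versus the errors of each person alone.
    crowd_error = abs(statistics.fmean(guesses) - true_count)
    individual_errors = [abs(g - true_count) for g in guesses]
    return crowd_error, statistics.fmean(individual_errors), min(individual_errors)


if __name__ == "__main__":
    crowd, typical, best = crowd_vs_individuals(true_count=1000, n_people=50, noise_sd=300)
    print(f"crowd average error: {crowd:.1f}")
    print(f"typical individual error: {typical:.1f}")
    print(f"best individual error: {best:.1f}")
```

The averaged guess can never be worse than the typical individual error (that follows from the triangle inequality), but whether it beats the single best guesser varies from run to run, which matches the caveat above.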


zamadatix 541 days ago

Yes, my shortcoming was in not understanding that the 10 were implied to have their successes merged together by being a panel, rather than just being the average of a special selection.


hmottestad 542 days ago

It might be that within a group of 10 randomly chosen people, when each person attempts to solve the tasks, at least 99% of the time 1 of the 10 will get it right.


HDThoreaun 542 days ago

ARC-AGI is essentially an IQ test. There is no "expert in the space". It's just a question of whether you're able to spot the pattern.


olalonde 541 days ago

Even if you assume that non-STEM grads are dumb, isn't there a good probability of having a STEM graduate among 10 random humans?
