
	by cosmojg 946 days ago
	> In the first batch of participants collected via Amazon Mechanical Turk, each received 11 problems (this batch also only had two “minimal Problems,” as opposed to three such problems for everyone else). However, preliminary data examination showed that some participants did not fully follow the study instructions and had to be excluded (see Section 5.2).

	If they stuck to the average Mechanical Turk worker instead of filtering for "Master Workers," the parent's conclusions likely would've aligned with those of the study. Unfortunately, it seems the authors threw out the only data that didn't support their hypothesis as GPT-4 did, in fact, outperform the median Mechanical Turk worker, particularly in terms of instruction following.

nrfulton 946 days ago

> Unfortunately, it seems the authors threw out the only data that didn't support their hypothesis as GPT-4 did, in fact, outperform the median Mechanical Turk worker, particularly in terms of instruction following.

MTurk, to a first approximation, is a marketplace that pays people pennies to fill out web forms. The obvious thing happens. The median Mechanical Turk worker probably either isn't a human, isn't just a (single) human, and/or is a (single) human but is barely paying attention + possibly using macros. Or even just button mashing.

That was true even before GPT-2. Tricks like attention checks and task-specific subtle captcha checks have been around for almost as long as the platform itself. Vaguely psychometric tasks such as ARC are particularly difficult -- designing hardened MTurk protocols in that regime is a fucking nightmare.

The type of study that the authors ran is useful if your goal is to determine whether you should use outputs from a model or deal with MTurk. But results from study designs like the one in the paper rarely generalize beyond the exact type of HIT you're studying and the exact workers you finally identify. And even then you need constant vigilance.

I genuinely have no idea why academics use MTurk for these types of small experiments. For a study of this size, getting human participants that fit some criteria to show up at a physical lab space or log in to a Zoom call is easier and more robust than getting a sufficiently non-noisy sample from MTurk. The first derivative on your dataset size has to be like an order of magnitude higher than the overall size of the task they're doing for the time investment of hardening an MTurk HIT to even begin to make sense.

warkdarrior 946 days ago

This is just coming up with excuses for the MTurk workers. "they were barely paying attention", "they were button mashing", "they weren't a single human", etc.

It turns out that GPT-4 does not have those problems. The comparison in the paper is not really fair, since it does not compare average humans vs GPT-4, it compares "humans that did well at our task" vs GPT-4.

nrfulton 946 days ago

> This is just coming up with excuses for the MTurk workers

No. The authors are not trying to study MTurk market dynamics. They are trying to compare humans and LLMs.

Both questions are interesting and useful. This study is only asking about the second question. That's okay. Isolating specific questions and studying them without a bunch of confounds is one of the basic principles of experiment design. The experiment isn't intended to answer every question all at once. It's intended to answer one very specific question accurately.

LLMs can both be worse at Mensa tasks and also better than humans at a variety of reasoning tasks that have economic value. Or, LLMs can be worse at those reasoning tasks but still reasonably good enough and therefore better on a cost-adjusted basis. There's no contradiction there, and I don't think the authors have this confusion.

> The comparison in the paper is not really fair

The study is not trying to fairly compare these two methods of getting work done in general. It's trying to study whether LLMs have "abstraction abilities at humanlike levels", using Mensa puzzles as a proxy.

You can take issue with the goal of the study (like I do). But given that goal, the authors' protocols are completely reasonable as a minimal quality control.

Or, to put this another way: why would NOT filtering out clickbots and humans speedrunning surveys for $0.25/piece result in a more insightful study given the authors' stated research question?

> It turns out that GPT-4 does not have those problems.

I think the authors would agree but also point out that these problems aren't the ones they are studying in this particular paper. They would probably suggest that this is interesting future work for themselves, or for labor economists, and that their results in this paper could be incorporated into that larger study (which would hopefully generalize beyond MTurk in particular, since MTurk inter alia are such uniquely chaotic subsets of the labor market).

For me, the problems with the study are:

1. The question isn't particularly interesting because no one cares about Mensa tests. These problem sets make an implicit assumption that psychometric tools which have some amount of predictive power for humans will have similar predictive power for LLMs. I think that's a naive assumption, and that even if correlations exist the underlying causes are so divergent that the results are difficult to operationalize. So I'm not really sure what to do with studies like this until I find an ethical business model that allows me to make money by automating Mensa-style test-taking en masse. Which I kind of hope will never exist, to be honest.

2. MTurk is a hit mess (typo, but sic). If you want to do this type of study, just recruit human participants in the old-fashioned ways.

But given the goal of the authors, I don't think applying MTurk filters is "unfair". In fact, if anything, they're probably not doing enough.
