|
|
|
|
|
by cosmojg
946 days ago
|
|
> In the first batch of participants collected via Amazon Mechanical Turk, each received 11 problems (this batch also only had two “minimal Problems,” as opposed to three such problems for everyone else). However, preliminary data examination showed that some participants did not fully follow the study instructions and had to be excluded (see Section 5.2). If they stuck to the average Mechanical Turk worker instead of filtering for "Master Workers," the parent's conclusions likely would've aligned with those of the study. Unfortunately, it seems the authors threw out the only data that didn't support their hypothesis as GPT-4 did, in fact, outperform the median Mechanical Turk worker, particularly in terms of instruction following. |
|
MTurk, to first approximate, is a marketplace that pays people pennies to fill out web forms. The obvious thing happens. The median Mechanical Turk worker probably either isn't a human, isn't just a (single) human, and/or is a (single) human but is barely paying attention + possibly using macros. Or even just button mashing.
That was true even before GPT-2. Tricks like attention checks and task-specific subtle captcha checks have been around for almost as long as the platform itself. Vaguely psychometric tasks such as ARC are particularly difficult -- designing hardened MTurk protocols in that regime is a fucking nightmare.
The type of study that the authors ran is useful if your goal is to determine whether you should use outputs from a model or deal with MTurk. But results from study designs like the one in the paper rarely generalize beyond the exact type of HIT you're studying and the exact workers you finally identify. And even then you need constant vigilance.
I genuinely have no idea why academics use MTurk for these types of small experiments. For a study of this size, getting human participants that fit some criteria to show up at a physical lab space or login to a zoom call is easier and more robust than getting a sufficiently non-noisy sample from MTurk. The first derivative on your dataset size has to be like an order of magnitude higher than the overall size of the task they're doing for the time investment of hardening an MTurk HIT to even begin make sense.