by nrfulton
941 days ago
> Unfortunately, it seems the authors threw out the only data that didn't support their hypothesis as GPT-4 did, in fact, outperform the median Mechanical Turk worker, particularly in terms of instruction following.

MTurk, to a first approximation, is a marketplace that pays people pennies to fill out web forms. The obvious thing happens. The median Mechanical Turk worker probably isn't a human, isn't just a (single) human, and/or is a (single) human but is barely paying attention + possibly using macros. Or even just button mashing.

That was true even before GPT-2. Tricks like attention checks and task-specific subtle captcha checks have been around for almost as long as the platform itself. Vaguely psychometric tasks such as ARC are particularly difficult -- designing hardened MTurk protocols in that regime is a fucking nightmare.

The type of study that the authors ran is useful if your goal is to determine whether you should use outputs from a model or deal with MTurk. But results from study designs like the one in the paper rarely generalize beyond the exact type of HIT you're studying and the exact workers you finally identify. And even then you need constant vigilance.

I genuinely have no idea why academics use MTurk for these types of small experiments. For a study of this size, getting human participants who fit some criteria to show up at a physical lab space or log in to a Zoom call is easier and more robust than getting a sufficiently non-noisy sample from MTurk. The first derivative on your dataset size has to be like an order of magnitude higher than the overall size of the task they're doing for the time investment of hardening an MTurk HIT to even begin to make sense.
|
It turns out that GPT-4 does not have those problems. The comparison in the paper is not really fair, since it does not compare average humans vs GPT-4; it compares "humans that did well at our task" vs GPT-4.