| HN Mirror

> This is just coming up with excuses for the MTurk workers

No. The authors are not trying to study MTurk market dynamics. They are trying to compare humans and LLMs.

Both questions are interesting and useful. This study is only asking about the second question. That's okay. Isolating specific questions and studying them without a bunch of confounds is one of the basic principles of experiment design. The experiment isn't intended to answer every question all at once. It's intended to answer one very specific question accurately.

LLMs can both be worse at Mensa tasks and also better than humans at a variety of reasoning tasks that have economic value. Or, LLMs can be worse at those reasoning tasks but still reasonably good enough and therefore better on a cost-adjusted basis. There's no contradiction there, and I don't think the authors have this confusion.

> The comparison in the paper is not really fair

The study is not trying to fairly compare these two methods of getting work done in general. It's trying to study whether LLMs have "abstraction abilities at humanlike levels", using Mensa puzzles as a proxy.

You can take issues with the goal of the study (like I do). But given that goal, the authors' protocols are completely reasonable as a minimal quality control.

Or, to put this another way: why would NOT filtering out clickbots and humans speedrunning surveys for $0.25/piece result in a more insightful study given the author's stated research question?

> It turns out that GPT-4 does not have those problems.

I think the authors would agree but also point out that these problems aren't the ones they are studying in this particular paper. They would probably suggest that this is interesting future work for themselves, or for labor economists, and that their results in this paper could be incorporated into that larger study (which would hopefully generalize beyond MTurk in particular, since MTUrk inter alia are such uniquely chaotic subsets of the labor market).

For me, the problems with the study are:

1. The question isn't particularly interesting because no one cares about Mensa tests. These problem sets make an implicit assumption that psychometric tools which have some amount of predictive power for humans will have similar predictive power for LLMs. I think that's a naive assumption, and that even if correlations exist the underlying causes are so divergent that the results are difficult to operationalize. So I'm not really sure what to do with studies like this until I find an ethical business model that allows me to make money by automating Mensa style test-taking en masse. Which I kind of hope will ever exist, to be honest.

2. MTurk is a hit mess (typo, but sic). If you want to do this type of study just recruit human participants in the old fashioned ways.

But given the goal of the authors, I don't think applying MTurk filters is "unfair". In fact, if anything, they're probably not doing enough.