|
I think this analysis is misguided. Even granting a historic bias toward counter-intuitive results in social science, this has no bearing on the results of the paper being discussed. Most of the survey experiments that the researchers used in their analyses came from TESS, an NSF-funded program that collects well-powered, nationally representative samples for researchers. A key thing to note here is that not every study from TESS gets published. Of course, some do, but the researchers find that GPT-4 can predict the results of both published and unpublished studies at a similar rate of accuracy (r = 0.85 for published studies and r = 0.90 for unpublished studies). Also, the fact that the majority of these studies 1) were pre-registered (even pre-registering sample size), 2) had their data collected through TESS (an independent survey vendor), and 3) were well-powered and nationally representative makes it extremely unlikely that they were p-hacked. Therefore, regardless of what the researchers hypothesized, TESS still collected the data, and the data is of the highest quality within social science. Moreover, the researchers don't just look at psychology or sociology studies; there are also studies from other fields, such as political science and social policy, so your critiques about psychology don't apply to all the survey experiments. Lastly, the study also includes a number of large-scale behavioral field experiments and finds that GPT-4 can accurately predict the results of these field experiments, even when the dependent variable is a behavioral metric and not just a text-based response (e.g., figuring out which text messages encourage greater gym attendance). It's hard for me to see how your critique holds up in light of this fact as well. |
The points specific to psychology apply to most fields in the soft sciences, given their typical research techniques.
The main point is that prior research shows absolutely no difference between field experts and random people in predicting the results of studies, whether pre-registered studies, replications, or others.
GPT-4 achieving approximately the same success rate as any person has nothing whatsoever to do with it simulating people. I suspect an 8-year-old could predict which psychology studies will have replicated 10 years later with about the same accuracy. It's also key that in prior studies, like the one I linked, this same lack of difference occurred even when the people involved were provided with additional recent resources from the field, although prediction accuracy was higher.
The meat of the issue is simple: show me a true positive study, make the predictions on whether it will replicate, and let's see in 10 years, when replication efforts have been carried out, whether GPT-4 does any better than a random 10-year-old who has no information on the study. The implied claim here is that since GPT-4 can supposedly simulate sociology experiments and so more accurately judge the results, we can iterate on it and eventually conduct science that way, or speed up the scientific process. I am telling you that the simulation aspect has nothing to do with the success of the algorithm, which is not really outperforming humans because, to put it simply, humans are bad at using any subject-specific or case knowledge to predict the replication or success of a specific study (there is no difference between lay people and experts), and the entire set of published work is naturally biased anyhow. In other words, this style may elicit higher test scores simply by altering the prompt.
The description of GPT-4's role here as "simulating" is a human theoretical construction. We know that people with a knowledge advantage are not able to apply it to predict results any more accurately than lay people. That is because they are trying to predict a biased dataset. The field of sociology as a whole, like most fields that study humans (because they are vastly underfunded for large samples), struggles to replicate results or conduct science in a reliable, repeatable way, and until we resolve that, the claims that GPT-4 simulates people are spurious and unrelated at best, and misleading at worst.