| HN Mirror


by equinox12 681 days ago

I think this analysis is misguided.

Even considering a historic bias for counter-intuitive results in social science, this has no bearing on the results of the paper being discussed. Most of the survey experiments that the researchers used in their analyses came from TESS, an NSF-funded program that collects well-powered nationally representative samples for researchers. A key thing to note here is that not every study from TESS gets published. Of course, some do, but the researchers find that GPT4 can predict the results of both published and unpublished studies at a similar rate of accuracy (r = 0.85 for published studies and r = 0.90 for unpublished studies). Also, given that the majority of these studies 1) were pre-registered (even pre-registering sample size), 2) had their data collected through TESS (an independent survey vendor), and 3) were well-powered and nationally representative, it is extremely unlikely that they were p-hacked. Therefore, regardless of what the researchers hypothesized, TESS still collected the data and the data is of the highest quality within social science.
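For readers unfamiliar with the r values quoted above: they are correlations between predicted and observed effects across studies. Here is a minimal sketch of how such a figure could be computed. The function name `pearson_r` and the effect sizes are made up for illustration and are not taken from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical effect sizes, for illustration only (NOT data from the paper).
predicted_effects = [0.10, 0.25, -0.05, 0.40, 0.15]
observed_effects = [0.12, 0.20, 0.00, 0.35, 0.10]

r = pearson_r(predicted_effects, observed_effects)
print(f"r = {r:.2f}")
```

An r near 1 means the predictions track the observed effects closely across studies; it says nothing by itself about why they track.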

Moreover, the researchers don't just look at psychology or sociology studies; there are also studies from other fields like political science and social policy, so your critiques about psychology don't apply to all the survey experiments.

Lastly, the study also includes a number of large-scale behavioral field experiments and finds that GPT4 can accurately predict the results of these field experiments, even when the dependent variable is a behavioral metric and not just a text-based response (e.g., figuring out which text messages encourage greater gym attendance). It's hard for me to see how your critique works in light of this fact also.


authorfly 681 days ago

Yes, I am sure you would have said the same about the research before 2011 and the replication crisis, when it was always claimed that scientists like Bem (premonition) and Baumeister (ego depletion) could not possibly be faking their findings - they contributed so much, their models have "theoretical validity", they had hundreds of studies and other researchers building on their work! They had big samples. Regardless of TESS/NSF, the studies it focuses on have been funded (as you mention) and they were simply not chosen randomly. People had to apply for grants. They had to bring in early, previous or prototype results to convince people to fund them.

The points specific to psychology apply to most fields in the soft sciences, given their typical research techniques.

The main point is that prior research shows absolutely no difference between field experts and random people in predicting the results of studies, whether pre-registered studies, replications, or others.

GPT-4 achieving the same approximate success rate as any person has nothing whatsoever to do with it simulating people. I suspect an 8 year old could reliably predict psychology replications after 10 years with about the same accuracy. It's also key that in prior studies, like the one I linked, this same lack of difference occurred even when the people involved were provided additional recent resources from the field, although with higher prediction accuracy.

The meat of the issue is simple - show me a true positive study, make the predictions on whether it will replicate, and let's see in 10 years, when replication efforts have been carried out, whether GPT-4 does any better than a random 10 year old who has no information on the study. The implied claim here is that since GPT-4 can supposedly simulate sociology experiments and so more accurately judge the results, we can iterate it and eventually conduct science that way or speed up the scientific process. I am telling you that the simulation aspect has nothing to do with the success of the algorithm, which is not really outperforming humans because, to put it simply, humans are bad at using any subject-specific or case knowledge to predict the replication/success of a specific study (there is no difference between lay people and experts) and the entire set of published work is naturally biased anyhow. In other words, this style may elicit higher test score results, by altering the prompt.

The description of the role of GPT-4 here as simulating is a human theoretical construction. We know that people with a knowledge advantage are not able to apply this to predicting output results any more accurately than lay people. That is because they are trying to predict a biased dataset. The field of sociology as a whole, like most fields that study humans (because they are vastly underfunded for large samples), struggles to replicate or conduct science in a reliable, repeatable way, and until we resolve that, the GPT-4 claims of simulating people are spurious and unrelated at best, misleading at worst.


equinox12 681 days ago

I'm not sure how to respond to your point about Bem and Baumeister's work since those cases are the most obvious culprits for being vulnerable to scientific weakness/malpractice (in particular, because they came before the time of open access science, pre-registration, and sample sizes calculated from power analyses).

I also don't get your point about TESS. It seems obvious that there are many benefits to choosing the repository of TESS studies from the authors' perspective. Namely, it conveniently allows for a consistent analytic approach since many important things are held constant between studies, such as 1) the studies have the exact same sample demographics (which prevents accidental heterogeneity in results due to differences in participant demographics) and 2) the way in which demographic variables are measured is standardized so that the only difference between survey datasets is the specific experiment at hand (this is crucial because variation in how demographic variables are measured can affect the interpretation of results). This is apart from the more obvious benefits that the TESS studies cover a wide range of social science fields (like political science, sociology, psychology, communication, etc.), which allows for testing the robustness of GPT predictions across multiple fields, and that all of the studies are well-powered nationally representative probability samples.

Re: your point about experts being equal to random people in predicting results of studies, that's simply not true. The current evidence on this shows that, most of the time, experts are better than laypeople when it comes to predicting the results of experiments. For example, this thorough study (https://www.nber.org/system/files/working_papers/w22566/w225...) finds that the average of expert predictions outperforms the average of laypeople predictions. One thing I will concede here though is that, despite social scientists being superior at predicting the results of lab-based experiments, there seems to be growing evidence that social scientists are not particularly better than laypeople at predicting domain-relevant societal change in the real world (e.g., clinical psychologists predicting trends in loneliness) [https://www.cell.com/trends/cognitive-sciences/abstract/S136... ; full-text pdf here: https://www.researchgate.net/publication/374753713_When_expe...]. Nonetheless, your point about there being no difference in the predictive capabilities of experts vs. laypeople (which you raise multiple times) is just not supported by any evidence since, especially in the case of the GPT study we're discussing, most of the analyses focus on predicting survey experiments that are run by social science labs.

Also, from what the paper says, the authors don't seem to be suggesting that these are "replications" of the original work. Rather, GPT4 is able to simulate the results of these experiments like true participants. To fully replicate the work, you'd need to do a lot more (in particular, you'd want to do 'conceptual replications' wherein the underlying causal model is validated but now with different stimuli/questions).

Finally, to address the previous discussion about the authors finding that GPT4 seems to be comparable to human forecasters in predicting the results of social science experiments, let's dig deeper into this. In the paper, but specifically in the supplemental material, the authors note that they "designed the forecasting study with the goal of giving forecasters the best possible chance to make accurate predictions." The way they do this is by showing laypeople the various conditions of the experiment and having the participants predict where the average response for a given dependent variable would be within each of those conditions. This is very different from how GPT4 predicts the results of experiments in the study. Specifically, they prompt GPT to be a respondent and do this iteratively (feeding it different demographic info each time). The result of this is essentially the same raw data that you would get from actually running the experiment. In light of this, it's clear that this is a very conservative way of testing how much better GPT is than humans at predicting results and they still find comparable performance. All that said, what's so nice about GPT being able to predict social science results just as well as (or perhaps better than) humans? Well, it's much cheaper (and more efficient) to run thousands of GPT queries than it is to recruit thousands of human participants!
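To make the respondent-by-respondent procedure concrete, here is a rough sketch of its shape. It is not the authors' code: `simulated_response` is a hypothetical stand-in that returns a noisy 1-7 rating where a real pipeline would call a language model, the demographic profiles are invented, and the round-robin assignment is a simplification of random assignment.

```python
import random
from statistics import mean


def simulated_response(condition, profile, rng):
    """Hypothetical stand-in for an LLM call; returns a 1-7 rating.

    A real version would build a prompt from the condition text and the
    demographic profile, query the model, and parse its reply.
    """
    base = 4.0 + (0.5 if condition == "treatment" else 0.0)
    noisy = base + rng.gauss(0, 1)
    return min(7, max(1, round(noisy)))


def simulate_experiment(conditions, profiles, seed=0):
    """Assign each profile to one condition (round-robin) and collect raw rows."""
    rng = random.Random(seed)
    rows = []
    for i, profile in enumerate(profiles):
        condition = conditions[i % len(conditions)]
        response = simulated_response(condition, profile, rng)
        rows.append({"condition": condition, "response": response, **profile})
    return rows


def condition_means(rows):
    """Average response per condition, as you would compute on real survey data."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["condition"], []).append(row["response"])
    return {condition: mean(values) for condition, values in grouped.items()}


# Invented demographic profiles, for illustration only.
profiles = [{"age": 20 + i, "party": ["dem", "rep", "ind"][i % 3]} for i in range(60)]
rows = simulate_experiment(["control", "treatment"], profiles)
print(condition_means(rows))
```

The point of the sketch is that the output is row-level data, one simulated respondent per row, which is then aggregated the same way real responses would be. Human forecasters in the comparison only guessed the per-condition averages directly.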


authorfly 680 days ago

Fair enough, you might have indeed rejected those authors - however, vast swathes, for Baumeister the majority, did not at the time. It's almost certainly true now for existing authors we are yet to identify, or maybe never will.

I admit the point on TESS, I didn't research that enough. I'll look into that at a later point as I have an interest in learning more.

To address your studies regarding expert / study forecasting - thank you for sharing some papers. I had time and knew some papers in the area so I have formulated a response because, as you allude to later regarding cultural predictions, there is debate on the question of the usefulness of expert vs non-expert forecasts (and e.g. there is a wide base of research on recession/war predictions showing the error rate is essentially random at a certain number of years out). I have not fully comprehended the first paper but I understand the gist of it.

Economics bridges sociology and the harder science of mathematics, and I do think it makes sense for it to be more predictable by experts than psychology studies (and note the studies being predicted were not survey-response like most are in psychology), but even this one paper does not particularly support your point. Critically, the conclusions in the paper you cite include "Forecasters with higher vertical, horizontal, or contextual expertise do not make more accurate forecasts.", "If forecasts are used just to rank treatments, non-experts, including even an easy-to-recruit online sample, do just as well as experts", and "Fourth, experts as a group do better than non-experts, but not if accuracy is defined as rank ordering treatments.". Also: "The experts are indistinguishable with respect to absolute forecast error, as Column 7 of Table 4 also shows... Thus, various measures of expertise do not increase accuracy". Critically, at a glance, of the selected statements, almost 40% are outperformed by non-experts anyhow in Table 2 (the last column). I also question the use of MTurk workers as lay people (because of historic influences of language and culture on IQ tests, the lay person group would be better being at least geographically or WEIRD-ly similar to the expert groups), but that's a minimal point.
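The distinction between absolute forecast error and rank ordering can be seen in a toy example. This is only a sketch: the numbers are invented, and Spearman's rank correlation is used here as one common way to score rank ordering, which may differ from the paper's own definition.

```python
def mean_absolute_error(forecast, actual):
    """Average absolute gap between forecast and actual values."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)


def ranks(values):
    """Rank positions (1 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(forecast, actual):
    """Spearman rank correlation using the no-ties formula."""
    n = len(actual)
    d_squared = sum(
        (rf - ra) ** 2 for rf, ra in zip(ranks(forecast), ranks(actual))
    )
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented treatment effects and two invented forecasters, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
close_forecast = [11.0, 19.0, 31.0, 41.0]  # nearly right on magnitudes
far_forecast = [1.0, 2.0, 3.0, 4.0]        # badly off on magnitudes, same ordering

print(mean_absolute_error(close_forecast, actual))  # 1.0
print(mean_absolute_error(far_forecast, actual))    # 22.5
print(spearman_rho(close_forecast, actual))         # 1.0
print(spearman_rho(far_forecast, actual))           # 1.0
```

Both forecasters rank the treatments perfectly, yet one is far worse on absolute error, so a group can be ahead on one measure and indistinguishable on the other.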

Another point suggesting that further domain information, simulation or other tactics do not address the root issue of the biased dataset of published papers - "Sixth, using these measures we identify 'superforecasters' among the non-experts who outperform the experts out of sample.". Might we be in danger, with some claim 8 years later with LLMs, of the very "As academics we know so little about the accuracy of expert forecasts that we appear to hold incorrect beliefs about expertise and are not well calibrated in our accuracy." that the paper warns against?

I know what you are getting at: that these are not replications, and that it feels elementally exciting that GPT-4 could simulate a study taking place - rather than a replication as such - and determine the result more accurately than a human forecast. But what I am saying is, historically, we have needed replication data to assess whether human forecasts (expert and non-expert) are correct long term anyhow, and we need those to be for future or current replications, to avoid the training data including the results, before we can draw any conclusion about how GPT-4 gets this forecasting accuracy with any method, simulation or direct answer. The idea that it is cheaper to run GPT queries than recruit human participants makes me wonder if you are actively trolling though - you can't be serious? These are fields in which awful statistics and research go on all the time, awaiting an evolution to a better basic method, and the result is an accuracy 3% higher than a group of experts, when we don't even know whether those studies will replicate in the long run (and yes, even innocently pre-registered research tends to proliferate more false positives because the proportion of pre-registered studies published is not close to 100% and thus the results of false-positive publishing still occur): https://www.youtube.com/watch?v=42QuXLucH3Q

The problem is that until we have more stable fundamentals, small increments and large claims on behaviour are repeating the mistake of anthropomorphizing biological and computational systems before we understand them to the level we need to, to make those claims. I am saying the future is bright in this regard - we will likely understand these systems better and one day be able to make these claims, or counter-claims. And that is exciting.

Now this is a separate topic/argument, but here is why I really care about these non-substantial, but newsworthy claims: Let's not jump the gun for credence. I read a PhD AI paper in 2011. It was the very furthest from making bold claims - people were so low-mooded about AI. That is because AI was pretty much at its lowest in 2011, especially with cuts after the recession. It was a cold part of the "AI winter". Now that AI is raring at full speed, people overclaim. This will cause a new, 3rd AI winter. Trust me, it will, so many members of faculty I know started feeling this way even back in 2020. It's harmful not only to the field but our understanding really, to do this.
