Hacker News
by mNovak 2104 days ago
Does anyone have links to the Replication Prediction Market results mentioned in the article? That sounds super interesting.

As an amusing nudge, I bet you could do some ML to predict replicability of a paper (per author's suggestion that it's laughably easy to predict) and release that as a tool for authors to do some introspection on their experimental design (assuming they're not maliciously publishing junk).

2 comments

Here’s the paper, published only recently: https://royalsocietypublishing.org/doi/10.1098/rsos.200566

> I bet you could do some ML to predict replicability of a paper (per author's suggestion that it's laughably easy to predict)

I am betting any such ML system could be gamed and addressing the issue would ultimately still need humans in the loop. For example, what if I am selective with my data, beyond the visibility of ML evaluating the final published paper? I don’t think this is “laughably easy” to predict. It may be easy to spot telltale signs today that predict replicability, but as soon as those markers are understood, I imagine authors will simply squeeze papers through the cracks in a different way.

Another issue is this bit from the author on Twitter:

> Just because it replicates doesn't mean it's good. A replication of a badly designed study is still badly designed. There are tons of papers doing correlational analyses yet drawing causal conclusions, and many of them will successfully replicate. Doesn't mean they're justified.

IIRC from prior discussions of this, a lot of the accuracy of the markets comes from people just applying common sense - like, if a really surprising claim that people should really have noticed before now comes with a huge effect size, it's probably false. ML can't judge that because it doesn't have the ability to do basic sanity checks on claims like that. It takes a sceptical human with life experience to do that.

Huh? That sounds like exactly the thing that an ML system would learn quickly from the data. You probably don't even need shiny deep learning (though it helps).

Just like with the Netflix Prize, where the conclusion was very similar, i.e., just dump in as much data as you can, crank up the ML machinery, and it'll discover the features (better than you can engineer them) and learn what to use for recommendation ranking. And that's basically what we see with GPT-3 too. If the data contains some useful labels, it'll learn them even without supervision, because with that many parameters they basically stick.

Get some papers and run them through a supervised training phase, where you give it a set with every paper scored by how retracted/bad/unreplicating it is, and you'll get a great predictor. Then run it to find papers that stick out, have a human look at them, and try to replicate some of them to fine-tune the predictor. Plus, keep feeding it new replication results.

That said, using an ML system as the gatekeeper as OP suggested is a bad idea, as it'll quickly result in the loss of proxy variables' predictive power.

Though ultimately a GPT-like system has the capacity to encode "common sense".

Even GPT-3 doesn't encode common sense, which is why it can't do a lot of basic physical reasoning. It's "just" word prediction, albeit very impressive word prediction.

If you look at what GPT produces closely, a lot of it is simply bullshit. It sounds plausible but is wrong. That's exactly the wrong type of AI to detect plausible-but-wrong-bullshit papers, which are the most common type.

Right, I worded that a bit lazily. There's no confidence score output from GPT-3, but if there were, and if the user chose to get only high-confidence results, then it would shut up quickly. And that's what I meant by common sense. Of course it depends on the corpus. It's really-really just text, as you said. (It's possible that it can eventually encode high-level things like arithmetic, but so far it seems that even if it does have that model embedded somewhere, it doesn't know how or when to use it.)

The language model (GPT-3) doesn't have to understand physics, it just has to help extract some of the semantics of the paper.

There needs to be a classifier on top trained with a labeled set of good and bad papers.
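
A minimal sketch of what "a classifier on top" could look like, assuming the language model has already turned each paper into a few numeric features. The three features, the labels, and every number below are invented for illustration; nothing here comes from a real replication dataset, and a real system would use far more data and a proper library.

```python
import math

# Hypothetical features a language model might extract from a paper:
# (scaled sample size, reported effect size, corrects for multiple comparisons)
# Label: 1 = replicated, 0 = failed to replicate. All values are made up.
TRAIN = [
    ((0.9, 0.2, 1.0), 1),
    ((0.8, 0.3, 1.0), 1),
    ((0.7, 0.1, 1.0), 1),
    ((0.2, 0.9, 0.0), 0),
    ((0.1, 0.8, 0.0), 0),
    ((0.3, 0.7, 0.0), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Estimated probability that a paper with these features replicates."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def train(rows, epochs=2000, lr=0.5):
    """Plain logistic regression fitted with stochastic gradient descent."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in rows:
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

weights, bias = train(TRAIN)
solid = predict(weights, bias, (0.85, 0.15, 1.0))  # large sample, modest effect
shaky = predict(weights, bias, (0.15, 0.85, 0.0))  # small sample, huge effect
print(f"solid-looking paper: {solid:.2f}, shaky-looking paper: {shaky:.2f}")
```

The point is only the division of labour: the language model supplies features, and a small supervised model trained on labeled good and bad papers does the scoring.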

I think there is a confidence score actually! Most blogs about it don't show them but this one went into it:

https://arr.am/2020/07/25/gpt-3-uncertainty-prompts/

It's really cool how the uncertainty prompts alter the confidence associated with the next words.
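
For what it's worth, one common way to get a confidence number out of a language model is to turn its raw next-token scores into probabilities with a softmax and stay silent below a threshold. A rough sketch follows; the scores, the threshold, and the function names are all made up, and this is not GPT-3's actual API or output.

```python
import math

def softmax(scores):
    """Convert raw per-token scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {token: math.exp(s - top) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

def answer_if_confident(scores, threshold=0.8):
    """Return the most likely token, or None if the model is not sure enough."""
    probs = softmax(scores)
    token, p = max(probs.items(), key=lambda kv: kv[1])
    return token if p >= threshold else None

# Hypothetical next-token scores, invented for illustration.
sure = {"yes": 5.0, "no": 1.0, "maybe": 0.0}
unsure = {"yes": 1.0, "no": 0.9, "maybe": 0.8}

print(answer_if_confident(sure))    # the top token clears the threshold
print(answer_if_confident(unsure))  # no token clears it, so nothing is returned
```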

I guess I'm not disagreeing with you in the abstract that a theoretically strong enough AI could identify bad papers, especially if it had some help for 'real' arithmetic. It at least could flag the most basic issues like plagiarism, cited documents that don't contain the cited fact, etc. Detecting claims that are themselves implausible seems like the hardest task possible, however. That's very close to general AI.

> Detecting claims that are themselves implausible seems like the hardest task possible, however. That's very close to general AI.

Yes, of course. I was just trying to say that an AI can be quite successful in detecting the usual no-nos, e.g., multiple comparisons without correcting for them, p-hacking, or ... who knows what "feature" the classifier would find. Maybe there's none, so it'll really be up to subject matter experts to review them. (But that's unlikely, because there are quite successful blogs devoted to picking apart shoddy papers based purely on looking at the controls, other parts of the experiment design and the methods sections, and of course the aforementioned stats.)
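
To make the multiple-comparisons no-no concrete, here is a small sketch using the Bonferroni correction, which is just one of several standard corrections. The p-values are invented and `bonferroni_significant` is a name chosen for this example.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which results stay significant after a Bonferroni correction.

    With m tests, each p-value is compared against alpha / m instead of alpha.
    """
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Hypothetical p-values from five comparisons reported in one paper.
p_values = [0.004, 0.03, 0.04, 0.2, 0.6]

naive = [p <= 0.05 for p in p_values]        # no correction applied
corrected = bonferroni_significant(p_values)  # each test held to 0.05 / 5 = 0.01

print("naive:    ", naive)
print("corrected:", corrected)
```

Three of the five results look significant without a correction, but only the first survives it. A checker that sees many tests reported and no correction mentioned could flag exactly this.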