by rowyourboat 932 days ago
Isn't the main problem that academics are measured by the number of publications they publish, and reproductions of existing studies aren't published by the main journals, thus there is little incentive to try and reproduce findings? I never thought this was a problem of ability.
5 comments

You're also leaving out the biggest issue. Journals generally don't want to publish negative results. If you spend time researching [shocking possibility] and it turns out that [shocking possibility] isn't true, you're not getting published. It motivates everything from HARKing [1] to outright data manipulation. By contrast, if negative results were seen as valuable, none of this would be an issue.

On the other hand, it really is the case that there's just not much value in learning that [shocking possibility] is, as everybody would naturally expect, indeed not the case. And filling up limited journal space with such discoveries would seem to be counter-productive, at best. And when you have limited space/funding for researchers, one guy who keeps proving that everything everybody already believes to be false is indeed false is always going to be perceived as less valuable than one making [shocking discovery] [... which ends up being proven false years later].

[1] - https://en.wikipedia.org/wiki/HARKing

> If you spend researching [shocking possibility] and it turns out that [shocking possibility] isn't true, you're not getting published

But this simply isn't true in physics, where negative results are very common. This is at least an existence proof that this can work; people just have to get their heads straight on what research means.

By "journal space" you of course mean journal prestige, which isn't unlimited. The point of science journals is gatekeeping.
The biggest problem is honestly obtained incorrect results. If you run 1000 experiments across 1000 labs, a few of them will, statistically, fail to notice a mistake and get a wrong result. That wrong result then gets published because it is surprising.
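To put a rough number on "a few", here is a minimal sketch. The 5% per-lab error rate is purely my assumption for illustration (the comment above gives no rate), and the function names are made up.

```python
import random


def expected_wrong_results(n_labs: int, error_rate: float) -> float:
    """Expected number of labs that honestly end up with a wrong result."""
    return n_labs * error_rate


def simulate_wrong_results(n_labs: int, error_rate: float, seed: int = 0) -> int:
    """Simulate one round: each lab independently errs with probability error_rate."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return sum(rng.random() < error_rate for _ in range(n_labs))


if __name__ == "__main__":
    # Assumed numbers: 1000 labs, 5% chance each of an unnoticed mistake.
    print("expected:", expected_wrong_results(1000, 0.05))
    print("one simulated round:", simulate_wrong_results(1000, 0.05))
```

Under that assumed rate you would expect around 50 wrong results out of 1000, and if only the surprising ones get written up, those are the ones that reach a journal.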
I think there are some strong arguments against this. The first is numerical. Fields like social psychology are seeing replication rates as low as the twenties. And not just from low-hanging fruit, but from journals like "The Journal of Personality and Social Psychology", which has one of the highest impact factors across all psychology journals, and a 23% replication success rate! [1] This [2] is a Google search for site:nytimes.com "Journal of Personality and Social Psychology". It's interesting to see how many [shocking discovery]s, many of which end up being shared on this site, come from this particular journal.

Furthermore, I think you can often see poorly done science in the papers themselves. They will use suggestive wording in surveys and unreliable sources for sampling such as Amazon Mechanical Turk, and maybe one of the biggest tells is measuring a large number of unnecessary variables. That does very little to further your experiment, but absolutely ensures you can p-hack your way to a statistically significant result. Another is ignoring confounders so patently obvious that one can't reasonably appeal to Hanlon's razor.
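A rough sketch of why extra variables make p-hacking so easy. The assumptions are mine: there is no real effect anywhere, every variable gets its own independent test, and the threshold is p < 0.05. Under the null, p-values are uniformly distributed, so each test is simulated as a single uniform draw. Function names are hypothetical.

```python
import random


def analytic_hit_rate(n_variables: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null tests has p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_variables


def simulated_hit_rate(n_variables: int, alpha: float = 0.05,
                       n_studies: int = 10_000, seed: int = 0) -> float:
    """Share of simulated studies with at least one 'significant' variable."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # One uniform draw per measured variable stands in for its p-value.
        if any(rng.random() < alpha for _ in range(n_variables)):
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(analytic_hit_rate(k), 3), round(simulated_hit_rate(k), 3))
```

With 20 unrelated variables the analytic chance of at least one "significant" finding is about 64%, even though nothing real is going on. Correlated variables would change the exact number but not the direction.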

[1] - https://en.wikipedia.org/wiki/Replication_crisis#In_psycholo...

[2] - https://www.google.com/search?q=site%3Anytimes.com%20%22Jour...

It can also be hard to judge whether replication failed because the result is bogus or replication failed because the replication team is themselves incompetent.
Did they follow the exact same steps claimed to result in something? Did it result in that thing?

The replication team being incompetent sounds like a cop-out. If that happens, the researcher did a bad job and it’s on them to explain better. No excuses: if it can’t be reproduced, it isn’t taken seriously, no exceptions.

You have two groups of people. Either is equally likely to be incompetent.
How do you know which? Both will point fingers at the other.
So just make sure someone unaffiliated has to be able to reproduce whatever research has been conducted. Tough luck if it doesn’t make it, that’s why you do your best to ensure you’ve verified it’s a real result.
Unfortunately this is only part of the problem. Even studies on ML that use public datasets, which are the kinds of studies that when code is shared should be very easy to reproduce, are often surprisingly hard to repeat. Sometimes only parts of the code are published, the code has a lot of bugs (who knows why? Added intentionally?), the code is very badly documented, or the exact libraries are not specified properly.

And this is in a field where everything is based on code, where in principle reproducibility is easy. Go into materials science or chemistry and try to synthesize something following a published paper and you get all sorts of problems. Different equipment, different temperature, not all steps documented, ... Reproducing experimental findings can take you months.

It still largely comes down to incentives from what I've seen. A lot of times all anyone (from the researcher to the reviewer) cares about is the paper. Journals don't check that code actually works, and a lot of researchers don't spend time on preparing their code. They feel there's no need, since they now got a new article on their CV. It's true that they may not have the skills and experience to produce good code they can share (depending on the area), but often 1) there's no time to prep code since they've got 3 other projects going on and a crazy work pace 2) the code is seen as something incidental and secondary - what matters to them is the figures and results 3) some groups want to milk a topic for a few papers so they're guarding their code and data. Luckily at least plenty of journals demand access to data or even making it public.
In fact, there are even more incentives for researchers to make reproducing their work as hard as possible. For example, what if someone tried to reproduce it and found contradictory results? In either case (the reproducer made a mistake, or the original authors did) it's additional hassle from which the original authors can basically only suffer and never gain.
This is just you confirming that tons of research is essentially fraudulent. If it can be contradicted it absolutely should be, that is how fields progress and weed out bad ideas.
Page limits certainly don't help!
Another issue is that making things reproducible costs you time and that is exactly what most researchers do not have. For example, many ML papers have code that is just a barely working Jupyter notebook. To make it reproducible you would have to create a reproducible environment, package the data, and prepare scripts that would rerun all the experiments you have done. That can take several weeks, but it will not increase the chance of acceptance for your paper at all.
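For what it's worth, a small part of that work (recording the environment and pinning the data) can be sketched with just the standard library. This is only an illustration under my own assumptions, not a full solution: `build_manifest`, the manifest layout, and the file names are all made up, and it does not make the environment itself reproducible.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Hash a data file in chunks so large datasets don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files, package_names, seed):
    """Collect what someone would need to know to attempt a rerun."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "seed": seed,
        # metadata.version raises PackageNotFoundError for packages that are not installed
        "packages": {name: metadata.version(name) for name in package_names},
        "data_sha256": {str(p): sha256_of_file(Path(p)) for p in data_files},
    }


if __name__ == "__main__":
    # Hypothetical usage: pass your own data files on the command line.
    manifest = build_manifest(sys.argv[1:], package_names=[], seed=1234)
    print(json.dumps(manifest, indent=2))
```

Saving that JSON next to the results at least tells a reader which interpreter, seed, and exact data files were used. Rerunning the experiments from a script is still the expensive part.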
More precisely, making things reproducible after the fact costs you significant time. There are tools for reproducible setups that take maybe an hour (at most) to set up upfront, after which it takes very little effort to do your work within that framework and keep things reproducible (e.g. Julia has DrWatson, DataDeps, etc.; I'd be surprised if Python doesn't have equivalents).

The problem is knowing upfront which of your work would need to be reproducible, or having the discipline to do all your hacking starting from such reproducible setups.

But Julia and Python tools aren't enough. The whole environment has to be reproducible. So many Python libraries themselves take shortcuts which work on the current Ubuntu or current state of the web, but will fail to build later by the time someone tries to reproduce the result. Shipping a container just hides the implicit dependencies and assumptions. People need to be packaging for Guix en masse for reproducibility to be feasible. Until then, "reproducibility" is just another lie people are telling themselves and others to try and get ahead in their rat races.
So you say "julia and python tools aren't enough" but then proceed to only talk about Python and say a bunch of stuff that is completely inapplicable to Julia.

Do you know much about how reproducibility is approached in Julia? Maybe hold off on calling it a lie if you're not experienced in what you're talking about.

I have asked about Julia's reproducibility story on the Guix mailing list in the past, and at the time Simon Tournier didn't think it was promising. I seem to recall Julia itself didn't have a reproducible build. All I know now is that the GitHub issue is still not closed.

https://github.com/JuliaLang/julia/issues/34753

"Reproducible build" in this sense has nothing to do with scientific reproducibility. That issue is about hash-verifiability for the sake of security, and how some autogenerated random paths included in the binary affect that.

Scientific reproducibility requires only that versioned binaries be functionally equivalent if they have the same version, which is quite independent of this and certainly exists in Julia.

Would love a link to the Guix mailing list discussion, if you can dig it up.

I agree with your first sentence, but saying people are fooling themselves and being overoptimistic (by telling themselves lies) is very different from "calling it a lie" (i.e. intentionally deceiving others). That seems like an unnecessarily negative interpretation of what they said. Even if you disagree with it, that does not deserve such a harsh response.
Maybe the cause is funding sources that fund researchers publishing too often, and not funding other researchers to double check their work
No. Several weeks is the time it takes to learn and master Docker.

About two hours is the cumulative time one must spend on the Dockerfile for a three-week project.

But it requires institutions to insist on reproducibility and to foster best practices that make it even easier for researchers to comply.

I get it that reproducibility can be quite hard for biology. But ML cannot be taken as an example of a hard problem.

I agree that docker is great. But docker solves only one of the problems mentioned above (env) and even that solution does not work for some teams that run their experiments in GPU clusters where docker is not supported.
Perhaps the core issue is that academia excels at being a textbook case of Goodhart's law. If/when reproducibility becomes a target, the academic system will likely make as bad a mess of it as it has of its current targets.
If you fail to reproduce some important research then I think that would absolutely get published. (see the recent superconductor drama)

So if you feel some impactful work is suspicious, I think disproving it would absolutely be incentivised.

If you show it's actually correct, well, then usually it's not that hard to push the envelope a bit further and say something new. That happens all the time.

Yes, but in the vast majority of cases, it's hard to tell just by reading a paper if there's been dishonesty somewhere in the pipeline.

Also, the LK-99 example is an exception, not the norm: the chances of receiving significant attention for a replication study are near zero in almost all other cases.

I just don't think it's really relevant. If the research is impactful, then it'll be replicated (at least in part) when the next person tries to build on the results. If it doesn't replicate then they'll probably end up discovering something new/different, and that'll lead to its own paper.

Even in the ideal world, you effectively almost never end up with a replication paper. Either it replicates and you add on your own novel research. Or it doesn't replicate and you discover something new

You can in theory end up with a super dull null result that disproves someone else's claim. But even then, when you set out on the project your aim is to add something new on top of what's already been done. This happens all the time.

It seems to me, instead of funding a new college or traditional research institute, some benefactor ought to fund a "research reproduction institute", dedicated to identifying and reproducing suspicious publications.
Sort of. Yes and no. There has to be a metric to assess researchers' performance. Otherwise we won't know what research is worthwhile. When the rules of the game are known, players will find ways to cheat, or at least bend the rules to their advantage.

So, for example, suppose negative results become as valuable as positive ones: well, they are easier to produce. They are also less valuable as stepping stones for further research. Given that, you'd still need to have a metric that compares publishing positive results to negative results. Even if you declare them to be equally important, the shared understanding will be that they aren't, and that one is more important than the other. And here we are, back to square one.

There are some minor things that can be done in the near future. For example, results produced with code must come with the code that produced these results. A lot of research bodies resist this because they want to commercialize their code, or their code may inadvertently contain the organization's secrets and therefore need more auditing... but, at the end of the day, it needs to be made clear that this is a necessary and unavoidable price to pay.

Data sharing is even more problematic. Besides confidentiality concerns, data is always a bargaining chip in the game of getting collaborators (and grants). Should it be made public, it loses its value to those who collected it. Right now, the trend is: if you managed to collect a worthwhile dataset, then you'll cover yourself head to foot with NDAs, contracts of all kinds, etc., and will sit on it, exploiting it for a series of papers. And if anyone wants to do research on the same subject, you will only invite them if they bring grants or equipment etc.

But you cannot really verify results w/o having the data available. Even if you have the code.

---

It's really sad to see how research is doing wrt programming, in part because of the above, but I don't think the programs outlined in OP will have a noticeable effect. They don't paint a convincing picture in terms of incentives, i.e. they don't answer the question of why researchers would want to do any of that RepRes and OS training. Even in computationally-heavy research today you often find that all the computation work is outsourced by the researchers and they themselves have no clue what their code is doing.

Above were all sorts of arguments for why the current approaches (or yours) are ineffective. But I don't claim to know what needs to be done.