by bumby 523 days ago
>What people don't realize is that reproducing from scratch the algorithm is also very very efficient.

This is where we differ. Especially if the author shares neither the data nor the code, because you can never truly be sure whether it's a software bug or a data anomaly or a bad method or outright fraud. So you can end up burning tremendous amounts of time investigating all those avenues. That statement (as well as others about how trivial replication is) makes me think you don't actually try to replicate anything yourself.

>there is a contradiction in saying "people will study the code intensively" and "people will go faster because they don't have to write the code".

I never said "people will go faster" because they don't have to write the code. Maybe you're confusing me with another poster. You were the one who said sharing code is worthless because people can "click on the button and you get the same result". My point, and maybe this is where we differ, is that the ultimate goal is not to reproduce the exact same results. The goal I'm after is to apply the methodology to something else useful. That's why we share the work. When it doesn't seem to work, I want to go back to the original work to figure out why. The way you talk about the publication process tells me you don't do very much of this. Maybe that's because your work at CERN is limited in that regard, but when I read interesting research I want to apply it to different data that are relevant to the problems I'm trying to solve. This is the norm for those who aren't studying the replication crisis directly.

>I say "bad paper that turns out to have errors (involuntary or not) are anecdotal"

My answer was not conflating peer-review and code sharing and replication (although I do think they are related). My answer was to give you researchers who work in this area because their work shows it is far from anecdotal. My guess is you didn't bother to look it up because you've already made up your mind and can't be bothered.

>I ask you to give example where the replication crisis was avoided by sharing the data, you talk about bad papers that turns out to have errors

Because it's a bad question. A study that is replicated using the same data is "avoiding the replication crisis". Did you really want me to list studies that have been replicated? Go on Kaggle or Figshare or Genbank if you want examples of datasets that have been used (and replicated), like CORD-19 or NIH-dbGaP or World Values Survey or any number of other datasets. You can find plenty of published studies that use that data and try to replicate them yourself.

>how on hell CERN is not bursting with fire

The referenced authors talk about how physics is generally the most replicable. This is largely because they have the most controlled experimental setups. Other domains that do much worse in terms of replicability are hampered by messier systems, ethical considerations, etc. that limit the scientific process. In the larger scheme of things, physics is more of an anomaly and not a good basis to extrapolate to the state of affairs for science as a whole. I tend to think being in a bubble there has caused you to over-extrapolate and draw too strong a conclusion. (You should also review the HN guidelines that urge commenters to avoid using caps for emphasis.)

>"sharing the code...but it's not "the good practice""

I'm not sure if you think sharing a single unsourced quip is convincing but, your anecdotal discussion aside, lots of people disagree with you and your chemist friend. Enough so that it's become a more and more common practice (and even requirement in some journals) to share data and code. Maybe that's changed since your time at uni, and probably for the better.

Rolling eyes.

> Especially if the author shares neither the data or the code

What are you talking about? In this example, why do you invent they are not sharing the data? That's the whole point.

> A study that is replicated using the same data is "avoiding the replication crisis"

BULLSHIT. You can build confidence by redoing the experiment with the same data, but it is just ONE PART and it is NOT ENOUGH. If there is a statistical fluctuation in the data, both studies will conclude something false.

I have of course reproduced a lot of algorithms myself, without having the code. It's not complicated, the paper explains what you need to do (and please, if your problem is that the paper does not explain, then the problem is not about sharing the code, it's about the paper explaining badly).

And again, my argument is not "nobody shares data" (did you know that some studies also share code? Did you know that I have occasionally shared code? Because, as I've said before, it can be useful), but that "some don't share data and yet are still doing very well, whether on performance, on fraud detection or on replication".

For the rest, you are just saying "my anecdotal observations are better than yours".

But meanwhile, even Terence Tao does not say what you claim he says, so I'm sure you believe people agree with you, but it does not mean they do.

>Rolling eyes.

Please review and adhere to the HN guidelines before replying again.

>why do you invent they are not sharing the data?

Because you advocated that very point. You: "some data is better not to share too". The point in sharing is that I want to interrogate your data/code to see if it's biased or misrepresented or prone to error if it doesn't seem to work for the specialized problem I am trying to apply it to. When you don't share it and your result doesn't replicate, I'm left wondering "Is it because they have something unique in their dataset that doesn't generalize to my problem?"

>BULLSHIT.

Please review and adhere to the HN guidelines before replying again.

>It's not complicated

You can make this general claim about all papers based on your individual experience? I've already explained why your personal experience is probably not generalizable across all domains.

>you are just saying "my anecdotal observations are better than yours".

No, I'm saying the systematically studied, published, and replicated studies trump your anecdotal claims. I've given you some example authors, if you have an issue with their methods, delineate the problems explicitly rather than sharing weak anecdotes.

> Because you advocated that very point. You: "some data is better not to share too"

SOME data. SOME. You've concluded, incorrectly, that I was claiming that sharing data is never useful, which is not at all what I've said.

> You can make this general claim about all papers based on your individual experience?

What? Do you even understand basic logic? I'm saying that I've observed SOME papers where sharing the code did not help. I'm not saying sharing the code never helps (I've said that already). I'm just saying that people usually don't understand the real cause of the problem, and imagine that sharing the code will help, while in fact doing other things (for example being more precise in the explanation) will solve the problem without having to pay for the unblinding that sharing the code generates.

Sure, one reason I say that is because of my experience, though my observations are not at all limited to one field, as I've discussed the subject with many scientists. But another reason is that when I discuss the subject, the people who overestimate the gain of sharing the code really have difficulty understanding the disadvantages of sharing the code.

You yourself seem not to understand what we need for a good replication. Replication is supposed to demonstrate the result independently, so that we build up confidence in the conclusions. Rerunning with the same data or the same code is not enough, because it does not prove that the conclusions will remain valid if we try with other data or another implementation. When you understand that, only then do you understand that sharing the code has a price to pay.
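To make the point about rerunning on the same data concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not anyone's actual analysis: the helper names (`looks_significant`, `null_sample`, `replication_rates`), the sample size of 30 and the crude z-style test are all made up for the example. The "original study" is a sample with no real effect that happens to look significant by chance; the sketch then compares rerunning the same analysis on that same sample against repeating it on freshly drawn data.

```python
import random
import statistics


def looks_significant(sample, z=1.96):
    """Crude two-sided test: is the sample mean more than z standard errors from 0?"""
    standard_error = statistics.stdev(sample) / len(sample) ** 0.5
    return abs(statistics.fmean(sample)) > z * standard_error


def null_sample(rng, n=30):
    """Draw n points from a distribution whose true mean is exactly 0 (no real effect)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def replication_rates(seed=0, trials=1000):
    """Return (rate confirmed on the same data, rate confirmed on fresh data)."""
    rng = random.Random(seed)

    # Hypothetical "original study": keep drawing until chance alone
    # produces a false positive, i.e. a statistical fluctuation.
    original = null_sample(rng)
    while not looks_significant(original):
        original = null_sample(rng)

    # Rerunning the same analysis on the same data is deterministic,
    # so it "confirms" the false result every single time.
    same_data = sum(looks_significant(original) for _ in range(trials)) / trials

    # Independent replications collect fresh data from the same effect-free source.
    fresh_data = sum(looks_significant(null_sample(rng)) for _ in range(trials)) / trials
    return same_data, fresh_data


if __name__ == "__main__":
    same, fresh = replication_rates()
    print(f"confirmed on same data: {same:.0%}, confirmed on fresh data: {fresh:.0%}")
```

The same-data rate is 1 by construction, while the fresh-data rate should sit near the test's nominal false-positive rate. Only the second number says anything about whether the conclusion holds beyond the original dataset.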

By the way, this also explains why CERN is doing something that, according to you, has absolutely no reason to exist except for cheating. Of course, if that were the case, intellectually honest scientists would all ask CERN to cancel these policies. They don't, because there are real reasons why scientists may prefer in some cases to forbid sharing code and data (not just "I don't do it myself because I'm lazy", but "I don't do it because it's a specific rule, they explicitly say it's a bad thing to do it").

And, sure, maybe it is not everywhere. But it does not matter. It's a counter-example that demonstrates that your hypothesis does not work. If your hypothesis were true, what CERN does would not be possible: it would obviously be a bad move and would be attacked.

> I've given you some example authors, if you have an issue with their methods, delineate the problems explicitly rather than sharing weak anecdotes.

These studies do not conclude that sharing the code is a good solution. None of them contradicts what I say.

Of course, coming from someone who reads "some data is better not to share too" and concludes that it means "data is better to never be shared", or who did not understand Tao's point, I'm sure you are convinced they say that. They just don't.