Hacker News
by mechanical_fish 6036 days ago
A collection of raw data is full of systematic errors, accidental mistakes, misleading black swans, and false trails (some of which get followed for years before they finally turn out to be false). I've seen several talented, well-trained, and highly experienced scientists fool themselves for decades with their own raw data. That's why it is called raw. That's why you have to analyze data, over and over, until you can't stand it anymore, and only publish the last tiny fraction that comes out: Your best work, the stuff that you're confident in and prepared to stand behind. And that's why there's a lot more to science than just reading a lot of numbers off the front panel of your instrument and sticking them up on the web.

If I were a scientist in a controversial field, where every dropped decimal point, statistical anomaly, and speculative sentence (later to be disproved, and to make even its own author blush with the memory) was liable to be mined out of my notebooks and splashed all over the tabloids, I'd sure as hell refuse to release my raw data. Indeed, I might just decide not to release any data at all, but just switch to another field. That's obviously one of the goals of this campaign of intimidation.

4 comments

One of the core parts of modern science is reproducible results: anyone should be able to take the data, follow the methods used, and locate errors (or, in the case of experiments, see whether something is an anomaly). Without that, science is basically meaningless: one must rely on the word of a group of people for their conclusion, and it is essentially pointless to publish the method (as it's impossible for anyone to recreate the research).

You have to release data and methods that allow other people to recreate the research. (And, obviously, your colleagues are free to object that you haven't published enough, and to ask you for more.)

But that's not the same as releasing everything you ever write down to anyone who asks, which is what the original comment seemed to be suggesting.

The problem with your raw data is that, in the hands of an opponent, especially one who argues in bad faith, the word "raw" is quickly and easily filed off and it gets described as "your data", despite the fact that you threw it away and didn't publish it, presumably for a reason.

It's easy to make a scientist look ridiculous -- to a nonscientist -- by poking fun at their unpublished data, just as it's easy to make a great novelist look ridiculous by poking fun at their grocery lists, their kindergarten handwriting assignments, or their unpublished first drafts.

If one is unwilling to share their data and methods, then they should not participate in scientific research. The American Physical Society, for one, expects scientists to "Expose their ideas and results to independent testing and replication by others. This requires the open exchange of data, procedures and materials."* In this case, reproduction is not an option—even by the original authors.

* http://www.aps.org/policy/statements/99_6.cfm

"A collection of raw data is full of systematic errors, accidental mistakes, misleading black swans, and false trails (some of which get followed for years before they finally turn out to be false). I've seen several talented, well-trained, and highly experienced scientists fool themselves for decades with their own raw data."

I've done a lot of data cleaning over the years, some of it geological. Yeah, there are usually some problems. But I've never had one that I couldn't resolve. Your post implicitly assumes that only one or the other can be published. Not true. As a condition of receiving grant money, researchers should document and publish ALL raw data and, in addition, any cleaned data. The interwebs still has a few bits left to hold the extra.

Drug companies put up with that and more.

If you're going to insist that we spend hundreds of billions of dollars because of your conclusions ....