by pocketsand 895 days ago

Things are complicated.

To be fair: Everyone I've worked closely with in research has gone above and beyond not to cut corners and produce high quality data and research.

What I have in mind here is a situation where people are actually quite careful but can still end up in a place where they don't know what happened because they don't have good systems for creating datasets and storing code.

For example, graduate students are not always taught to work in a reproducible way. It has definitely gotten better from what I can see, but it used to be normal for people to take source data and work it into its final form through many steps, not all of them reproducible. E.g., data comes in from a secondary source or other provider. It gets cleaned. That file gets saved as something like "clean data 011234.csv".

More work is done, it gets saved again.

Time passes, things are revisited, and a handful of files exist that, with some care, could likely lead from point A to point B. But the exact process, to say nothing of the dozens, sometimes hundreds, of small data preparation decisions, gets lost to memory.

Code doesn't go in version control. People get new computers. USBs get lost. Universities migrate to new data systems and so on.

All the while, these students and researchers were very careful while doing the work. They were just never trained to use good version control and pipeline processes. They basically do what they do with the papers they write: save and back up while working through the paper, then move on when it's done.

This is made worse when data is proprietary or not legally shareable.

So people aren't necessarily being shoddy or doing bad work, they're just not using good systems.
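To make "good systems" a bit more concrete, here is a minimal sketch of what one scripted cleaning step could look like, instead of a hand-edited "clean data" file. It is an illustration only: the function names, the cleaning rules (strip whitespace, drop rows with an empty field), and the manifest layout are all hypothetical, and it assumes a well-formed rectangular CSV. The idea is that the script and the manifest go in version control, so the path from raw file to cleaned file can be rerun and checked later.

```python
import csv
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def clean_rows(rows):
    """Hypothetical cleaning rules: strip whitespace, drop rows with any empty field."""
    for row in rows:
        cleaned = {key: (value or "").strip() for key, value in row.items()}
        if all(cleaned.values()):
            yield cleaned


def run_cleaning_step(raw_path: Path, out_path: Path, manifest_path: Path) -> dict:
    """Clean raw_path into out_path and record input/output hashes in a manifest."""
    with raw_path.open(newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(clean_rows(reader))

    with out_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # The manifest ties this exact output to this exact input (hypothetical layout).
    manifest = {
        "step": "clean",
        "input": raw_path.name,
        "input_sha256": sha256_of(raw_path),
        "output": out_path.name,
        "output_sha256": sha256_of(out_path),
        "rows_kept": len(rows),
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Rerunning the step on the same raw file should produce the same hashes, which is a cheap way to confirm later that the cleaned file really came from the source data you think it did. Every small cleaning decision lives in `clean_rows` rather than in someone's memory.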

1 comment

lusus_naturae 895 days ago

> So people aren't necessarily being shoddy or doing bad work, they're just not using good systems.

Agreed. I think there isn’t an incentive to do this because reproducibility takes a back seat to so many other concerns. Unless PIs are told that their publication chances depend on reproducibility, this isn’t going to change.
