by lusus_naturae
890 days ago
> Anyone who's advised students or asked even presenting researchers such questions know that often people will literally not know what happened to all their data.

I am sorry that’s been your experience, maybe it varies by field and quality of research? Most people I’ve questioned have provided reasonable answers to their findings. I don’t understand why anything needs to be assumed in bad faith or shoddily done. It’s all a bit Dunning-Kruger to me where everyone assumes that everyone else is doing shoddy or bad work.
To be fair, everyone I've worked closely with in research has gone above and beyond not to cut corners and to produce high-quality data and research.
What I have in mind here is a situation where people are actually quite careful but can still end up in a place where they don't know what happened because they don't have good systems for creating datasets and storing code.
For example, graduate students are not always taught to work in a reproducible way. It's definitely gotten better from what I can see, but it was normal for people to get source data and work it into its final form in a lot of different steps, not all of them reproducible. E.g., data comes in from a secondary source or other provider. It gets cleaned. That file gets saved as something like "clean data 011234.csv".
More work is done, it gets saved again.
Time passes, things are revisited, and a handful of files exist that, with some care, could probably lead from point A to point B. But the exact process, to say nothing of the dozens, sometimes hundreds, of small data preparation decisions, gets lost to memory.
Code doesn't go in version control. People get new computers. USBs get lost. Universities migrate to new data systems and so on.
All the while, these students and researchers were very careful while doing the work. They were just never trained to use good version control and pipeline processes. They basically do what they do with the papers they write: save and back up while working through the paper, then move on when it's done.
This is made worse when data is proprietary or not legally shareable.
So people aren't necessarily being shoddy or doing bad work, they're just not using good systems.
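To make "good systems" a bit more concrete, here is a rough sketch of what a scripted cleaning step could look like, as opposed to hand-editing a file and saving it under a new name. Everything in it is an assumption for illustration: the `value` column, the two cleaning rules, the function names `clean` and `sha256_of`, and the `.provenance.json` sidecar are all hypothetical, not anyone's actual workflow. The idea is only that the decisions live in code that can go into version control, and that each output is tied back to the exact input by a hash.

```python
import csv
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def clean(raw_path: Path, out_path: Path) -> dict:
    """Apply hypothetical cleaning rules and write a provenance record.

    The rules here are placeholders; the point is that they are written
    down in code rather than applied by hand.
    """
    decisions = [
        "strip whitespace from every field",
        "drop rows with an empty 'value' field",
    ]

    with raw_path.open(newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        # Missing fields come back as None, so fall back to "" before stripping.
        rows = [
            {name: (row.get(name) or "").strip() for name in fieldnames}
            for row in reader
        ]

    kept = [row for row in rows if row.get("value")]

    with out_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)

    # Sidecar file linking this output to the exact input it came from.
    record = {
        "input": raw_path.name,
        "input_sha256": sha256_of(raw_path),
        "output": out_path.name,
        "output_sha256": sha256_of(out_path),
        "rows_in": len(rows),
        "rows_out": len(kept),
        "decisions": decisions,
    }
    sidecar = out_path.parent / (out_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return record
```

Calling something like `clean(Path("raw.csv"), Path("clean.csv"))` would write `clean.csv` plus `clean.csv.provenance.json` next to it. If the script itself is committed, then years later the question "what happened to the data?" has an answer that doesn't depend on anyone's memory. It doesn't solve the proprietary-data problem, but the script and the hashes can usually still be shared even when the data can't.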