by adminprof
1427 days ago
I think you're missing two fatal problems in this "publish all raw data and code" mindset. I don't think the desire for commercialization is high on the list of fatal problems preventing people from publishing data+software.
1) How do you handle research in domains where the data is about people, so that releasing it harms their privacy? Healthcare, web activity, finances. Sure, you can try to anonymize it, but anonymization is imperfect, and even fully anonymized data can be joined to other data sources to re-identify people; k-anonymity only works in a closed ecosystem. If we end up in a world where search engine companies don't publish their research because of this constraint, that seems worse than the current system.
2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?
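To make the join concern in point 1 concrete, here is a minimal Python sketch of a linkage attack. Everything in it is an assumption for illustration: the records and names are made up, and the choice of `zip`, `birth_year`, and `sex` as quasi-identifiers is hypothetical. It only shows the mechanism, not any real dataset.

```python
from collections import Counter

# Hypothetical quasi-identifiers: fields that are not names but still narrow people down.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

# Made-up "anonymized" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "02138", "birth_year": 1950, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02138", "birth_year": 1950, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1971, "sex": "M", "diagnosis": "hypertension"},
]

# Made-up outside source that carries names alongside the same fields.
outside = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1950, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1971, "sex": "M"},
]


def quasi_key(row):
    """Tuple of quasi-identifier values used to group and join rows."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def k_anonymity(rows):
    """Size of the smallest group of rows sharing the same quasi-identifiers."""
    return min(Counter(quasi_key(row) for row in rows).values())


def link(released_rows, outside_rows):
    """Return (name, diagnosis) pairs where a named person matches exactly one released row."""
    released_groups = {}
    for row in released_rows:
        released_groups.setdefault(quasi_key(row), []).append(row)

    matches = []
    for person in outside_rows:
        candidates = released_groups.get(quasi_key(person), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((person["name"], candidates[0]["diagnosis"]))
    return matches
```

In this toy data `k_anonymity(released)` is 1, and `link(released, outside)` ties one named person to a diagnosis. The release can be made k-anonymous for some larger k, but that guarantee is computed only over the released table; it says nothing about what other tables exist outside it.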
|
that's an interesting problem that i had not thought about.
i think this may be less a technical problem than an ethical one. under the open data approach, if you want to study humans you would probably need to get express informed consent stating that their data will be public and that it could be linked back to them.
> 2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?
one defines it by building a specialized system for the purpose of reproducible research computing. i would envision this as a sort of distributed abstract virtual machine and source code packaging standard, where the entire environment that was used to process the data is packaged and shipped with the paper. the success of this system would depend on the designers getting it right, such that researchers _wouldn't_ have to worry about weird systems-level kludges like docker. since it would behave as a hermetically sealed virtual machine (or cluster of virtual machines), there would be no concerns about bitrot unless one needed to make changes or build a new image based on an existing one.
the good news is that most data processing and simulation code is pretty well suited to this sort of paradigm. often it just does cpu/gpu computations and file i/o. internet connectivity or outside dependencies are pretty much out of scope.
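as a very rough illustration of one small piece of this (not the sealed virtual machine itself), here is a python sketch that records checksums of the files a paper ships with, plus basic interpreter and platform info, so that someone else can at least check they are re-running against the same inputs. the function names and the manifest layout are hypothetical, not an existing standard.

```python
import hashlib
import platform
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file, read in chunks so large data files are fine."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record the interpreter, the platform, and a checksum for every shipped file."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "files": {str(p): sha256_of(p) for p in sorted(Path(p) for p in paths)},
    }


def changed_files(manifest):
    """List files that are missing or whose contents no longer match the manifest."""
    return [
        name
        for name, expected in manifest["files"].items()
        if not Path(name).exists() or sha256_of(name) != expected
    ]
```

the manifest is a plain dict, so it can be dumped to json and shipped next to the data and code. an empty list from `changed_files` only says the files are byte-identical; it does not say the code still runs, which is the part the sealed environment would have to solve.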
i don't think it's hard... there just hasn't been the will or financial backing to build this out properly, and so it does not exist.