by adminprof 1427 days ago
I think you're missing two fatal problems in this "publish all raw data and code" mindset. I don't think the desire of commercialization is high on the list of fatal problems preventing people from publishing data+software.

1) How do you handle research in domains where the data is about people, so that releasing it harms their privacy? Healthcare, web activity, finances. Sure, you can try to anonymize it, but anonymization is imperfect, and even fully anonymized data can be joined to other data sources to re-identify people; k-anonymity only works in a closed ecosystem. If we live in a world where search engine companies don't publish their research because of this constraint, that seems worse than the current system.

2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?
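
To make point 1 concrete, here is a minimal sketch of the join attack. Everything in it is made up for illustration: the rows, the column names, and the choice of zip code, birth year and sex as quasi-identifiers are all assumptions, not taken from any real dataset.

```python
from collections import Counter

# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "02138", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02138", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1982, "sex": "F", "diagnosis": "depression"},
    {"zip": "02139", "birth_year": 1982, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical outside source that carries names and the same fields.
public = [
    {"name": "Alice", "zip": "02138", "birth_year": 1975, "sex": "F"},
    {"name": "Bob", "zip": "02138", "birth_year": 1975, "sex": "M"},
    {"name": "Carol", "zip": "02139", "birth_year": 1982, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(row):
    """Project a row onto its quasi-identifier values."""
    return tuple(row[c] for c in QUASI_IDENTIFIERS)


def k_anonymity(rows):
    """Size of the smallest group of rows sharing quasi-identifier values."""
    return min(Counter(key(r) for r in rows).values())


def link(released_rows, public_rows):
    """Return {name: diagnosis} for people who match exactly one released row."""
    groups = {}
    for r in released_rows:
        groups.setdefault(key(r), []).append(r)
    hits = {}
    for p in public_rows:
        matches = groups.get(key(p), [])
        if len(matches) == 1:  # a unique match means re-identification
            hits[p["name"]] = matches[0]["diagnosis"]
    return hits


k = k_anonymity(released)
reidentified = link(released, public)
print(k, reidentified)
```

Here the release is only 1-anonymous, so Alice and Bob are re-identified by a plain join, while Carol is protected only because another row happens to share her values. Nothing in the released file tells you which outside tables exist to join against, which is the "closed ecosystem" caveat.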

3 comments

> 1) How do you handle research in domains where the data is about people, so that releasing it harms their privacy?

that's an interesting problem that i have not thought about.

i think maybe that this is not a technical problem, but more an ethical one. under the open data approach, if you want to study humans you probably would need to get express informed consent that indicates that their data will be public and that it could be linked back to them.

> 2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?

one defines it by building a specialized system for the purpose of reproducible research computing. i would envision this as a sort of distributed abstract virtual machine and source code packaging standard where the entire environment that was used to process the data is packaged and shipped with the paper. the success of this system would depend on the designers getting it right such that researchers _wouldn't_ have to worry about weird systems level kludges like docker. as it would behave as a hermetically sealed virtual machine (or cluster of virtual machines), there would be no concerns about bitrot unless one needed to make changes or build a new image based on an existing one.
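
as a very rough sketch of the bookkeeping such a packaging standard might start from: record the interpreter, the platform and a checksum of every input, then ship that manifest with the paper. the manifest layout and the function names below are hypothetical, not an existing standard, and this is nowhere near the sealed virtual machine described above.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Hypothetical manifest: environment info plus input checksums."""
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "inputs": {Path(p).name: sha256_of(p) for p in paths},
    }


def verify_inputs(manifest, paths):
    """Return the names of inputs whose current hash differs from the manifest."""
    current = {Path(p).name: sha256_of(p) for p in paths}
    return sorted(
        name for name, digest in manifest["inputs"].items()
        if current.get(name) != digest
    )


# Demo on a throwaway file standing in for a raw dataset.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "raw.csv"
    data.write_text("x,y\n1,2\n")
    manifest = build_manifest([data])
    unchanged = verify_inputs(manifest, [data])
    data.write_text("x,y\n1,3\n")  # simulate silent drift in the data
    changed = verify_inputs(manifest, [data])

print(json.dumps(manifest, indent=2))
print(unchanged, changed)
```

a manifest like this only detects drift; it does nothing to preserve the environment itself, which is the part the sealed image would have to solve.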

the good news is that most data processing and simulation code is pretty well suited to this sort of paradigm. often it just does cpu/gpu computations and file i/o. internet connectivity or outside dependencies are pretty much out of scope.

i don't think it's hard... there just hasn't been the will or financial backing to build this out right and therefore it does not exist.

> i think maybe that this is not a technical problem, but more an ethical one. under the open data approach, if you want to study humans you probably would need to get express informed consent that indicates that their data will be public and that it could be linked back to them.

As someone who wants science to advance, I want highly trusted researchers to be able to do studies that involve my private, personal data, that I would not consent to being public and linked back to me.

It is highly important to me that we allow these studies to not use open data.

A great example of this is the US college scorecard, which uses very private tax returns to measure how much college degrees and majors contribute to income (not the only value of college education, but certainly an important one):

https://collegescorecard.ed.gov/

Only a high degree of trust allowed results based on such extremely private information to be published, and I think that makes for a better world. I am pro-open data, but research on non-open data should absolutely exist.

For instance, should any research about mental health for transgender people be abolished? Because anything on that subject is not going to be open, or at the least those who would be open to their data being public are probably a non-representative subset.

> get express informed consent that indicates that their data will be public and that it could be linked back to them.

10~20 years ago I could see it. Nowadays it’s a tough ask that would severely limit the number of people participating. This could also steer away most minority groups, which would make the research not only limited, but also misleading (we’d still draw conclusions from it, and decide policies accordingly, even though they come from grossly biased participant pools).

Aside from just the public aspect of having one’s data in the open, there are also second/third order discoveries that would happen from there (e.g. knowing someone’s cooking habits could be enough to deduce overall health status, potentially chronic illness, ethnicity/religion, relationship status etc.)

It does exist. It's called GNU Guix.

> 2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?

This is always a problem even with some of the most open scientific code.

Requiring that the code be published, and perhaps having a peer reviewer run it just once, with a bit of support, to ensure that the submitters aren't completely bullshitting before the paper is approved for publication, might be a good start.
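
A minimal sketch of what that one-time reviewer check could look like, assuming (hypothetically) that the submitters publish a SHA-256 of the script's expected standard output alongside the code. The function name and the checksum convention are my own invention, not an existing journal workflow.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def rerun_matches(script, expected_sha256, timeout=60):
    """Run `script` with the current interpreter and compare a hash of its
    standard output against the checksum claimed by the submitters."""
    result = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True,
        text=True,  # text mode normalizes line endings across platforms
        timeout=timeout,
        check=False,
    )
    if result.returncode != 0:
        return False  # the code did not even run to completion
    actual = hashlib.sha256(result.stdout.encode("utf-8")).hexdigest()
    return actual == expected_sha256


# Demo with a throwaway script standing in for the submitted analysis.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "analysis.py"
    script.write_text("print(sum(range(10)))\n")
    claimed = hashlib.sha256(b"45\n").hexdigest()
    ok = rerun_matches(script, claimed)
    bad = rerun_matches(script, hashlib.sha256(b"46\n").hexdigest())

print(ok, bad)
```

An exact hash match only makes sense for deterministic output. Anything involving random seeds or floating-point differences across hardware would need a tolerance-based comparison instead.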

From my experience in the digital health sector, concern for privacy is always the reason given for not sharing anything valuable and/or useful to others. But it's just a convenient way of hiding the 'desire of commercialisation'.

this is also true, and it also runs within science itself. if someone spends two years collecting some data that is very hard to collect and it has a few papers' worth of insights within it, they're going to want to keep that data private until they can get those papers out themselves, lest someone else come along, download their data and scoop them before they have a chance to see the fruits of their hard labor.

while it's not great for science at large, i don't blame them either.

It's solvable if publishing the dataset counts as a paper, and citations of the dataset, which should be required, count as citations toward e.g. tenure.

For example, ImageNet for machine learning is a very expensive and difficult data set to produce that has resulted in revolutionary advances in machine learning. And people build models on it, cite their results as evidence their models are good, and cite the paper.

This is an interesting idea. Although I am afraid that publishing a dataset, even a good one, will not be considered "real science" by our (broken) institutions.

You have a valid point here. It's probably utopian, but to me the only reasonable answer to this is to acknowledge that science is a collective process. Of course, this goes against the (stupid) idea that some extremely deserving geniuses are the ones that make science...