by a-dub 1441 days ago
> 1) How do you handle research in domains where the data is about people, so that releasing it harms their privacy?

that's an interesting problem that i have not thought about.

i think maybe that this is not a technical problem, but more an ethical one. under the open data approach, if you want to study humans you probably would need to get express informed consent that indicates that their data will be public and that it could be linked back to them.

> 2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?

one defines it by building a specialized system for the purpose of reproducible research computing. i would envision this as a sort of distributed abstract virtual machine and source code packaging standard where the entire environment that was used to process the data is packaged and shipped with the paper. the success of this system would depend on the designers getting it right such that researchers _wouldn't_ have to worry about weird systems level kludges like docker. as it would behave as a hermetically sealed virtual machine (or cluster of virtual machines), there would be no concerns about bitrot unless one needed to make changes or build a new image based on an existing one.

the good news is that most data processing and simulation code is pretty well suited to this sort of paradigm. often it just does cpu/gpu computations and file i/o. internet connectivity or outside dependencies are pretty much out of scope.
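as a rough sketch of the bookkeeping such a system would need (this is not an existing tool; `build_manifest`, `verify_outputs` and the manifest layout are made up for illustration), a run that only does computation and file i/o can be pinned down by hashing its inputs and outputs and recording the interpreter it ran under:

```python
import hashlib
import platform
import sys
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(inputs, outputs):
    """Record the environment and the digests of every input and output file.

    Hypothetical layout: a real packaging standard would also have to pin
    the full dependency closure, not just the interpreter version.
    """
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "inputs": {str(p): sha256_of(Path(p)) for p in inputs},
        "outputs": {str(p): sha256_of(Path(p)) for p in outputs},
    }


def verify_outputs(manifest):
    """Return the output paths whose current digest differs from the manifest."""
    return [
        path
        for path, digest in manifest["outputs"].items()
        if sha256_of(Path(path)) != digest
    ]
```

a reviewer would re-run the packaged code and call `verify_outputs` on the shipped manifest; an empty list means the outputs were reproduced bit for bit. that only works if the computation is deterministic, which is an assumption here.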

i don't think it's hard... there just hasn't been the will or financial backing to build this out right and therefore it does not exist.

3 comments

> i think maybe that this is not a technical problem, but more an ethical one. under the open data approach, if you want to study humans you probably would need to get express informed consent that indicates that their data will be public and that it could be linked back to them.

As someone who wants science to advance, I want highly trusted researchers to be able to do studies that involve my private, personal data, that I would not consent to being public and linked back to me.

It is highly important to me that we allow these studies to not use open data.

A great example of this is the US college scorecard, which uses very private tax returns to measure how much college degrees and majors contribute to income (not the only value of college education, but certainly an important one):

https://collegescorecard.ed.gov/

Only a high degree of trust allowed data derived from such extremely private information to be published, and I think that makes for a better world. I am pro-open data, but research on non-open data should absolutely exist.

For instance, should all research about mental health for transgender people be abolished? Anything on that subject is not going to be open, or at the least those who would be open to their data being public are probably a non-representative subset.

> get express informed consent that indicates that their data will be public and that it could be linked back to them.

10~20 years ago I could see it. Nowadays it’s a tough ask that would severely limit the number of people participating. This could also steer away most minority groups, which would make the research not only limited, but also misleading (we’d still draw conclusions from it, and decide policies accordingly, even though it comes from grossly biased participant pools).

Aside from just the public aspect of having one’s data in the open, there are also second- and third-order discoveries that would follow from it (e.g. knowing someone’s cooking habits could be enough to deduce overall health status, potential chronic illness, ethnicity/religion, relationship status, etc.).

It does exist. It's called GNU Guix.