by afandian 638 days ago
> Some will argue that we already have solutions for this

DataCite is the DOI registration agency for datasets. https://datacite.org/

They work very closely with Crossref, the DOI registration agency for scholarly works.

They also have a federated membership structure which includes data centres such as CERN. https://datacite.org/members/

For an example of current work to incentivise better connectedness of data citation, see Make Data Count: https://makedatacount.org/

It's not perfect, but there's a lot of work going on, both technical and in community coordination, to improve the sharing and connectedness of datasets.

(disclosure: former employee of Crossref, spent some time on DataCite technical steering group)


Solutions for the specific problems I mentioned do exist for some niches, but none of them solves the problem well across all niches, which is what I believe is necessary. What we need is for all datasets from scientific papers to be easily accessible and licensed like code.

I think the diversification is a strength, honestly.

CERN and high-energy physics have _massive_ datasets. Making them all available on-line isn't practical.

Other researchers may have one or two files that they want to cite as part of a paper.

Healthcare research may have confidential data for which there are specific types of access control required.

I don't think GitHub would be financially sustainable or scalable if it had to host millions of one-file repos, alongside repos that grow by terabytes per day, alongside those that hold highly sensitive data.

There are a lot of things that don't fit on GitHub either. Sometimes because it's closed source, sometimes because the data is too big, sometimes because parts of the data have legal restrictions on distribution and require the user to get it themselves from a different source.

The usual solution is to make a skeleton repo with only partial or no code, the real substance being a README that explains what the project is and gives instructions on how to use it. GitHub is a social network as well as a code warehouse in a way, and this comes with benefits. The same system for stars, issues, user groups, permissions etc. extends across all projects regardless of whether the code/data is actually hosted on GitHub. Something like this for science could be of huge benefit.

At the end of the day, we need scientific research to be reproducible. If you draw conclusions from a confidential dataset, how will people check whether what you are saying is true? You have to show your experiment to journals and publishers such as Nature or Elsevier in order to get recognition. I believe the standard should be that anyone can check, if they want. There could be some caveats, but I believe that, in most cases, scientific research should be reproducible, and the dataset used is very important for reproducibility.

You are making quite broad statements, and they don't seem to take into account the diversity of research and scholarly practice. A lot of what you suggest is happening already, but it's far from perfect. The existing solutions all have trade-offs (legal, cost, social, technological).

I think it would make for a stronger argument to acknowledge and identify the existing solutions and practice, and evaluate them against your criteria.

I’ve always thought of them as “official datasets” - is that right?

With GitHub anyone can just chuck something up there without much forethought, but not so much with DataCite, right?

There is a perception that DOIs signify something being 'official', whatever that means. But in the publishing ecosystem (e.g. Crossref), it's whatever the publisher decides to publish. There's a wide range of publishers who publish anything from books to peer-reviewed works to preprints.

Two examples of DataCite members are OSF.io and Figshare.com. They have very different models. Figshare lets you share and assign DOIs to arbitrary research files. OSF lets you version-control your research and assign a DOI to the project.
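
To make the "a DOI is just an identifier" point a bit more concrete, here is a minimal Python sketch of turning a bare DOI into its https://doi.org/ resolver link. It only checks the general shape (a prefix starting with "10.", a slash, then a suffix) and says nothing about who registered the DOI or what it points to. The function name doi_to_url and the DOI in the usage line are made up for illustration and are not real records.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ resolver URL for a bare DOI string.

    Hypothetical helper: it checks only the overall prefix/suffix shape,
    not whether the DOI is actually registered with any agency.
    """
    doi = doi.strip()
    # A DOI is a prefix (starting with "10.") and a suffix, joined by "/".
    prefix, sep, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not sep or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path; keep "/" as-is.
    return "https://doi.org/" + prefix + "/" + quote(suffix, safe="/")


# Made-up DOI, for illustration only.
print(doi_to_url("10.1234/example.5678"))
```

The same resolver handles DOIs from Crossref, DataCite and other agencies, which is part of why a DOI by itself tells you little about how "official" the thing behind it is.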