We need to build the GitHub of scientific data

> Some will argue that we already have solutions for this

DataCite is the DOI registration agency for datasets. https://datacite.org/

They work very closely with Crossref, the DOI registration agency for scholarly works.

They also have a federated membership structure which includes data centres such as CERN etc. https://datacite.org/members/

For an example of current work to incentivise better-connected data citation, see Make Data Count: https://makedatacount.org/

It's not perfect, but there's a lot of work going on, both technical and in community coordination, to improve the sharing and connectedness of datasets.

(disclosure: former employee of Crossref, spent some time on DataCite technical steering group)

Solutions for specific problems I mentioned do exist for niches. But none of them can solve it well for all niches, which is what I believe is necessary. What we need is for all datasets from scientific papers to be easily accessible and licensed like code.

I think the diversification is a strength, honestly.

CERN and high-energy physics have _massive_ datasets. Making them all available online isn't practical.

Other researchers may have one or two files that they want to cite as part of a paper.

Healthcare research may have confidential data for which there are specific types of access control required.

I don't think GitHub would be financially sustainable or scalable if it had to host millions of one-file repos, alongside repos that grow by terabytes per day, alongside those that hold highly sensitive data.

Asraelite 637 days ago

There are a lot of things that don't fit on GitHub either. Sometimes because it's closed source, sometimes because the data is too big, sometimes because parts of the data have legal restrictions on distribution and require the user to get it themselves from a different source.

The usual solution is to make a skeleton repo with only partial or no code, the real substance being a README that explains what the project is and instructions on how to use it. GitHub is a social network as well as a code warehouse in a way, and this comes with benefits. The same system for stars, issues, user groups, permissions etc. extends across all projects regardless of whether the code/data is actually hosted on GitHub. Something like this for science could be of huge benefit.

At the end of the day, we need scientific research to be reproducible. If you are using some confidential dataset to draw conclusions, how will people check whether what you are saying is true or not? You have to show your experiment to journals like Nature or publishers like Elsevier in order to get recognition. I believe the standard should be that anyone can check, if they want. There could be some caveats, but I believe that, in most cases, scientific research should be reproducible, and the dataset used is very important for reproducibility.

You are making quite broad statements, and they don't seem to take into account the diversity of research and scholarly practice. A lot of what you suggest is happening already, but it's far from perfect. The existing solutions all have trade-offs (legal, cost, social, technological).

I think it would make for a stronger argument to acknowledge and identify the existing solutions and practice, and evaluate them against your criteria.

yawnxyz 638 days ago

I’ve always thought of them as “official datasets” - is that right?

With GitHub anyone can just chuck something up there without much forethought, but not so much with DataCite, right?

There is a perception that DOIs signify something being 'official', whatever that means. But in the publishing ecosystem (e.g. Crossref), it's whatever the publisher decides to publish. There's a wide range of publishers who publish anything from books to peer-reviewed works to preprints.

Two examples of DataCite members are OSF.io and Figshare.com. They have very different models. Figshare lets you share and assign DOIs to arbitrary research files. OSF lets you version-control your research and assign a DOI to the project.

yawnxyz 638 days ago

I’ve been a bioinformatics engineer for a while.

Publishing is narrative building, and it wades through lots and lots of experimental data. Most of it isn’t super interesting.

It’s like this: say you record everything that’s ever been said around you for a few weeks. Lots of audio.

Then you set out to tell a story about how those weeks have changed you. You find quotes and ideas that shape your narrative. And let’s say you have a good one! And you publish it, and people love your story. You have a few polished sound bites. They support your story.

What about all the other audio? Some would say that’s valuable data, others would say it’s noise. It would also potentially take 10x the work to clean and publish, and without a narrative tied around it, you would have a hard time making it useful. Most of it would be literally noise.

(Unless you’re Diddy and store all the tapes and then one day the FBI comes knocking. Unfortunately the NIH and FDA don’t have their own federal enforcement arm)

You don't have to post the noise. If you are publishing a paper, you already have a solid experiment in place. What is necessary is a way to reproduce that research, and the final dataset used is an important piece of the puzzle. Of course, if the changelog of the data exists, that might be useful, just to see if the authors are modifying the data to cherry-pick the results they are publishing.

freitzzz 638 days ago

Isn’t Dolt one of the initiatives for this?

https://www.dolthub.com/

Also, not so long ago, I saw a post on HN sharing a platform that does just this but for hospital data. Can’t find it, though.

There could be field-specific ones like DoltHub, but what I believe is needed is one place for datasets from all fields. GitHub isn't field-specific. There is no separate GitHub for hospitals, physics, AI, etc. It covers any field.

mixeden 638 days ago

I'm just wondering why this data can't be hosted on HuggingFace?

HuggingFace isn't meant for all scientific data; it mostly hosts datasets for a niche. They do an excellent job though.