Hacker News
by jkh1 1092 days ago
Can I put my 25 TB microscopy image data set on GitHub? Will they host it for free indefinitely? In the life sciences, there are dedicated public repositories (databases) where the data is hosted (free to the researcher), catalogued, standardized and curated to some extent. These repositories are searchable and often cross-reference each other, so you can find a data set even if you didn't know about it before. Putting data all over the internet in dumps like Zenodo, Dryad and the like is just not very useful. Advertising your work is probably good for your career, but this is not what makes your data and work useful to others. What does is how easy you make it for others to understand, access and combine your data with their own data. This means providing data and metadata using open community standards (there are already a bunch of these in the life sciences, even if there are gaps in coverage).
4 comments

> "Can I put my 25 TB microscopy image data set on GitHub? Will they host it for free indefinitely? In the life sciences, there are dedicated public repositories (databases) where the data is hosted (free to the researcher), catalogued, standardized and curated to some extent."

Yes, this is better. If you are a scientist and you find an amazing new cancer gene, then you should probably put it in the actual gene repository and not randomly on GitHub. Because you're this hypothetical scientist, you probably already know the exact niche place that is most appropriate for that. Like GenBank, or whichever one had the scandal about pre-COVID coronavirus genes being mysteriously deleted from it.

I talked with NIH program managers and leadership about this quite some time ago and tried to convince them to explicitly fund long-term data hosting for science publications. The budgeting got complicated quickly, even though paying for a bucket with 25 TB isn't a huge expense. Does NIH cut a deal with AWS, or GCP, or Azure, or (shudder) Oracle so they get discounts (these can halve storage costs, or reduce them even more)? Does it go in long-term storage or live storage? And that doesn't even include egress, which in my experience gets expensive fast when you have lots of downloaders. Does NIH cover the distribution costs, or do requesters pay? Do people running high-throughput jobs in clusters near the S3 storage get to work directly against the data, or do they make a duplicate?
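To make the storage-versus-egress point concrete, here is a rough back-of-the-envelope sketch. The per-GB prices and the `monthly_cost_usd` helper are illustrative assumptions, not quotes from any provider or anything NIH actually negotiated; plug in real numbers before drawing conclusions.

```python
# Back-of-the-envelope hosting cost sketch.
# All prices below are ASSUMED placeholders, not real provider quotes.

def monthly_cost_usd(dataset_tb, full_downloads_per_month,
                     storage_usd_per_gb_month=0.02,  # assumed storage price
                     egress_usd_per_gb=0.09,         # assumed egress price
                     storage_discount=0.0):          # negotiated discount, 0..1
    """Estimate monthly storage and egress cost for one hosted dataset."""
    gb = dataset_tb * 1000  # decimal terabytes to gigabytes
    # The discount is applied to storage only, as in the comment above.
    storage = gb * storage_usd_per_gb_month * (1 - storage_discount)
    # Each full download moves the whole dataset out of the provider.
    egress = gb * full_downloads_per_month * egress_usd_per_gb
    return {"storage": storage, "egress": egress, "total": storage + egress}


if __name__ == "__main__":
    for downloads in (0, 1, 10):
        cost = monthly_cost_usd(25, downloads)
        print(downloads, "downloads/month:", cost)
```

Under these assumed prices, storing 25 TB comes to roughly $500 a month, while ten full downloads in a month cost roughly $22,500 in egress. Halving the storage price barely moves the total once people actually start downloading, which is why who pays for distribution matters more than the bucket itself.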

Their response in this instance was to fund SRA-in-the-cloud and other ventures, such as having PIs in well-connected locations like U.Chicago rent datacenter space in an exchange, negotiate very cheap hosting and bandwidth, and then give people access to compute and data either there, or in AWS (https://www.uchicagomedicine.org/forefront/news/university-o...)

This still doesn't address the "high quality metadata problem", which IMHO is the NP-complete problem of biology.

The way that grants and funding work right now, from what I've seen, is incompatible with this kind of thing. It's by definition a maintenance project, and funders do not want to fund maintenance projects that run past the timeline of the grant, like five years or whatever. No PI is going to win the Nobel prize for making some kind of janitorial service for scientists. Also, the project has zero chance of curing cancer by itself. The PI who eventually cures every cancer might use it, but they will be PI on other grants, not that grant.

BitTorrent could be an interesting fallback distribution method. There's already Academic Torrents.