Hacker News
by dekhn 1091 days ago
I talked with NIH program managers and leadership about this quite some time ago and tried to convince them to explicitly fund long-term data hosting for science publications. The budgeting got complicated quickly. Paying for a bucket with 25TB isn't a huge expense by itself, but the questions pile up. Does NIH cut a deal with AWS, or GCP, or Azure, or (shudder) Oracle to get discounts (these can halve storage costs, or cut them even further)? Does the data go in long-term storage or live storage? And that doesn't even include egress, which in my experience gets expensive fast when you have lots of downloaders. Does NIH cover the distribution costs, or do requesters pay? Do people running high-throughput jobs in clusters near the S3 storage get to work directly against the data, or do they make a duplicate?
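To make the storage-versus-egress point concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function name `monthly_hosting_cost`, the per-GB prices, the download count, and the choice to apply the negotiated discount to storage only. None of these are real vendor quotes; plug in actual numbers before drawing conclusions.

```python
def monthly_hosting_cost(size_tb, storage_price_per_gb, egress_price_per_gb,
                         full_downloads_per_month, storage_discount=0.0):
    """Return (storage_cost, egress_cost) for one month, in the same currency as the prices.

    Assumptions (hypothetical, not from any provider's price list):
    - 1 TB is treated as 1000 GB
    - the negotiated discount applies to storage only, not to egress
    - every download pulls the full dataset out of the cloud
    """
    size_gb = size_tb * 1000
    storage_cost = size_gb * storage_price_per_gb * (1.0 - storage_discount)
    egress_cost = size_gb * full_downloads_per_month * egress_price_per_gb
    return storage_cost, egress_cost


# Placeholder inputs: a 25 TB bucket, made-up prices, 10 full downloads a month,
# and a discount that halves the storage bill.
storage, egress = monthly_hosting_cost(
    size_tb=25,
    storage_price_per_gb=0.02,   # hypothetical price per GB-month
    egress_price_per_gb=0.05,    # hypothetical price per GB transferred out
    full_downloads_per_month=10,
    storage_discount=0.5,
)
print(f"storage: {storage:.2f}, egress: {egress:.2f}")
```

With placeholder numbers like these, the egress term scales with the number of downloaders while the storage term stays flat, which is why the "who pays for distribution" question matters more than the bucket itself.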

Their response in this instance was to fund SRA-in-the-cloud and other ventures, such as having PIs in well-connected locations like U.Chicago rent datacenter space in an exchange, negotiate very cheap hosting and bandwidth, and then give people access to compute and data either there, or in AWS (https://www.uchicagomedicine.org/forefront/news/university-o...)

This still doesn't address the "high quality metadata problem", which IMHO is the NP-complete problem of biology.

1 comment

The way that grants and funding work right now, from what I've seen, is incompatible with this kind of thing. It's by definition a maintenance project, and funders do not want to pay for maintenance projects that outlast the grant's timeline, which is five years or so. No PI is going to win the Nobel prize for building some kind of janitorial service for scientists. The project also has zero chance of curing cancer by itself. The PI who eventually cures every cancer might use it, but they will be PI on other grants, not that one.