by akarve 2455 days ago
Oh I would love to get the UH Cancer Center data into Quilt! Do you happen to have an S3 bucket with that data live? If the bucket is publicly permissioned it should "just work." We can talk about indexing the data for search. We are comfortable in the TB-PB range :)

I will look more closely at Ohayo.

> Do you happen to have an S3 bucket with that data live?

No. However, I'm helping start the Data Curation Core at the AIPHI here (https://aiphi.shepherdresearchlab.org/). Our intent is to be a one-stop shop for all medical data in Hawaii. We don't yet have a plan for where we will actually store the public datasets (we have solutions for private data), but from what you folks are saying, it sounds like S3 is the place and we should link to it via Quilt. That sounds like a good plan to me.

On a related topic, we just had a paper accepted ("Maternal Cardiovascular-Related Single Nucleotide Polymorphisms, Genes and Pathways Associated with Early-Onset Preeclampsia") with a smaller dataset (in the low TB IIRC) where we were unable to put the data online publicly for privacy reasons. Instead, we created a strongly typed schema for the data and wrote a method, "synthesizeProgram()", to generate fake but correctly typed data, so we could publish working code and other researchers could just swap out the CSVs to get real results. Perhaps that might be a good thing to integrate into Quilt.
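
For readers who want a feel for the idea, here is a minimal Python sketch of schema-driven synthetic data. It is not the paper's "synthesizeProgram()" and not a Quilt feature; the schema, column names, and the helpers synthesize_rows and to_csv are all hypothetical, made up for illustration.

```python
import csv
import io
import random

# Hypothetical schema (column name -> type tag). These columns are
# illustrative assumptions, not the schema used in the paper.
SCHEMA = {
    "patient_id": "int",
    "snp_id": "str",
    "allele_count": "int",
    "case": "bool",
}


def synthesize_rows(schema, n, seed=0):
    """Return n fake rows whose values match the declared column types."""
    rng = random.Random(seed)  # seeded so the fake data is reproducible
    letters = "abcdefghijklmnopqrstuvwxyz"
    generators = {
        "int": lambda: rng.randint(0, 1000),
        "float": lambda: rng.random(),
        "str": lambda: "".join(rng.choice(letters) for _ in range(8)),
        "bool": lambda: rng.choice([True, False]),
    }
    return [
        {column: generators[kind]() for column, kind in schema.items()}
        for _ in range(n)
    ]


def to_csv(rows, schema):
    """Serialize rows to CSV text with the schema's column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(schema))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(synthesize_rows(SCHEMA, 3), SCHEMA))
```

The values carry no statistical resemblance to real data; they only have the right columns and types, which is enough for published code to run end to end before the real CSVs are swapped in.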

We have a data curators program on Quilt and I encourage you to apply (page bottom on open.quiltdata.com). For high-value public data sets, AWS's Registry of Open Data will, if the data set is accepted, cover the costs of storage and egress. We went through this process with Allen Cell and I'm happy to help.
Great! Done. I'm on the mainland the rest of this month but would love to chat sometime in October.
I just want to give a plug for sharing data in the public cloud and S3 in particular. Jed Sundwall (AWS Global Open Data Lead) sums it up really well: "The cloud completely changes the dynamic for sharing data. When data is shared in the cloud, researchers no longer have to worry about downloading or copying data before getting to work. Instead, they can deploy compute resources on-demand in the cloud, where a single copy of the data is made available. It is much more efficient to move algorithms to where the data is, than to move the data to where the algorithms are, and this makes it cheaper for researchers to ask more questions and experiment often." See the full whitepaper here: https://s3-us-west-2.amazonaws.com/opendata.aws/AWS_Sharing_...
> "It is much more efficient to move algorithms to where the data is, than to move the data to where the algorithms are"

I love this quote, thanks. I do try to do things in the cloud as much as possible, but oftentimes it's more practical for TCO reasons to do things locally.

This quote makes me wonder if in the future we'll see some sort of external SSD with a Raspberry Pi-like portable GPU hooked up. Some sort of dedicated Storage+Computer USB hybrid.

What I like about our schema/anonymization solution is that you can put fake data and real code online, people can then make changes to the real code in the cloud, and you can run those changes reliably against the data locally.

There's no doubt that local processing is a lot cheaper than the cloud for a lot of workloads.

That's a very interesting pattern: publishing "fake" (perhaps safe or anonymized) data online along with code to spur research and development, then running the enhanced code on private data (e.g., PII) using local compute resources.

We hope Quilt packages can play a role in making that easier. The package serves as an interface and layer of abstraction between the code and the data, so the same code can be run against the safe or the private data.
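
As a rough illustration of that separation (plain Python, not the Quilt API), the analysis code below only sees rows with agreed column names, so the source behind it can be the published fake data or a private file. SAFE_CSV, load_rows, mean_allele_count, and the column names are hypothetical.

```python
import csv
import io

# Hypothetical public, synthetic stand-in with the same columns as the
# private data. The values are made up for illustration.
SAFE_CSV = "patient_id,allele_count\n1,2\n2,0\n"


def load_rows(source):
    """Read rows from any text file-like object with a CSV header."""
    return list(csv.DictReader(source))


def mean_allele_count(rows):
    """Analysis code: depends only on the column names, not the source."""
    counts = [int(row["allele_count"]) for row in rows]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    # Online: run against the safe data.
    print(mean_allele_count(load_rows(io.StringIO(SAFE_CSV))))
    # Locally, the same function would be pointed at a private file, e.g.:
    #   with open(private_path, newline="") as f:   # private_path is assumed
    #       print(mean_allele_count(load_rows(f)))
```

The only contract between the two sides is the set of column names and types, which is the part a package or schema would pin down.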

"At the UH Cancer Center we routinely deal with datasets in the TB - PB range ..."

...

"Do you happen to have an S3 bucket with that data live?"

As someone not working in academia (or in this field at all) can you help me understand the question you have just asked ?

Specifically, wouldn't it be tremendously profligate for them to have that PB range dataset living in S3 ?

Given the resources that a university has (in both Internet2 connectivity, hardware budget and (relatively) cheap manpower), why would they ever store that data outside of their own UH datacenter ?

If the answer is "offsite backup" wouldn't it be glacier or nearline or ... anything but S3 ?

Good questions. First, services like open.quiltdata.com and Amazon's Registry of Open Data cover the S3 costs for public data. So that's one incentive. Second, the cost of cloud resources is highly competitive with (if not superior to) that of on-premise data centers (see https://twitter.com/mohapatrahemant/status/11024016152632238...). I don't think it's correct to think of S3 as expensive.

There are many ways to shave S3 costs (e.g. intelligent tiering, glacier), but at some point the data become so slow to access that you can't offer a pleasant user experience around browsing, searching, and feeding pipelines.

Most importantly, the "my data, my bucket" strategy gives users control over their data. A university with their own bucket has more control over their data than they do if Google, Facebook, etc. host and monetize it.

> If the answer is "offsite backup" wouldn't it be glacier or nearline or ... anything but S3 ?

Well, technically, S3 Glacier and S3 Glacier Deep Archive are still S3. Cloud Storage Nearline is similar, except it's a tier on Google's S3-equivalent service.

But lots of public charities, especially academic institutions, host data so that it is conveniently accessible to the public via well-known APIs, including S3. Because of their mission, they do this even when it is not the least expensive option in terms of storage cost and institution-internal access.

+1 for everything Aneesh said, but I also wanted to add that the public cloud offers opportunities in data sharing that academia hasn't yet provided, specifically the ability for collaborators to bring their code to the data. I posted a quote from Jed Sundwall, Global Open Data Lead at AWS, in another thread. I think he really nails it when he says that the cloud "completely changes the dynamic for sharing data."

There certainly have been efforts in academia to provide shared computing resources. Cyverse (https://www.cyverse.org/about) comes to mind. At Wisconsin many researchers shared clusters using Condor. But none, to my knowledge, comes close to the scale, reliability, and features of AWS and the other major cloud providers.