by breck 2455 days ago
> Do you happen to have an S3 bucket with that data live?

No. However, I'm helping start the Data Curation Core at the AIPHI here (https://aiphi.shepherdresearchlab.org/). Our intent is to be a one-stop shop for all medical data in Hawaii. We don't yet have a plan for where we will actually store the public datasets (we have solutions for private data), but from what you folks are saying, it sounds like S3 is the place and we should link to it via Quilt. That sounds like a good plan to me.

On a related topic, we just had a paper accepted ("Maternal Cardiovascular-Related Single Nucleotide Polymorphisms, Genes and Pathways Associated with Early-Onset Preeclampsia") with a smaller dataset (in the low TB IIRC) that we were unable to put online publicly for privacy reasons. Instead, we created a strongly typed schema for the data and wrote a method "synthesizeProgram()" to generate fake but correctly typed data. That way we could publish working code, and other researchers could just swap out the CSVs to get real results. Perhaps that might be a good thing to integrate into Quilt.
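For anyone curious what "fake but correctly typed" data could look like in practice, here is a rough Python sketch. To be clear, this is not the paper's synthesizeProgram(): the column names, the generators, and the helper names (synthesize_rows, conforms, to_csv) are all made up for illustration, and the values carry no meaning beyond having the right type.

```python
import csv
import io
import random

# Hypothetical schema (column name -> Python type); NOT the paper's real schema.
SCHEMA = {
    "patient_id": int,
    "snp_id": str,
    "risk_allele_count": int,
    "early_onset": bool,
}

# One fake-value generator per type. Values are meaningless; only the types matter.
GENERATORS = {
    int: lambda rng: rng.randint(0, 2),
    str: lambda rng: "rs" + str(rng.randint(1, 999999)),
    bool: lambda rng: rng.random() < 0.5,
}


def synthesize_rows(schema, n, seed=0):
    """Generate n fake rows whose values match the schema's types."""
    rng = random.Random(seed)  # seeded so the fake data is reproducible
    return [
        {col: GENERATORS[typ](rng) for col, typ in schema.items()}
        for _ in range(n)
    ]


def conforms(row, schema):
    """Check that a row has exactly the schema's columns and exact types."""
    # type(...) is t (rather than isinstance) so a bool is not accepted as an int.
    return set(row) == set(schema) and all(
        type(row[col]) is typ for col, typ in schema.items()
    )


def to_csv(rows, schema):
    """Serialize rows to CSV text with the schema's column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(schema))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The published code would then run against the output of to_csv(synthesize_rows(SCHEMA, 100), SCHEMA), and a researcher with access to the real data would swap in a CSV with the same columns.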


We have a data curators program on Quilt and I encourage you to apply (page bottom on open.quiltdata.com). For high-value public data sets, AWS's registry of open data will, if accepted, cover the costs of storage and egress. We went through this process with Allen Cell and I'm happy to help.
Great! Done. I'm on the mainland the rest of this month but would love to chat sometime in October.
I just want to give a plug for sharing data in the public cloud and S3 in particular. Jed Sundwall (AWS Global Open Data Lead) sums it up really well: "The cloud completely changes the dynamic for sharing data. When data is shared in the cloud, researchers no longer have to worry about downloading or copying data before getting to work. Instead, they can deploy compute resources on-demand in the cloud, where a single copy of the data is made available. It is much more efficient to move algorithms to where the data is, than to move the data to where the algorithms are, and this makes it cheaper for researchers to ask more questions and experiment often." See the full whitepaper here: https://s3-us-west-2.amazonaws.com/opendata.aws/AWS_Sharing_...
> "It is much more efficient to move algorithms to where the data is, than to move the data to where the algorithms are"

I love this quote, thanks. I do try to do things in the cloud as much as possible, but oftentimes it's more practical for TCO reasons to do things locally.

This quote makes me wonder if in the future we'll see some sort of external SSD with a Raspberry Pi-like portable GPU hooked up. Some sort of dedicated Storage+Computer USB hybrid.

What I like about our schema/anonymization solution is that you can put fake data and real code online, people can then make changes to the real code in the cloud, and you can run those changes reliably on the real data locally.

There's no doubt that local processing is a lot cheaper than the cloud for a lot of workloads.

That's a very interesting pattern: publishing "fake" (perhaps safe or anonymized) data online along with code to spur research and development, then running the enhanced code on private data (e.g., PII) using local compute resources.

We hope Quilt packages can play a role in making that easier. The package serves as an interface and layer of abstraction between the code and the data, so the same code can be run against either the safe or the private data.
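As a rough illustration of that layer of abstraction, here is a minimal Python sketch. It deliberately does not use the Quilt API: the SOURCES registry, the inline CSV strings, and the function names are hypothetical stand-ins for whatever a package would actually resolve to. The point is only that the analysis function never knows which dataset it was handed.

```python
import csv
import io
from typing import Callable, Dict, List

Row = Dict[str, str]


def load_csv_text(text: str) -> List[Row]:
    """Parse CSV text into a list of row dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical stand-ins: in practice these would be a public package and a
# private, local-only dataset sharing the same columns.
SAFE_CSV = "patient_id,early_onset\n1,True\n2,False\n"
PRIVATE_CSV = "patient_id,early_onset\n101,True\n102,True\n103,False\n"

# The abstraction layer: a name maps to a loader, and callers only see rows.
SOURCES: Dict[str, Callable[[], List[Row]]] = {
    "safe": lambda: load_csv_text(SAFE_CSV),
    "private": lambda: load_csv_text(PRIVATE_CSV),
}


def early_onset_rate(rows: List[Row]) -> float:
    """The 'real code': identical regardless of which data it runs on."""
    cases = sum(1 for row in rows if row["early_onset"] == "True")
    return cases / len(rows)


def run(source_name: str) -> float:
    """Resolve a data source by name and run the shared analysis on it."""
    return early_onset_rate(SOURCES[source_name]())
```

Here run("safe") is what would execute in public, and run("private") is the same code pointed at the local data.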