Is the time required to upload data to the cloud ever a problem with these solutions? Of course, it depends on what you are trying to do, but suppose you were working with thousands of genomes?
The sequencers can stream to a data analysis center as the data is being generated.
It takes a 100 Mbit/s stream per $1M of sequencing capital, so network connectivity to transfer to a data center is a tiny cost of the whole ordeal.
However, paying for AWS storage is pretty prohibitive, unless you're at a small scale. So big centers will build their own storage facilities.
The small data producers like the ones that the thread author talks about can often use AWS more cost-efficiently than building a compute cluster. However, they need to budget for that, which is not always thought of. They may also need to fight their institute's core center so that they can use DNANexus.
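To put the 100 Mbit/s figure in perspective, here is a rough back-of-the-envelope sketch of how much data such a link moves per day. It assumes the link is fully saturated around the clock and uses decimal units; the function name is just illustrative.

```python
def daily_volume_gb(mbit_per_s: float) -> float:
    """Data moved in 24 hours by a sustained link, in decimal gigabytes.

    Assumes the link is saturated the whole time, which is optimistic.
    """
    bytes_per_s = mbit_per_s * 1_000_000 / 8   # megabits/s -> bytes/s
    return bytes_per_s * 86_400 / 1e9          # seconds per day, bytes -> GB


if __name__ == "__main__":
    # A sustained 100 Mbit/s stream is roughly 1 TB per day.
    print(f"{daily_volume_gb(100):.0f} GB/day")
```

So the stream quoted above works out to about a terabyte a day per $1M of sequencers, under those assumptions.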
S3 storage is pretty cheap, it's the data egress that really costs.
For academic centers though there is often an incentive to move things in house due to different treatment for capital expenditures and the opportunity to externalize some of your costs from your grant onto central services.
Data transfer costs less than a single year of Glacier storage, so while it's pricey I wouldn't call egress a major portion of the cost.
Keeping this data for less than 5-10 years is pretty questionable, since it's so expensive to generate. Eventually it may be cheaper to store the DNA and resequence if it needs to be looked at again. However, if you're doing petabytes of storage, it's going to be much more economical to have your own storage and compute than to use AWS. Particularly at the rate that academic centers pay for sysadmins.
Running a public data portal, our egress costs are higher than our storage costs. (We now proxy downloads through a direct connect to our university network...)
Remember to account for future reductions in storage costs. S3 has come down from $0.1500/GB month in 2010 to $0.0300/GB month today. And the recently introduced infrequent access storage tier is under half that again at $0.0125/GB month. It's now significantly cheaper to use S3/Azure/Google than running the storage ourselves.
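For a sense of scale, the per-GB prices quoted above can be turned into a yearly bill. This is only a sketch: it uses the three list prices from this comment, decimal units (1 TB = 1000 GB), and ignores egress, request fees, and retrieval charges, which matter a lot in practice. The tier labels are my own shorthand.

```python
# Prices in USD per GB-month, as quoted in the comment above.
PRICES = {
    "S3 standard (2010)": 0.1500,
    "S3 standard (today)": 0.0300,
    "S3 infrequent access": 0.0125,
}


def annual_cost_usd(gb: float, price_per_gb_month: float) -> float:
    """Storage-only cost for one year; excludes egress and request fees."""
    return gb * price_per_gb_month * 12


if __name__ == "__main__":
    for label, price in PRICES.items():
        per_tb = annual_cost_usd(1_000, price)        # 1 TB
        per_pb = annual_cost_usd(1_000_000, price)    # 1 PB
        print(f"{label}: ${per_tb:,.0f}/TB-year, ${per_pb:,.0f}/PB-year")
```

At these prices a petabyte in standard S3 is on the order of $360k per year for storage alone, which is the number to compare against an in-house build.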
Sure, all the time. Network and I/O are the biggest blockers for sequence analysis. Any organization that is working with thousands of genomes probably has its own compute resources. I know of at least one organization that is currently sending thousands of genomes to the cloud for analysis, so it's certainly feasible to some extent.
Any chance we could follow up on this? I'm conducting research in this space and would love to have a chat regarding barriers to such large-scale analysis.
If you could message me at hngenometemp@forward.cat that would be terrific!