Hacker News new | ask | show | jobs
by epistasis 1059 days ago
Working in genomics, I've dealt with lots of petabyte data stores over the past decade. Having used AWS S3, GCP GCS, and a raft of storage systems for collocated hardware (Ceph, Gluster, and an HP system whose name I have blocked from my memory), I have no small amount of appreciation for the effort that goes into operating these sorts of systems.

And the benefits of sharing disk IOPs with untold numbers of other customers is hard to understate. I hadn't heard the term "heat" as it's used in the article but it's incredibly hard to mitigate on single system. For our co-located hardware clusters, we would have to customize the batch systems to treat IO as an allocatable resource the same as RAM or CPU in order to manage it correctly across large jobs. S3 and GCP are super expensive, but the performance can be worth it.

This sort of article is some of the best of HN, IMHO.

4 comments

It also explains some of the cost model for cloud storage. The best possible customer, from a cloud storage perspective, stores a whole lot of data but reads almost none of it. That's kind of like renting hard drives, except if you only fill some of each hard drive with the "cold" data, you can still use the hard drive's full I/O capacity to handle the hot work. So, if you very carefully balance what sort of data is on which drive, you can keep all of the drives in use despite most of your data not being used. That's part of why storage is comparatively cheap but reads are comparatively expensive.
You get similar properties/challenges in lots of multi consumer storage scenarios. I learned lots of similar lessons working on CDNs when it comes to object distribution and access rates.

If youre interested go search for some of the published work from "Coho Data", they had some great usenix presentations IIRC. This was the previous company Andy Warfield was at and they had an emphasis on effective tracking & prediction of IO workloads across very large datasets.

Unfortunately many tools in genomics (and biotech in general) still depend on local filesystems- and even if they do support S3, performance is far slower than it could be.
Most of these tools treat the "local file" as a stream which can be a pipe to a network stream from the object store.

The files that are not streamed and need random access are often better on a local ephemeral SSDs or in RAM after a fetch of the, say, 50GB hash table, or whatever it is.

At least, that's my experience: streams and in-RAM pre-processed DBs are >99% of file IO.

I didn't make my statement out of ignorance.

Most of these applications depend on OS optimizations that have been made over the decades; multithreaded readers, readahead, and caching are critically important to read performance. In principle, a remote storage system could be as fast as a local disk. This includes random access. after all, the storage system is just a bunch of drives attached to machines connected by networks.

When I worked at Google I wrote a mapreduce that converted BAM files to sstables which are sorted, sharded by key, and sit in an object store like S3. Once the files were in sstables (or columnio) we could do realtime analytics using modern tools.

Right, most people that try to really optimize these things do not have access to the parallelism tools thay Google has built, and end up doing their own ad-hoc sharding schemes. Things that can be built by 1-3 people over the course of a few weeks tk solve ann immediate scaling problem. And of course BAM itself dates back to before standardized serialization formats were brought out of Google.

Even with potential optimizations, initiating a seek on GCS or S3 is far far slower than on a local SSD, so even if Google exposes fast cross-network seeks on objects inside an internal object store system, it is not readily accessible to the plebes like me and 99.9% of genomicists that use cloud systems or their own hardware.

You might be interested in our paper that just got published in Bioinformatics today as chance would have it: https://academic.oup.com/bioinformatics/advance-article-abst...
Thanks for sharing, I'll check it out. It would be interesting to see if it would help with these pretty astonishing Open Omics results that recently came out:

https://community.intel.com/t5/Blogs/Tech-Innovation/Artific...

Of course it’s slower. Your using https to do something that’s meant to be raw binary. The overhead is killing you. Something like iscsci, is pretty quick compared to https as a storage protocol.
Which is exactly why using standard Unix/POSIX files makes sense as a universal interface for genomics programs that are run in highly heterogeneous environments across the world, even if it leads to software engineers wishing that their internal custom data storage systems were used instead.

If random access is needed in a cloud environment, use either that local ephemeral SSD, or a cloud block device which is probably just an iSCSI implementation underneath, or at least a close equivalent.

Operating fleets of compute and IO in cloud environments means that POSIX semantics generally work really well for genomics.

Folks that have reimplemented basic genomics algorithms on top of protocol buffers standardized serialization still store them as BLOBs, and have not delivered benefits that can be realized in publicly available compute environments.

I've worked in web tech long enough to recognize that the overhead of HTTP does not explain the difference in performance between "raw binary" protocols and ones that have textual headers.

Put another way I've seen extremely low-latency https servers. The latency in S3 doesn't come from using https.

The latency is higher so the key is parallelism... Which means you need more cores/hardware/VMs/pick your poison. New but same problem...
Is single-job performance the only criterion? Or can you just run a bunch of different jobs at the same time (genomics has many embarassingly parallel problems, often per-sample) and use the higher aggregate storage bandwidth of your object store to get "more work done in unit time".
I did the latter, across usually about 4000+ CPUs. That got me a peak of 15 or so GB/sec read from one GCS bucket, writing to another.

But yeah if it's not something that can be paralleled then it sucks.

As someone in this area: we very much want to make your EiB of data to feel local. It's hard and I'm sorry we only have 3.5 9's of read availability.
People working on storage systems are doing amazing things. When I first heard about Ceph more than a decade ago, I immediately emailed one of the founders asking for an exabyte data store, because I knew just how amazingly difficult it would be and that it was very much needed.

3.5 9s is incredible on large stores. S3 and GCS are just amazing machines. I have nothing but admiration for the people that make this happen.

Some of the best HN indeed. Would love to see any links to HN posts that you think are similarly good!