|
|
|
|
|
by dekhn
1053 days ago
|
|
I didn't make my statement out of ignorance. Most of these applications depend on OS optimizations that have been made over the decades; multithreaded readers, readahead, and caching are critically important to read performance. In principle, a remote storage system could be as fast as a local disk. This includes random access. after all, the storage system is just a bunch of drives attached to machines connected by networks. When I worked at Google I wrote a mapreduce that converted BAM files to sstables which are sorted, sharded by key, and sit in an object store like S3. Once the files were in sstables (or columnio) we could do realtime analytics using modern tools. |
|
Even with potential optimizations, initiating a seek on GCS or S3 is far far slower than on a local SSD, so even if Google exposes fast cross-network seeks on objects inside an internal object store system, it is not readily accessible to the plebes like me and 99.9% of genomicists that use cloud systems or their own hardware.