by epistasis
1052 days ago
Most of these tools treat the "local file" as a stream, which can just as well be a pipe fed by a network stream from the object store. The files that are not streamed and need random access are often better off on local ephemeral SSDs, or in RAM after fetching the, say, 50GB hash table, or whatever it is. At least, that's my experience: streams and in-RAM pre-processed DBs are >99% of file IO.
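To make the "file as a stream" point concrete, here is a minimal Python sketch. The function name `stream_checksum` and the sample data are made up for illustration; the only assumption is that the input is a binary file-like object with `read()`, which could be a local file, `sys.stdin.buffer` at the end of a pipe, or an HTTP response body from an object store.

```python
import io
import zlib


def stream_checksum(stream, chunk_size=1 << 20):
    """Consume a binary file-like object front to back, never seeking.

    Returns (bytes_read, crc32). Because it only calls read(), it does not
    care whether the bytes come from disk, a pipe, or a network socket.
    """
    crc = 0
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        crc = zlib.crc32(chunk, crc)  # running checksum across chunks
        total += len(chunk)
    return total, crc


# io.BytesIO stands in for the network stream here (illustrative data only).
data = b"ACGT" * 1000
total, crc = stream_checksum(io.BytesIO(data), chunk_size=333)
```

In a pipeline the same function could be fed `sys.stdin.buffer`, so whatever fetches from the object store just writes to stdout and nothing ever lands on local disk.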
|
Most of these applications depend on OS optimizations that have been made over the decades; multithreaded readers, readahead, and caching are critically important to read performance. In principle, a remote storage system could be as fast as a local disk, including for random access. After all, the storage system is just a bunch of drives attached to machines connected by networks.
When I worked at Google I wrote a mapreduce that converted BAM files to sstables, which are sorted, sharded by key, and sit in an object store like S3. Once the files were in sstables (or columnio) we could do realtime analytics using modern tools.
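For anyone unfamiliar with the "sorted, sharded by key" layout, here is a rough in-memory Python sketch of the idea. It is not the actual SSTable format or the mapreduce described above: the hash-based sharding, the function names, and the `chr:position` keys are all assumptions made up for illustration.

```python
import bisect
import zlib


def shard_of(key, num_shards):
    """Pick a shard from a stable hash of the key (an assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def build_shards(records, num_shards):
    """Split (key, value) records into shards, each sorted by key.

    Each shard is a pair of parallel lists: (sorted_keys, values).
    """
    buckets = [[] for _ in range(num_shards)]
    for key, value in records:
        buckets[shard_of(key, num_shards)].append((key, value))
    shards = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])  # sort within the shard by key only
        shards.append(([k for k, _ in bucket], [v for _, v in bucket]))
    return shards


def lookup(shards, key):
    """Go to the one shard that can hold the key, then binary search it."""
    keys, values = shards[shard_of(key, len(shards))]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return values[i]
    return None


# Hypothetical records keyed by zero-padded "chromosome:position".
records = [
    ("chr2:000500", "read-c"),
    ("chr1:000100", "read-a"),
    ("chr1:000250", "read-b"),
]
shards = build_shards(records, num_shards=4)
```

The point of the layout is that a point lookup touches one shard and does a logarithmic search inside it, which is what makes serving queries straight from an object store plausible.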