

by vgt 3305 days ago

Good question! You're probably familiar, through BigQuery, with the bandwidth and throughput of Colossus, the storage system underlying GCS. BigQuery Storage and GCS both leverage Colossus. It's silly fast :)

Others can chime in more intelligently on Spark/Hadoop specifically, but I'll point out that read latency from GCS would definitely be higher than from local-disk HDFS (especially Local SSD). Throughput, depending on your configuration, could be much better with GCS. Spark/Hadoop don't take the same care as BigQuery to optimize the storage-to-compute path, as evidenced by some bits of Hive performing serial FS operations.
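
To make the latency-versus-throughput point concrete, here's a rough back-of-the-envelope sketch. It assumes a very simple model (serial operations each pay one round of latency, and the bulk transfer runs at a fixed throughput). The function names and all the numbers are made-up placeholders for illustration, not measurements of GCS, HDFS, or Local SSD.

```python
def read_time_seconds(num_serial_ops, latency_s, total_bytes, throughput_bytes_per_s):
    """Rough wall-clock estimate: per-operation latency paid serially, plus bulk transfer time."""
    return num_serial_ops * latency_s + total_bytes / throughput_bytes_per_s


# Hypothetical profiles (placeholders, NOT benchmarks of any real system):
# "local" has low latency but modest aggregate throughput,
# "remote" has higher latency but more aggregate throughput.
LOCAL = {"latency_s": 0.001, "throughput_bytes_per_s": 1e9}
REMOTE = {"latency_s": 0.05, "throughput_bytes_per_s": 5e9}


def compare(num_serial_ops, total_bytes):
    """Return (local_seconds, remote_seconds) for the same workload under both profiles."""
    local = read_time_seconds(num_serial_ops, total_bytes=total_bytes, **LOCAL)
    remote = read_time_seconds(num_serial_ops, total_bytes=total_bytes, **REMOTE)
    return local, remote


# Many small serial FS operations over little data: latency dominates.
chatty_local, chatty_remote = compare(num_serial_ops=10_000, total_bytes=1e9)

# A few operations over a lot of data: throughput dominates.
scan_local, scan_remote = compare(num_serial_ops=10, total_bytes=1e12)
```

Under these placeholder numbers the chatty workload comes out faster on the low-latency profile and the big scan comes out faster on the high-throughput profile, which is the shape of the trade-off, not a prediction for any real cluster.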

So my answer is: it depends on the configuration of the job, the cluster, how the data is written, the choice of disk, and so on.

That said, when talking about price-performance, flexibility, scalability, and ease of operations, I suspect the "job-scoped clusters" setup would have a far superior TCO. We should try and do the math one day :)

(co-author of blog, work at G)