Hacker News new | ask | show | jobs
by rmnoon 3305 days ago
What about the traditional IO / data locality win of having your processing colocated with your DFS? Is GCS bandwidth that amazing?
2 comments

Good question! You probably are familiar with the bandwidth and throughput power of the underlying storage system of GCS, Colossus, through use of BigQuery. BigQuery Storage and GCS storage both leverage Colossus. It's silly fast :)

Others can chime in more intelligently wrt Spark/Hadoop specifically, but I'll point out that read latency from GCS would definitely be higher than local-disk HDFS (esp Local SSD). Throughput, depending on your configuration, could be much better with GCS. Spark/Hadoop don't take the same care to optimize the storage-to-compute route as BigQuery, as evident by some bits of Hive performing serial FS operations.

So my answer is, it depends on the configuration of the job, the cluster, how data is written, choice of disk, and et cetera.

That said, when talking about price-performance, flexibility, scalability, and ease of operations, I suspect the "job-scoped clusters" setup would have a far superior TCO. We should try and do the math one day :)

(co-author of blog, work at G)

Disclaimer - I'm the Cloud Dataproc PM. :)

Super good question.

For most use cases, GCS is going to give you better performance than using PD. GCS removes some headaches, like replication. In so doing, when you read from GCS you can often read from a large number of places in parallel; you're also less likely to run into bottlenecks because the VMs have pretty freaking amazing network bandwidth.

To be fair, there are some exceptions:

1: Small files (several KB to maybe 1MB) 2: Many reads

With GCS, you pay the tax for network overhead + SSL, which means reads are slower and scanning tons of small files can also be less performant.

For Cloud Dataproc, we also do provision HDFS on PD for intermediate output, because it's usually more cost effective than trying to do everything in GCS (which you can do, but you're going to pay for class A ops, and we want to save people money.)

To go into another level of detail, GCS also used to lack immediate list after write consistency. Now that GCS has been changing to support it, you also get a good story on clusters hitting the same bucket without hacks (in our case we used an NFS cache on the cluster.)

Dennis Huo, the tech lead manager for Dataproc also gave a pretty good talk covering some of this at Google Cloud Next [1]. I also gave a talk w/ Michael Yu on the GCS team [2] talking about a change we're in the process of making - swapping out the Java SSL provider with one from Conscrypt which nets about a 2x improvement in reads, making a great thing even better. :)

1: https://www.youtube.com/watch?v=NfxvjWSgplU 2: https://www.youtube.com/watch?v=SC_4qO-BIjc