
Disclaimer - I'm the Cloud Dataproc PM. :)

Super good question.

For most use cases, GCS is going to give you better performance than using PD. GCS also removes some headaches, like replication. As a result, when you read from GCS you can often read from a large number of places in parallel, and you're less likely to run into bottlenecks because the VMs have pretty freaking amazing network bandwidth.

To be fair, there are some exceptions:

1: Small files (several KB to maybe 1MB)

2: Many reads

With GCS, you pay a tax for network overhead + SSL on every request, which means individual reads are slower and scanning tons of small files can be less performant.
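To put rough numbers on that small-file tax, here's a back-of-the-envelope sketch. The 50 ms per-request overhead and 100 MB/s bandwidth are made-up placeholders, not measured GCS figures, and the function name is just for illustration. The only point is that a fixed per-request cost dominates when the file is tiny.

```python
def effective_throughput_mb_s(file_size_mb, per_request_overhead_s, bandwidth_mb_s):
    """Effective throughput of a single read that pays a fixed per-request overhead."""
    transfer_s = file_size_mb / bandwidth_mb_s
    return file_size_mb / (per_request_overhead_s + transfer_s)


# Hypothetical values for illustration only (not real GCS measurements).
OVERHEAD_S = 0.05        # assumed network + SSL cost per request
BANDWIDTH_MB_S = 100.0   # assumed sustained bandwidth once bytes are flowing

small = effective_throughput_mb_s(0.01, OVERHEAD_S, BANDWIDTH_MB_S)    # 10 KB file
large = effective_throughput_mb_s(1000.0, OVERHEAD_S, BANDWIDTH_MB_S)  # 1 GB file

print(f"10 KB file: {small:.2f} MB/s effective")
print(f"1 GB file:  {large:.2f} MB/s effective")
```

Under those assumptions the 10 KB read gets roughly 0.2 MB/s effective, while the 1 GB read gets close to the full 100 MB/s, which is why lots of small files hurt.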

For Cloud Dataproc, we also provision HDFS on PD for intermediate output, because it's usually more cost effective than trying to do everything in GCS (which you can do, but you're going to pay for Class A operations, and we want to save people money).

To go into another level of detail, GCS also used to lack immediate list-after-write consistency. Now that GCS is changing to support it, you also get a good story for clusters hitting the same bucket without hacks (in our case, we used an NFS cache on the cluster).

Dennis Huo, the tech lead manager for Dataproc, also gave a pretty good talk covering some of this at Google Cloud Next [1]. I also gave a talk w/ Michael Yu on the GCS team [2] about a change we're in the process of making: swapping out the Java SSL provider for one from Conscrypt, which nets about a 2x improvement in reads, making a great thing even better. :)

1: https://www.youtube.com/watch?v=NfxvjWSgplU

2: https://www.youtube.com/watch?v=SC_4qO-BIjc