While both Spark and BigQuery do the shuffle step in-memory, there are some differences[1]:
- BigQuery's execution is pipelined (it doesn't wait for step 1 to finish before starting step 2)
- BigQuery's in-memory shuffler is not co-tenant to compute
And, of course, it's one thing to have the software and hardware; operating them is another. BigQuery provides a fully managed, fully encrypted, HA, redundant service that is constantly and seamlessly maintained and upgraded [2].
Pipelining helps with hanging chads, i.e. the tail latency of work steps. If you have a slow worker (due to, say, data skew), the entire job slows down: all the other workers sit idle, waiting for that one worker to finish its piece. Read [0] to see what Dataflow does; BigQuery/Dremel do something very similar to deal with this issue. BigQuery also doesn't have to wait for ALL workers to finish step 1 before proceeding to step 2.
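To make the straggler point concrete, here is a toy back-of-the-envelope model in Python. It is only a sketch under made-up assumptions, not how BigQuery or Spark actually schedule work: step 1 workers each deliver one chunk of output at some time, and a single hypothetical step 2 consumer needs a fixed `cost` per chunk. The function names and the numbers are invented for illustration.

```python
def barrier_finish(arrivals, cost):
    """Step 2 starts only after the slowest step 1 worker has delivered."""
    return max(arrivals) + cost * len(arrivals)


def pipelined_finish(arrivals, cost):
    """Step 2 consumes each chunk as soon as it is available."""
    t = 0
    for arrival in sorted(arrivals):
        # Wait for the chunk if it has not arrived yet, then process it.
        t = max(t, arrival) + cost
    return t


# Three fast workers deliver at t=1; one skewed worker delivers at t=10.
arrivals = [1, 1, 1, 10]
cost = 2

print(barrier_finish(arrivals, cost))    # 18
print(pipelined_finish(arrivals, cost))  # 12
```

In this toy setup the straggler still determines when the last chunk shows up, but with pipelining the work for the other three chunks is already done by then, so only the straggler's own chunk is left on the critical path.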
By co-tenant to compute, I mean that in Spark the processing nodes themselves handle the shuffle. This can cause non-obvious bottlenecks. BigQuery handles the shuffle outside of the processing nodes [1].
Sorry - what does co-tenant to compute mean?