by vgt 3579 days ago
While both Spark and BigQuery do the shuffle step in-memory, there are some differences[1]:

- BigQuery's execution is pipelined (don't wait for step 1 to finish to start step 2)

- BigQuery's in-memory shuffler is not co-tenant to compute

And, of course, it's one thing to have software and hardware. BigQuery provides a fully-managed, fully-encrypted, HA, redundant service that is constantly and seamlessly maintained and upgraded [2].

[1]https://cloud.google.com/blog/big-data/2016/08/in-memory-que...

[2]https://cloud.google.com/blog/big-data/2016/08/google-bigque...

(disclosure: I work on BigQuery)


Would pipelining help much when the processing job is CPU bound (all cores maxed out)?

Sorry - what does co-tenant to compute mean?

Pipelining helps with hanging chads, i.e. the tail latency of work steps. If you have a slow worker (due to, say, data skew), the entire job slows down: all other workers sit idle, waiting for that one worker to finish its piece. Read [0] to see what Dataflow does; BigQuery/Dremel do very similar things to deal with this issue. BigQuery also doesn't have to wait for ALL workers to finish step 1 before proceeding to step 2.
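To make the straggler point concrete, here is a toy model in Python. It is only a sketch, not how BigQuery or Spark actually schedule work: the function names and the per-worker durations are made up for illustration, and the pipelined case assumes each worker can start step 2 as soon as its own step 1 is done, which ignores real shuffle dependencies.

```python
def barrier_finish_time(step1, step2):
    """Job time when step 2 starts only after ALL workers finish step 1."""
    return max(step1) + max(step2)


def pipelined_finish_time(step1, step2):
    """Job time when each worker starts step 2 right after its own step 1.

    Toy assumption: a worker's step 2 depends only on its own step 1.
    """
    return max(a + b for a, b in zip(step1, step2))


# Hypothetical per-worker durations in seconds.
# The last worker is a straggler in step 1 (e.g. because of data skew).
step1 = [10, 10, 10, 30]
step2 = [20, 20, 20, 5]

print(barrier_finish_time(step1, step2))    # 50
print(pipelined_finish_time(step1, step2))  # 35
```

In this model the two numbers are identical when every worker takes the same time per step (say 10s and 20s each, giving 30 in both cases), so the gain comes from skew between workers rather than from raw throughput.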

By co-tenant to compute, I mean that processing nodes themselves handle the shuffle in Spark. This can cause non-obvious bottlenecks. BigQuery handles shuffle outside of the processing nodes [1].

[0] https://cloud.google.com/blog/big-data/2016/05/no-shard-left...

[1] https://cloud.google.com/blog/big-data/2016/08/in-memory-que...