| HN Mirror

Large scale shuffles: Absolutely. One of the larger queries we ran saw a 450TB shuffle -- this may require more than just deploying the spark-rapids plugin, however (depends on the query itself and specific VMs used). Shuffling was the majority of the time and saw 100% (...99%?) GPU utilization. I presume this is partially due to compressing shuffle partitions. Network/disk I/O is definitely not the bottleneck here.

It's difficult to say what "workloads" are significant, and easier to talk about what doesn't really work AFAIK. Large-scale shuffles might see 4x efficiency, assuming you can somehow offload the hash shuffle memory, have scalable fast storage, etc... which we do. Note this is even on GCP, where there isn't any "great" networking infra available.

Things that don't get accelerated include multi-column UDFs and some incompatible operations. These aren't physical/logical limitations, it's just where the software is right now: https://github.com/NVIDIA/spark-rapids/issues

Multi-column UDF support would likely require some compiler-esque work in Scala (which I happen to have experience in).

A few things I expect to be "very" good: joins, string aggregations (empirically), sorting (clustering). Operations which stress memory bandwidth will likely be "surprisingly" good (surprising to most people).

Otherwise, Nvidia has published a bunch of really-good-looking public data, along with some other public companies.

Outside of Spark, I think many people underestimate how "low-latency" GPUs can be. 100 microseconds and above is highly likely to be a good fit for GPU acceleration in general, though that could be as low as 10 microseconds (today).