by _zoltan_
241 days ago
What typical workload do you have in mind where CPUs are better? We've been implementing GPU support in Presto/Velox for analytical workloads, and I have yet to see a use case where we wouldn't pull ahead. The DRAM-VRAM memory bottleneck isn't really a bottleneck on GH/GB platforms (you can pull 400+GB/s across the C2C NVLink). And on NVL8 systems, like the typical A100/H100 deployments out there doing real workloads where the data comes in over the network links, you're toast without GPUDirect RDMA.

Of course, prefix sums are often used within a series of other operators, so if those are already computed on the GPU, you come out further ahead still.
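
To make the "series of other operators" point concrete, here is a minimal sketch of one common pattern: a filter followed by compaction, where an exclusive prefix sum over the filter flags gives each surviving row its output position. This is plain CPU Python for illustration only, not Presto/Velox code; the names `exclusive_prefix_sum` and `compact` are hypothetical. On a GPU, each of the three steps would be a kernel, and the intermediate arrays would stay in device memory between them.

```python
from itertools import accumulate


def exclusive_prefix_sum(flags):
    """Exclusive scan: out[i] is the sum of flags[0..i-1]."""
    if not flags:
        return []
    inclusive = list(accumulate(flags))
    return [0] + inclusive[:-1]


def compact(values, predicate):
    """Filter + compaction, written as the three data-parallel steps
    a GPU pipeline would typically use (hypothetical helper)."""
    if not values:
        return []
    # Step 1: evaluate the predicate per element (a "map").
    flags = [1 if predicate(v) else 0 for v in values]
    # Step 2: the scan turns flags into output offsets.
    offsets = exclusive_prefix_sum(flags)
    # Step 3: scatter each surviving element to its offset.
    out = [None] * (offsets[-1] + flags[-1])
    for v, f, o in zip(values, flags, offsets):
        if f:
            out[o] = v
    return out


if __name__ == "__main__":
    print(compact([5, 12, 7, 20, 3], lambda v: v > 6))
```

The scan here is only the middle step, so if the flags are already on the device and the scatter consumes the offsets there too, no transfer over the host link is needed just for the prefix sum.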