|
|
|
|
|
by ashtonsix
242 days ago
|
|
If the data is already in GPU memory, yes. Otherwise you'll be limited by the DRAM<->VRAM memory bottleneck. When we consider that delta coding (and family), are typically applied as one step in a series of CPU-first transforms and benefit from L1-3 caching we find CPU throughput pulls far-ahead of GPU-based approaches for typical workloads. This note holds for all GPU-based approaches, not just PTX. |
|
We've been implementing GPU support in Presto/Velox for analytical workloads and I'm yet to see a use case where we wouldn't pull ahead.
The DRAM-VRAM memory bottleneck isn't really a bottleneck on GH/GB platforms (you can pull 400+GB/s across the C2C NVLink), and on NVL8 systems like the typical A100/H100 deployments out there, doing real workloads, where the data is coming over the network links, you're toast without using GPUDirect RDMA.