Hacker News (HN Mirror)


	by tonetegeatinst 244 days ago
	Wonder if PTX programming for a GPU would accelerate this.

4 comments

ashtonsix 244 days ago

If the data is already in GPU memory, yes. Otherwise you'll be limited by the DRAM<->VRAM memory bottleneck.

When we consider that delta coding (and family) is typically applied as one step in a series of CPU-first transforms and benefits from L1-L3 caching, we find CPU throughput pulls far ahead of GPU-based approaches for typical workloads.

This note holds for all GPU-based approaches, not just PTX.
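
For readers new to the topic, a minimal sketch of why prefix sums matter for delta coding: decoding a delta-coded series is an inclusive prefix sum. The function names and the sample timestamps below are made up for illustration and are not taken from the OP's code.

```python
from itertools import accumulate

def delta_encode(values):
    """Keep the first value, then store each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """An inclusive prefix sum over the deltas restores the original series."""
    return list(accumulate(deltas))

# Hypothetical timestamps, as a TSDB might store them.
timestamps = [1000, 1010, 1020, 1035, 1050]
deltas = delta_encode(timestamps)
print(deltas)                              # [1000, 10, 10, 15, 15]
print(delta_decode(deltas) == timestamps)  # True
```

The small deltas are what make the encoded form compress well; the prefix sum is the cost paid on every read.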

_zoltan_ 244 days ago

What is the typical workload you speak of, where CPUs are better?

We've been implementing GPU support in Presto/Velox for analytical workloads and I have yet to see a use case where we wouldn't pull ahead.

The DRAM<->VRAM link isn't really a bottleneck on GH/GB platforms (you can pull 400+ GB/s across the NVLink C2C). And on NVL8 systems like the typical A100/H100 deployments out there, running real workloads where the data comes in over the network links, you're toast without GPUDirect RDMA.

tmostak 244 days ago

Even without NVLink C2C, on a GPU with 16 PCIe 5.0 lanes to the host, you have 128 GB/s in theory and 100+ GB/s in practice of bidirectional bandwidth (half that in each direction), so you still come out ahead with pipelining.

Of course prefix sums are often used within a series of other operators, so if these are already computed on GPU, you come out further ahead still.
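
A toy model of the pipelining argument, as a rough sketch only. It assumes three stages per chunk (host-to-device copy, compute, device-to-host copy) that overlap perfectly, and it assumes a 1 GB chunk at about 50 GB/s per direction, so 20 ms per copy. The 5 ms compute time is purely hypothetical.

```python
def serial_time_ms(n_chunks, h2d_ms, compute_ms, d2h_ms):
    """No overlap: every chunk pays for all three stages back to back."""
    return n_chunks * (h2d_ms + compute_ms + d2h_ms)

def pipelined_time_ms(n_chunks, h2d_ms, compute_ms, d2h_ms):
    """Ideal overlap: fill the pipeline once, then the slowest stage sets the pace."""
    fill = h2d_ms + compute_ms + d2h_ms
    return fill + (n_chunks - 1) * max(h2d_ms, compute_ms, d2h_ms)

# Assumed figures: 20 ms per copy direction, hypothetical 5 ms of GPU compute.
print(serial_time_ms(100, 20, 5, 20))     # 4500
print(pipelined_time_ms(100, 20, 5, 20))  # 2025
```

Under these assumptions the pipelined run is bound by the copy in one direction, which is the point about bidirectional bandwidth above. Real transfers are spikier than this model allows.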

ashtonsix 244 days ago

Haha... GPUs are great. But do you mean to suggest we should swap a single ARM core for a top-line GPU with 10k+ cores and compare numbers on that basis? Surely not.

Let's consider this in terms of throughput-per-$ so we have a fungible measurement unit. I think we're all agreed that this problem's bottleneck is the host memory<->compute bus, so the question is: for $1, which server architecture lets you pump more data from memory to a compute core?

It looks like you can get an H100 GPU with PCIe 5.0 x16 (128 GB/s theoretical, 100 GB/s realistic) for $1.99/hr from RunPod.

With an m8g.8xlarge instance (32 ARM CPU cores) you should get much better RAM<->CPU throughput (175 GB/s realistic) for $1.44/hr from AWS.
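
The comparison above, worked through as a quick sketch. It uses only the prices and "realistic" bandwidths quoted in this comment and assumes sustained throughput for a full hour; `gb_per_dollar` is a helper name made up for this example.

```python
def gb_per_dollar(gb_per_s, usd_per_hr):
    """GB moved from memory to compute per dollar, at sustained throughput."""
    return gb_per_s * 3600 / usd_per_hr

h100_pcie = gb_per_dollar(100, 1.99)    # H100 over PCIe 5.0 x16, RunPod price
m8g_8xlarge = gb_per_dollar(175, 1.44)  # m8g.8xlarge RAM<->CPU, AWS price

print(f"H100 over PCIe: {h100_pcie / 1000:.1f} TB per $")
print(f"m8g.8xlarge:    {m8g_8xlarge / 1000:.1f} TB per $")
print(f"ratio:          {m8g_8xlarge / h100_pcie:.2f}x")
```

On these figures the CPU instance moves roughly 2.4x more data per dollar.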

_zoltan_ 244 days ago

GH200 is $1.50/hr at Lambda and can do 450 GB/s to the GPU. Still seems cheaper?

ashtonsix 244 days ago

By typical I imagined adoption within commonly-deployed TSDBs like Prometheus, InfluxDB, etc.

GB/GH are actually ideal targets for my code: both architectures integrate Neoverse V2 cores, the same core I developed for. They are superchips with 144/72 CPU cores respectively.

The perf numbers I shared are for one core, so multiply the numbers I gave by 144/72 to get expected throughput on GB/GH. As you (apparently?) have access to this hardware, I'd sincerely appreciate it if you could benchmark my code there and share the results.

_zoltan_ 244 days ago

GB is CPU+2xGPU.

GH is readily available to anybody at $1.50 per hour on Lambda; GB is harder to get and we're only just beginning to experiment on it.

ashtonsix 244 days ago

Each Grace CPU has multiple cores: https://www.nvidia.com/en-gb/data-center/grace-cpu-superchip

This superchip (might be different to whichever you're referring to) has 2 CPUs (144 cores): https://developer.nvidia.com/blog/nvidia-grace-cpu-superchip...

bassp 244 days ago

Yes! There’s a canonical algorithm called the “Blelloch scan” for prefix sum (aka prefix scan, because you can generalize “sum” to “any binary associative function”) that’s very GPU-friendly. I have… fond is the wrong word, but “strong” memories of implementing it in a parallel programming class :)

Here’s a link to a pretty accessible writeup, if you’re curious about the details: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-co...
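
A minimal sequential sketch of the Blelloch (work-efficient) exclusive scan, for anyone who wants the shape of it before reading the writeup. On a GPU each inner `for` loop would run as one parallel step; here it is plain Python. The zero-padding to a power of two and the function name are choices made for this example.

```python
def blelloch_exclusive_scan(values, op=lambda a, b: a + b, identity=0):
    """Exclusive scan via up-sweep (reduce) and down-sweep over an implicit tree."""
    n = len(values)
    if n == 0:
        return []
    size = 1
    while size < n:  # pad to a power of two with the identity element
        size *= 2
    tree = list(values) + [identity] * (size - n)

    # Up-sweep: each right child accumulates the total of its subtree.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            tree[i] = op(tree[i - stride], tree[i])
        stride *= 2

    # Down-sweep: push prefixes back down. The left child takes the parent's
    # prefix; the right child takes that prefix combined with the left total.
    tree[size - 1] = identity
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left_total = tree[i - stride]
            tree[i - stride] = tree[i]
            tree[i] = op(tree[i], left_total)
        stride //= 2
    return tree[:n]

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# [0, 3, 4, 11, 11, 15, 16, 22]
```

Because only associativity is required, the same code works for other operators, e.g. `op=max` with a suitable identity, or string concatenation with `identity=""`.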

ashtonsix 244 days ago

Mm, I used that exact writeup as a reference to implement this algorithm in WebGL 3 years ago: https://github.com/ashtonsix/webglc/blob/main/src/kernel/sca...

It even inspired the alternative "transpose" method I describe in the OP README.

almostgotcaught 244 days ago

Lol do you think "PTX programming" is some kind of trick path to perf? It's just inline asm. Sometimes it's necessary but most of the time "CUDA is all you need":

https://github.com/b0nes164/GPUPrefixSums

TinkersW 244 days ago

Your average non-shared-memory GPU communicates with the CPU over PCIe, which is dogshit slow, like 100x slower than DRAM.

I can upload an average of about 3.7 MB per millisecond to my GPU (PCIe gen 3, x8), but it can be spiky and sometimes takes longer than you might expect.

By comparison, a byte-based AVX2 prefix scan can pretty much run at the speed of DRAM, so there is never any reason to transfer to the GPU.
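
For illustration, the shift-and-add pattern that in-register SIMD prefix scans commonly use, modelled here in plain Python on a 16-byte vector. This is a sketch of the general technique, not the commenter's implementation; the helper names are made up, and real code would use vector shift and add intrinsics instead of lists.

```python
def shift_toward_high(vec, n):
    """Move every element n positions toward higher indices, filling with zeros."""
    return [0] * n + vec[:len(vec) - n]

def add_u8(a, b):
    """Element-wise add with 8-bit wraparound, like an unsigned byte vector add."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]

def simd_style_prefix_sum(vec):
    """Inclusive prefix sum in log2(len) shift-and-add steps."""
    shift = 1
    while shift < len(vec):
        vec = add_u8(vec, shift_toward_high(vec, shift))
        shift *= 2
    return vec

print(simd_style_prefix_sum([1] * 16))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
```

Sixteen bytes take four steps, and each step is a handful of instructions on data already in registers, which is why a scan like this can be limited by memory bandwidth and not by arithmetic.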