by bob1029 572 days ago
I see a lot of "just use the GPU" and you'd often be right.

SIMD on the CPU is most compelling to me due to the latency characteristics. You are nanoseconds away from the control flow. If the GPU needs some updated state regarding the outside world, it takes significantly longer to propagate this information.

For most use cases, the GPU will win the trade-off. But there is a reason you don't hear much about systems like order matching engines using them.

3 comments

Despite my 'Use a GPU' post below, you are absolutely correct.

Maximizing performance on a CPU today requires all the steps in the above article, and the article is actually very well written with regard to the 'mindset' needed to tackle a problem such as this.

It's a great article for people aiming to maximize the performance on Intel or AMD systems.

------

CPUs have the memory capacity advantage and will continue to hold said advantage for the foreseeable future (despite NVIDIA's NVLink and other techs that try to bridge the gap).

And writing CPU code remains far easier than learning CUDA, even though these AVX intrinsics are hard in comparison to CUDA.

> CPUs have the memory capacity advantage

Perhaps, more precisely, they also have quite an advantage on anything that needs and plays nicely with caches? When I sliced my problem to maximize cache usage, I saw pretty clear scalability with cores: L1/L2 cache bandwidth is ~30GB/s per core, so e.g. a 32-core system starts to compete with the big consumer GPUs.
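To make the arithmetic behind that claim explicit, here is a minimal sketch. It assumes the ~30GB/s figure is per core and that cache bandwidth scales linearly with core count; both are assumptions, and the function name is made up for illustration.

```python
# Aggregate L1/L2 cache bandwidth under a simple linear-scaling assumption.
# The 30 GB/s value is the figure quoted in the comment, read as "per core".
PER_CORE_CACHE_GBPS = 30


def aggregate_cache_bandwidth_gbps(cores, per_core_gbps=PER_CORE_CACHE_GBPS):
    """Total cache bandwidth if every core streams from its own private cache."""
    return cores * per_core_gbps


for cores in (8, 16, 32):
    print(cores, "cores:", aggregate_cache_bandwidth_gbps(cores), "GB/s")
```

Under those assumptions a 32-core system lands at 960 GB/s in aggregate, which is the sense in which it "starts to compete". Real scaling depends on the working set actually staying in the private caches.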

Caches between CPU and GPU are extremely complex.

Not only because caching is complex in CPU land, but because GPU has a completely different set of caches (and registers) that depends entirely on architecture.

Case in point: GPUs often have access to 256 architectural registers aka 1024-bytes of register space. Now this depends on how much occupancy your GPU code is targeting (maybe 4-occupancy?), but there's a lot you can do with even 64-registers (aka: 4-occupancy and 256-bytes of register space), and is key to making those blazing fast FP16 matrix-multiplication kernels "for AI" everyone's so hot about right now.

For something like FP16 matrix multiply (a very cache-friendly problem), the entire SIMD unit (all 32 lanes) of the GPU's streaming multiprocessor works together on the problem. So we're talking about an effective 32kB of __register space__ (let alone cache or other memory in the hierarchy).
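The register numbers above can be checked with a few lines of arithmetic. This is a sketch that assumes 32-bit (4-byte) registers, which is what makes 256 registers equal 1024 bytes; the helper names are hypothetical.

```python
# Register-space arithmetic for one GPU lane and for a 32-lane SIMD group.
BYTES_PER_REGISTER = 4  # assumption: 32-bit registers (256 regs -> 1024 bytes)


def register_bytes_per_lane(registers):
    """Register space available to a single lane."""
    return registers * BYTES_PER_REGISTER


def register_bytes_per_simd(registers, lanes=32):
    """Register space when all lanes cooperate on one problem."""
    return register_bytes_per_lane(registers) * lanes


print(register_bytes_per_lane(256))   # 1024 bytes per lane
print(register_bytes_per_lane(64))    # 256 bytes per lane at 4-occupancy
print(register_bytes_per_simd(256))   # 32768 bytes, i.e. 32 KiB across 32 lanes
```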

Even before FP16 matrix-multiply instructions, this absurd register space advantage is why GPUs were the king of matrix multiplication.

--------

GPUs are worse at larger cache sizes, say 1MB or 2MB. At 1MB+, a modern CPU core's L2 cache can hold all of that (either Intel P-core, AMD Zen5, or even Intel E-core can hold many MB in L2).

GPUs have a secret though: a crossbar at the __shared__ memory level. Rearranging memory and data across your lanes can be done through this crossbar (including many-to-one reductions for atomics, or one-to-many broadcasts in just a single clock tick). So your GPU lanes have incredible communication available to them (and this crossbar is the key for modern ballot / voting based horizontal compute of modern GPU styles). This is only ~64kB of space and shared between all GPU lanes, but with 1024 lanes supporting communication it's an important element of GPU memory.

CPU L3 cache is very nice from a latency perspective, but bandwidth-wise L3 cache is on the order of a GPU's GDDR6X or HBM: 500GB/s to 2000GB/s, depending on the technology.

Finally, CPU DDR5 RAM may be the slowest in this discussion, but it's the biggest. 2TB+ Xeon servers aren't even that expensive and can be assumed for any serious tech firm these days (i.e. all-RAM databases and whatnot).

---------

So at different sizes and different use-scenarios, GPUs and CPUs will trade places. I'd expect CPUs to win most red-black tree races, but GPUs will win matrix multiplication. Both take advantage of "cache" but in very different ways.

You would be surprised. The GPU often loses even for small neural nets given the large latency. Anything that needs high throughput or is sized like an HPC problem should use a GPU, but a lot of code benefits from SIMD on small problems.

If you run many small tasks on the GPU, you can increase throughput by overlapping transfers and computation. There may also be other ways to batch problems together, but that depends on the algorithms.

The one truly unfixable issue is round-trip latency.
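The latency-versus-throughput trade-off discussed here can be sketched with a toy cost model. Everything in it is an assumption for illustration: the function names, the fixed round-trip cost, and the example throughput numbers are hypothetical, not measurements.

```python
# Toy model: the GPU pays a fixed round-trip cost, then processes items faster.
# All parameter values used below are hypothetical.

def cpu_time_us(items, cpu_items_per_us):
    """Time for the CPU (e.g. SIMD code) to process `items` with no fixed cost."""
    return items / cpu_items_per_us


def gpu_time_us(items, round_trip_us, gpu_items_per_us):
    """Time for the GPU: fixed round-trip latency plus throughput-bound work."""
    return round_trip_us + items / gpu_items_per_us


def break_even_items(round_trip_us, cpu_items_per_us, gpu_items_per_us):
    """Problem size at which both take equally long (inf if the GPU never wins)."""
    if gpu_items_per_us <= cpu_items_per_us:
        return float("inf")
    return round_trip_us / (1.0 / cpu_items_per_us - 1.0 / gpu_items_per_us)


# Hypothetical numbers: 20 us round trip, CPU 100 items/us, GPU 2000 items/us.
for n in (1_000, 10_000):
    print(n, cpu_time_us(n, 100), gpu_time_us(n, 20, 2000))
print("break-even items:", break_even_items(20, 100, 2000))
```

In this model, batching more work per launch moves you past the break-even point and improves throughput, but no single result can arrive sooner than the round-trip term, which is the "unfixable" part.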

> The GPU often loses even for small neural nets given the large latency

Apple's neural engine shows that you can live in between those two worlds.

As you said, the trouble is the latency; the programming model is still great.

Do Apple's chips (M1 etc) change this at all, since they share memory with the GPU?

Apple chips share the same physical memory between the GPU and the CPU. Still, they don't have USM/UVM (Unified Shared Memory/Unified Virtual Memory), that is, the GPU and the CPU can't access the same data concurrently and easily. Programs must map/unmap pages to control which device accesses it, and that's a very expensive operation.

They don't need to be unmapped just for the other one to use it. source: I wrote GPU drivers for over 10 years.

Not much. Synchronisation of tasks is still a big overhead.

If you've got tens to hundreds of microseconds worth of workload, sure, get the GPU to do it.

But bear in mind 1000 clocks at 4GHz is 250ns; there's still a sizeable region where tight CPU/GPU integration isn't tight enough.
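The clock arithmetic is easy to verify. A minimal sketch, with hypothetical helper names, converting between CPU clocks and wall time at a given frequency:

```python
# Convert between CPU clock cycles and nanoseconds at a given clock frequency.

def clocks_to_ns(clocks, freq_ghz):
    """A frequency of f GHz means f cycles per nanosecond."""
    return clocks / freq_ghz


def ns_to_clocks(ns, freq_ghz):
    return ns * freq_ghz


print(clocks_to_ns(1000, 4.0))    # 250.0 ns, matching the figure above
print(ns_to_clocks(10_000, 4.0))  # 10 us of work is 40000.0 clocks at 4 GHz
```

So a workload of "tens of microseconds" is tens of thousands of clocks, while the region under discussion is work that is only hundreds to a few thousand clocks long.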

I think an argument could be made depending on the real world timings. How much closer in time is the Apple GPU vs one on a PCIe bus?