by dist-epoch 573 days ago
> If you just wrote your SIMD in CUDA 15 years ago, NVidia compilers would have given you maximum performance across all NVidia GPUs

That's not true. For maximum performance you need to tweak the code to a particular GPU model/architecture.

Intel has SSE/AVX/AVX2/AVX512, but CUDA has like 10 iterations of this (increasing capabilities). Code written 15 years ago would not use modern capabilities, like more flexible memory access, atomics.


Maximum performance? Okay, you'll have to upgrade to ballot instructions or whatever and rearchitect your algorithms. (Or other wavefront / voting / etc. etc. new instructions that have been invented. Especially those 4x4 matrix multiplication AI instructions).
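For anyone who hasn't met ballot instructions: in CUDA the intrinsic is `__ballot_sync`, and each lane of a warp contributes one bit to a 32-bit mask. Below is a rough CPU-side model of that result in plain C++, not CUDA code. `WarpBallot` and `CountVotes` are hypothetical helper names invented for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// CPU-side model of a warp-wide ballot: lane i contributes bit i of the mask.
// kWarpSize = 32 matches the warp width on NVIDIA GPUs.
constexpr int kWarpSize = 32;

std::uint32_t WarpBallot(const std::array<bool, kWarpSize>& predicate) {
  std::uint32_t mask = 0;
  for (int lane = 0; lane < kWarpSize; ++lane) {
    if (predicate[lane]) mask |= (std::uint32_t{1} << lane);
  }
  return mask;
}

// Counts the lanes that voted true, the way a kernel might follow a ballot
// with a population count.
int CountVotes(std::uint32_t mask) {
  int count = 0;
  while (mask != 0) {
    mask &= mask - 1;  // clear the lowest set bit
    ++count;
  }
  return count;
}
```

On a GPU the whole vote is a single instruction across the warp, which is why algorithms built around it look different from ones written before it existed.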

But CUDA -> PTX intermediate code has allowed for significantly more flexibility. For crying out loud, the entire machine code (aka SASS) of NVidia GPUs has been cycled out at least 4 times in the past decade (128-bit bundles, changes to instruction formats, acquire/release semantics, etc etc)

It's amazing what backwards compatibility NVidia has achieved in the past 15 years thanks to this architecture. SASS changes so dramatically from generation to generation but the PTX intermediate code has stayed highly competitive.

Intel code from 15 years ago also runs today. But it will not use AVX512.

Which is the same with PTX, right? If you didn't use the tensor core instructions or wavefront voting in the CUDA code, the PTX generated from it will not either, and NVIDIA will not magically add those capabilities in when compiling to SASS.

Maybe it remains competitive because the code is inherently parallel anyway, so it will naturally scale to fill the extra execution units of the GPU, which is where most of the improvement is generation to generation.
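A minimal sketch of that "naturally scales" point, assuming a grid-stride style loop. It is modeled in plain C++ with workers run one after another; on a GPU they would run concurrently. `SaxpyWorker` and `Saxpy` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "thread" of a grid-stride loop: worker `id` of `num_workers` handles
// elements id, id + num_workers, id + 2 * num_workers, ...
void SaxpyWorker(std::size_t id, std::size_t num_workers, float a,
                 const std::vector<float>& x, std::vector<float>& y) {
  for (std::size_t i = id; i < x.size(); i += num_workers) {
    y[i] = a * x[i] + y[i];
  }
}

// Runs the same per-element code with any worker count. Workers execute
// sequentially here purely to keep the sketch portable.
std::vector<float> Saxpy(std::size_t num_workers, float a,
                         const std::vector<float>& x, std::vector<float> y) {
  assert(x.size() == y.size());
  for (std::size_t id = 0; id < num_workers; ++id) {
    SaxpyWorker(id, num_workers, a, x, y);
  }
  return y;
}
```

The per-element code never mentions how many workers exist, so the same source gives the same result with 1 worker or 80. Only the launch width changes.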

While AVX code can't automatically scale to use the AVX512 units.

It's not the same. AVX2 instructions haven't changed and never will change.

In contrast, NVidia can go from 64-bit instruction bundles to 128-bit machine code (96-bit instruction + 32-bit control information) between Pascal (aka Compute Capability 6) and Volta (aka Compute Capability 7), and all the old PTX code just autocompiles to the new assembly instruction format and takes advantage of all the new memory barriers added in Volta.

Having a PTX translation layer is a MAJOR advantage for the NVidia workflow.

There is still a lot of similarity between CPU and GPU programming - between AVX and PTX. Different generations of CPU cores handle the same AVX2 instructions differently. The microcode changes and the schedulers change, but the process is transparent for the user, similar to PTX.

I imagine there is an order of magnitude of difference between how much you can translate in software, with large memory and a significant time budget to work with, compared to microcode.

Most CPU instructions are 1-to-1 with their microcode. I dare say that microcode is nearly irrelevant: any high-performance instruction (ex: multiply, add, XOR, etc. etc.) is but a single instruction anyway.

Load/Store are memory dependent in all architectures. So that's just a different story as CPUs and GPUs have completely different ideas of how caches should work. (CPUs aim for latency, GPUs for bandwidth + incredibly large register spaces with substantial hiding of latency thanks to large occupancies).

-------------

That being said: reorder buffers on CPUs are well over 400 instructions these days, and super-large cores (like Apple's M4) are apparently on the order of 600 to 800 instructions.

Reorder buffers are _NOT_ translation. They're Tomasulo's algorithm (https://en.wikipedia.org/wiki/Tomasulo%27s_algorithm). If you want to know how CPUs do out-of-order, study that.

I'd say CPUs have small register spaces (16 architectural registers, maybe 32), but large register files of maybe 300 or 400+. Tomasulo's algorithm is used to access registers out of order.

You should think of instructions like "mov rax, [memory]" as closer to "rax = malloc(register); delayed-load(rax, memory); Out-of-order execute all instructions that don't use RAX ahead of us in instruction stream".

Tomasulo's algorithm means using a ~300-register file to _pretend_ to be just 16 architectural registers. The 300 registers keep the data out-of-order and allow you to execute. Registers in modern CPUs are closer to unique_ptr<int> in C++: assigning to them frees the old register (aka: reorder buffer) and also mallocs a new register off the register file.
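To make the malloc/free analogy concrete, here is a toy register renamer in C++. It is only a sketch: the sizes (16 architectural, 300 physical) come from the comment above, the class and method names are hypothetical, and a real core tracks far more state (reservation stations, reorder buffer entries, in-flight results).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy register renamer: a few architectural registers backed by a larger
// physical register file. Sizes and names are illustrative assumptions.
constexpr int kArchRegs = 16;
constexpr int kPhysRegs = 300;

class Renamer {
 public:
  Renamer() {
    // Initially architectural register r maps to physical register r.
    for (int r = 0; r < kArchRegs; ++r) map_[r] = r;
    for (int p = kPhysRegs - 1; p >= kArchRegs; --p) free_list_.push_back(p);
  }

  // Reading a source operand: look up the current physical register.
  int Read(int arch) const { return map_[arch]; }

  // Writing a destination: "malloc" a fresh physical register and remap.
  // Returns the previous mapping, which may be freed once this write retires.
  int Write(int arch) {
    assert(!free_list_.empty());  // a real core would stall here instead
    const int previous = map_[arch];
    map_[arch] = free_list_.back();
    free_list_.pop_back();
    return previous;
  }

  // Retirement: nothing can observe the old physical register any more.
  void Retire(int previous_phys) { free_list_.push_back(previous_phys); }

  std::size_t FreeCount() const { return free_list_.size(); }

 private:
  std::array<int, kArchRegs> map_;
  std::vector<int> free_list_;
};
```

Two back-to-back writes to the same architectural register (say rax) land in different physical registers, so the second write does not have to wait for readers of the first. That is what lets the instructions around a slow load keep executing.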

I hope people aren't writing directly to AVX2. When using a wrapper such as Highway, you get exactly this kind of update after a recompile, or even just running your code on a CPU that supports newer instructions.

The cost is that the binary carries around both AVX2 and AVX-512 codepaths, but that is not an issue IMO.
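Roughly what "carrying both codepaths" looks like, as a hedged sketch. This is not Highway's actual API. `Target`, `SumChunks` and `ChooseSum` are hypothetical names, and the per-target kernels differ only in chunk width so that the example stays portable. A real build would compile each one with different instruction sets and query the CPU at startup.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical target list; a real library would detect this via CPUID.
enum class Target { kScalar, kAVX2, kAVX512 };

// Stand-in for per-target builds of one kernel. kLanes mimics the number of
// float lanes per vector (8 for AVX2, 16 for AVX-512).
template <std::size_t kLanes>
float SumChunks(const float* data, std::size_t n) {
  float lanes[kLanes] = {};
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t l = 0; l < kLanes; ++l) lanes[l] += data[i + l];
  }
  float total = 0.0f;
  for (std::size_t l = 0; l < kLanes; ++l) total += lanes[l];
  for (; i < n; ++i) total += data[i];  // remainder elements
  return total;
}

using SumFn = float (*)(const float*, std::size_t);

// Picks the codepath once, based on the best target the running CPU supports.
SumFn ChooseSum(Target best_supported) {
  switch (best_supported) {
    case Target::kAVX512: return &SumChunks<16>;
    case Target::kAVX2:   return &SumChunks<8>;
    case Target::kScalar: break;
  }
  return &SumChunks<1>;
}
```

The caller only ever sees `SumFn`, so the same binary picks the wider path on a newer CPU without a recompile. The price is the extra copies of the kernel in the binary.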

Many use cases for SIMD aren't trivially expressible through wrappers and abstractions. It is sometimes cleaner, easier, and produces more optimized codegen to write the intrinsics directly. It isn't ideal but it often produces the best result for the effort involved.

An issue with the abstractions that does not go away is that the optimal code architecture -- well above the level of the SIMD wrappers -- is dependent on the capabilities of the silicon. The wrappers can't solve for that. And if you optimize the code architecture for the silicon architecture, it quickly approximates writing architecture-specific intrinsics with an additional layer of indirection, which significantly reduces any notional benefit from the abstractions.

The wrappers can't abstract enough, and higher level abstractions (written with architecture aware intrinsics) are often too use case specific to reuse widely.

Wrappers can be zero-overhead, so any claim of better codegen vs the underlying intrinsics sounds dubious. "best result for the [higher] effort involved" also contradicts my experience, so I ask for evidence.

One counterexample: our portable vqsort [1] outperforms AVX-512-specific intrinsics [2].

I agree that high-level design may differ. You seem aware that Highway, and probably also other wrappers, supports specializing code for some target(s), but possibly misunderstand how, given the "additional layer of indirection" claim. Wrappers give you a portable baseline, and remove some of the potholes and ugly syntax, but boil down to inlined wrapper functions.

If you want to specialize, that is supported. And what is the downside? Even if you say the benefit of a wrapper is reduced vs manually written intrinsics (and reinventing all the workarounds for their missing instructions), do you not agree that the benefit is still nonzero?

[1]: https://github.com/google/highway/tree/master/hwy/contrib/so...
[2]: https://github.com/Voultapher/sort-research-rs/blob/38f37eef...
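A minimal illustration of "boil down to inlined wrapper functions", assuming a made-up 4-lane type. `Vec4f`, `Load`, `Store`, `MulAdd` and `ScaleAndOffset` are hypothetical and are not Highway's names. A real wrapper would hold a platform register type (for example `__m128` or `float32x4_t`) rather than an array.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical 4-lane vector wrapper, kept as a plain array for portability.
struct Vec4f {
  float lane[4];
};

inline Vec4f Load(const float* p) { return Vec4f{{p[0], p[1], p[2], p[3]}}; }

inline void Store(const Vec4f& v, float* p) {
  for (int i = 0; i < 4; ++i) p[i] = v.lane[i];
}

inline Vec4f MulAdd(const Vec4f& a, const Vec4f& b, const Vec4f& c) {
  Vec4f r;
  for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
  return r;
}

// Kernel written once against the wrapper; n is assumed to be a multiple of 4.
inline void ScaleAndOffset(const float* in, float* out, std::size_t n,
                           const Vec4f& scale, const Vec4f& offset) {
  assert(n % 4 == 0);
  for (std::size_t i = 0; i < n; i += 4) {
    Store(MulAdd(Load(in + i), scale, offset), out + i);
  }
}
```

Each wrapper function is small enough for the compiler to inline, so the kernel body ends up as the underlying operations with no call overhead. Specializing for one target means swapping what sits behind `Load` and `MulAdd`, not rewriting `ScaleAndOffset`.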

Most video encoders and decoders consist of kernels with hand written SIMD instructions/intrinsics.

Agreed. FWIW we demonstrated with JPEG XL (image codec, though also with animation 'video' support) that it is possible to write such kernels using the portable Highway intrinsics.

I would wager that most real world SIMD use is with direct intrinsics.

> I hope people aren't writing directly to AVX2.

Did you not read the article? It's using AVX intrinsics and NEON intrinsics.

I did, and I truly do not understand why some people do this. As shown in the reddit comments on this article [1], the initial intrinsics version was quite suboptimal and clearly worse than portable code [2].

When not busy unnecessarily rewriting everything for each ISA, it is easier to see and have time for vital optimizations such as unrolling :)

[1]: https://www.reddit.com/r/cpp/comments/1gzob1g/understanding_...
[2]: https://github.com/google/highway/blob/master/hwy/contrib/do...
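For readers wondering what unrolling buys in a dot product, here is a scalar sketch with four independent accumulators. It is not the code from either link. `DotUnrolled4` is a hypothetical name, and in real SIMD code each accumulator would be a whole vector rather than one float.

```cpp
#include <cassert>
#include <cstddef>

// Dot product with four independent accumulators. A single accumulator forms
// one long dependency chain; separate accumulators let the additions overlap.
float DotUnrolled4(const float* a, const float* b, std::size_t n) {
  float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    sum0 += a[i + 0] * b[i + 0];
    sum1 += a[i + 1] * b[i + 1];
    sum2 += a[i + 2] * b[i + 2];
    sum3 += a[i + 3] * b[i + 3];
  }
  float total = (sum0 + sum1) + (sum2 + sum3);
  for (; i < n; ++i) total += a[i] * b[i];  // remainder elements
  return total;
}
```

Note that the summation order differs from a simple left-to-right loop, so floating-point results can differ in the last bits for general inputs.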

This is not really fair or true. Nvidia changes the meaning of PTX when they want to. For example, warp thread divergence is something they implemented in an architecture revision, technically breaking existing code. With SM90 (Hopper) they have even started including unstable features in PTX, for which they make even fewer promises. And of course everyone who cares about performance is rewriting their kernels (or using someone else's rewritten kernels) for each new architecture. I honestly do not think it is fair to compare this to the CPU landscape, which has much stronger backwards compatibility guarantees.