|
|
|
|
|
by dist-epoch
573 days ago
|
|
> If you just wrote your SIMD in CUDA 15 years ago, NVidia compilers would have given you maximum performance across all NVidia GPUs That's not true. For maximum performance you need to tweak the code to a particular GPU model/architecture. Intel has SSE/AVX/AVX2/AVX512, but CUDA has like 10 iterations of this (increasing capabilities). Code written 15 years ago would not use modern capabilities, like more flexible memory access, atomics. |
|
But CUDA -> PTX intermediate code has allowed for significantly more flexibility. For crying out loud, the entire machine code (aka SASS) of NVidia GPUs has been cycled out at least 4 times in the past decade (128-bit bundles, changes to instruction formats, acquire/release semantics, etc etc)
It's amazing what backwards compatibility NVidia has achieved in the past 15 years thanks to this architecture. SASS changes so dramatically from generation to generation but the PTX intermediate code has stayed highly competitive.