|
|
|
|
|
by dragontamer
572 days ago
|
|
Maximum performance? Okay, you'll have to upgrade to ballot instructions or whatever and rearchitect your algorithms. (Or other wavefront / voting / etc. etc. new instructions that have been invented. Especially those 4x4 matrix multiplication AI instructions). But CUDA -> PTX intermediate code has allowed for significantly more flexibility. For crying out loud, the entire machine code (aka SASS) of NVidia GPUs has been cycled out at least 4 times in the past decade (128-bit bundles, changes to instruction formats, acquire/release semantics, etc etc) It's amazing what backwards compatibility NVidia has achieved in the past 15 years thanks to this architecture. SASS changes so dramatically from generation to generation but the PTX intermediate code has stayed highly competitive. |
|
Which is the same with PTX, right? If you didn't use the tensor core instructions or wavefront voting in the CUDA code, the PTX generated from it will not either, and NVIDIA will not magically add those capabilities in when compiling to SASS.
Maybe it remains competitive because the code is inherently parallel anyway, so it will naturally scale to fill the extra execution units of the GPU, which is where most of the improvement is generation to generation.
While AVX code can't automatically scale to use the AVX512 units.