by quotemstr 554 days ago
Not everything in every program is performance critical. A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs. That's as much BS on GPU as it is on CPU. In CPU land, we moved past this sophomoric attitude decades ago. The GPU world might catch up one day.

Are you planning on putting fopen() in an inner loop or something? LOL

4 comments

The whole reason CUDA/GPUs are fast is that they explicitly don’t match the architecture of CPUs. The truly sophomoric attitude is that all compute devices should work like CPUs. The point of CUDA/GPUs is to provide a different set of abstractions than CPUs that enable much higher performance for certain problems. Forcing your GPU to execute CPU-like code is a bad abstraction.

Your comment about putting fopen in an inner loop really betrays that. Every thread in your GPU kernel is going to have to wait for your libc call. You’re really confused if you’re talking about hot loops in a GPU kernel.

> A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs.

You're talking to the wrong people; this is definitely not true in general.

Genuinely asking: where else should ML engineers focus their time, if not on looking at datapath bottlenecks in either kernel execution or the networking stack?

The point is that you should focus on the bottlenecks, not on making every random piece of code "as fast as possible". And that sometimes other things (maintainability, comprehensibility, debuggability) are more important than maximum possible performance, even on the GPU.

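To make "focus on the bottlenecks" concrete, here is a minimal sketch of measuring before optimizing. It is plain CPU-side Python rather than CUDA, and `cold_setup`, `hot_loop`, and `measure` are hypothetical names made up for illustration; the same idea applies to timing the stages of a GPU program with whatever profiler you already use.

```python
import time


def cold_setup():
    # Hypothetical one-time setup: runs once and is cheap.
    return list(range(1000))


def hot_loop(data, repeats=2000):
    # Hypothetical inner loop: runs many times and dominates the runtime.
    total = 0
    for _ in range(repeats):
        total += sum(data)
    return total


def measure():
    # Time each stage separately so the numbers show where effort should go.
    timings = {}

    start = time.perf_counter()
    data = cold_setup()
    timings["cold_setup"] = time.perf_counter() - start

    start = time.perf_counter()
    result = hot_loop(data)
    timings["hot_loop"] = time.perf_counter() - start

    return result, timings


if __name__ == "__main__":
    result, timings = measure()
    # Print the slowest stage first.
    for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {seconds:.6f}s")
```

If the setup stage barely registers, uglifying it for speed buys nothing; the loop is where optimization effort pays off.
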
That's fair, but I didn't understand OP to be claiming above that "cudaheads" aren't looking at their performance bottlenecks before driving work, just that they're looking at the problem incorrectly (and, e.g., maybe should prioritize redesigns over squeezing perf out of flawed approaches).
> A pattern I've noticed repeatedly among CUDAheads is the idea that "every cycle matters" and therefore we should uglify and optimize even cold parts of our CUDA programs

I don't know what a "cudahead" is, but if you're gonna build up a strawman just to chop it down, have at it. That doesn't change anything about my point: these aren't syscalls because there's no sys. The dev here literally spells it out correctly, so I don't understand why there's any debate.