by pixelpoet
973 days ago
I really hope those pow(x, 2) calls are getting turned into x * x, else it's a performance catastrophe / extreme beginner mistake even with vectorisation. Also, this kind of ultra-wide buffering consumes a ton of memory bandwidth for each operation, instead of keeping a small portion in cache/registers. FLOPs are scaling sort of infinitely, whereas memory speed is flat, so this is increasingly a losing game; just because it's faster than glacial Python doesn't mean it's fast compared to a language which actually concerns itself with performance or a more cache-aware approach. For an extreme example of how you can even sometimes beat ultra-optimised GPU ML libraries in this way, check out https://github.com/NVlabs/tiny-cuda-nn
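To make the two points above concrete, here is a minimal sketch in plain Python, with lists standing in for tensors. The function names and the `block` parameter are made up for illustration and are not from the blog post. It only shows the structural difference between materialising full-size intermediates and doing one fused pass with `d * d`; in pure Python it says nothing about real timings.

```python
# Illustrative sketch only: plain lists stand in for tensors.

def sum_sq_diff_buffered(a, b):
    """Each step materialises a full-length intermediate buffer."""
    diff = [x - y for x, y in zip(a, b)]   # pass 1: full-size temporary
    sq = [pow(d, 2) for d in diff]         # pass 2: second temporary, generic pow
    return sum(sq)                         # pass 3: reduction over the temporary

def sum_sq_diff_fused(a, b, block=4):
    """One pass over the inputs in small blocks, squaring with d * d."""
    total = 0.0
    for start in range(0, len(a), block):
        # Only one small block of each input is touched at a time.
        for x, y in zip(a[start:start + block], b[start:start + block]):
            d = x - y
            total += d * d                 # no full-size intermediates
    return total

a = [float(i) for i in range(10)]
b = [float(i % 3) for i in range(10)]
buffered = sum_sq_diff_buffered(a, b)
fused = sum_sq_diff_fused(a, b)
```

Both versions compute the same value; the fused one reads each input once and keeps only a running total, which is the shape a cache-aware or hand-written kernel would aim for.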
> I studied the CUDA traces closely and found that vectorization does indeed reduce many aspects of the GPU workload, greatly reducing the number of operations and decreasing the total amount of time spent on the fundamental computations of the algorithm. However it also introduces overhead (mentioned above) by interspersing operations that permute and reorder the tensors, or splitting them into groups then concatenating results. Sometimes the reduced “fundamental” time outweighs the additional overhead, while other times the overhead outweighs the reduction in fundamental time.
Here are some examples not included in the blog post:
- Total time spent in aten::cdist kernel
- Total time spent in aten::mul kernel
This nice little win applies to tons of other kernels, almost across the board. As you point out, CPU intuition suggests this should have been slower, so this was an interesting outcome.
On the other hand, some specific increases occur:
- Total time spent in aten::cat kernel
So working in fewer, larger batches doesn't only enable outrunning the GPU. It decreases the total GPU workload... then adds some overhead. But some of this overhead could be removed with custom CUDA kernels, so I think this is an interesting direction even if you solve the CPU problem some other way.
(The pow(x, 2) is only there in the toy code, not my actual kernel, so I didn't performance-tune it.)
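The trade-off described above (fewer, larger kernel calls versus extra gather and concatenate work) can be sketched with a toy model in plain Python. Everything here is hypothetical: `scale` stands in for a kernel, lists stand in for tensors, and the counters stand in for launches and copy traffic. Nothing is measured on a GPU.

```python
# Toy model: counters stand in for kernel launches and reordering traffic.

def scale(batch, factor, stats):
    """Hypothetical 'kernel': one call processes a whole batch."""
    stats["kernel_calls"] += 1
    return [v * factor for v in batch]

def per_item(items, factors, stats):
    # Unvectorised: one kernel call per element, no reordering needed.
    return [scale([v], f, stats)[0] for v, f in zip(items, factors)]

def grouped(items, factors, stats):
    # Vectorised: split indices into groups that share a factor.
    groups = {}
    for i, f in enumerate(factors):
        groups.setdefault(f, []).append(i)
    out = [None] * len(items)
    for f, idxs in groups.items():
        batch = [items[i] for i in idxs]        # gather into a batch (overhead)
        stats["elements_moved"] += len(batch)
        result = scale(batch, f, stats)         # one larger kernel call
        for i, r in zip(idxs, result):          # write back in order (overhead)
            out[i] = r
        stats["elements_moved"] += len(result)
    return out

items = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
factors = [2, 3, 2, 3, 2, 3]
stats_a = {"kernel_calls": 0, "elements_moved": 0}
stats_b = {"kernel_calls": 0, "elements_moved": 0}
out_a = per_item(items, factors, stats_a)
out_b = grouped(items, factors, stats_b)
```

In this toy run the grouped version makes fewer kernel calls but moves every element twice. Whether that is a net win depends on the relative costs, and a custom kernel that indexes in place could avoid the moves.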