I wondered about this same thing. Your logic about cache/registers is certainly true on CPUs, but what about GPUs? Hence this blurb:

> I studied the CUDA traces closely and found that vectorization does indeed reduce many aspects of the GPU workload, greatly reducing the number of operations and decreasing the total amount of time spent on the fundamental computations of the algorithm. However it also introduces overhead (mentioned above) by interspersing operations that permute and reorder the tensors, or splitting them into groups then concatenating results. Sometimes the reduced “fundamental” time outweighs the additional overhead, while other times the overhead outweighs the reduction in fundamental time.

Here are some examples not included in the blog post:

- Total time spent in aten::cdist kernel

  - Baseline: 2.834s (4900 calls)
  - Vectorized: 2.686s (500 calls)

- Total time spent in aten::mul kernel

  - Baseline: 5.745s (80700 calls)
  - Vectorized: 5.555s (8100 calls)
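To make these totals easier to compare, here is a minimal sketch that turns them into per-call times and percentage savings. The numbers are copied from the lists above; the `timings` dictionary and the `summarize` helper are names made up for this illustration, not part of the profiler output.

```python
# Figures copied from the totals quoted above: (total seconds, call count).
# The dictionary layout and helper name are illustrative assumptions.
timings = {
    "aten::cdist": {"baseline": (2.834, 4900), "vectorized": (2.686, 500)},
    "aten::mul": {"baseline": (5.745, 80700), "vectorized": (5.555, 8100)},
}


def summarize(kernel):
    """Derive per-call time, call-count ratio, and total-time savings."""
    base_s, base_n = timings[kernel]["baseline"]
    vec_s, vec_n = timings[kernel]["vectorized"]
    return {
        "baseline_ms_per_call": 1000 * base_s / base_n,
        "vectorized_ms_per_call": 1000 * vec_s / vec_n,
        "call_ratio": base_n / vec_n,
        "total_time_saved_pct": 100 * (base_s - vec_s) / base_s,
    }


for kernel in timings:
    s = summarize(kernel)
    print(
        f"{kernel}: "
        f"{s['baseline_ms_per_call']:.3f} ms/call -> "
        f"{s['vectorized_ms_per_call']:.3f} ms/call, "
        f"{s['call_ratio']:.1f}x fewer calls, "
        f"{s['total_time_saved_pct']:.1f}% less total time"
    )
```

Each vectorized call takes longer than a baseline call (for cdist, roughly 0.58 ms versus 5.4 ms), but there are about ten times fewer of them, so the total still comes out slightly lower.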

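In these two kernels the total time drops by about 3 to 5 percent while the call count drops roughly tenfold, and the same kind of modest win shows up in many other kernels, almost across the board. As you point out, CPU intuition suggests the vectorized version should have been slower, so this was an interesting outcome.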

On the other hand, a few specific kernels get slower:

- Total time spent in aten::cat kernel

  - Baseline: 0.680s
  - Vectorized: 1.849s
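As a rough sanity check on how the overhead compares with the savings, the sketch below sets the extra aten::cat time against the time saved in the two kernels listed earlier. It only covers the three kernels quoted in this comment, so it is not a net result for the whole run; the variable names are illustrative.

```python
# Totals in seconds, copied from the figures quoted in this comment.
# Only three kernels are covered, so this is not a whole-run net figure.
baseline = {"aten::cdist": 2.834, "aten::mul": 5.745, "aten::cat": 0.680}
vectorized = {"aten::cdist": 2.686, "aten::mul": 5.555, "aten::cat": 1.849}

# Extra time spent concatenating in the vectorized version.
cat_extra = vectorized["aten::cat"] - baseline["aten::cat"]
cat_ratio = vectorized["aten::cat"] / baseline["aten::cat"]

# Time saved in the two "fundamental" kernels quoted above.
saved = sum(baseline[k] - vectorized[k] for k in ("aten::cdist", "aten::mul"))

print(f"aten::cat: +{cat_extra:.3f}s ({cat_ratio:.2f}x the baseline)")
print(f"cdist + mul: -{saved:.3f}s")
```

On these three kernels alone the cat overhead is larger than the savings, which is why the balance depends on how many other kernels also get the small win.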

So working in fewer, larger batches doesn't just let the CPU outrun the GPU. It also decreases the total GPU workload, and then adds some overhead of its own. Some of that overhead could be removed with custom CUDA kernels, so I think this is an interesting direction even if you solve the CPU problem some other way.

(The pow(x, 2) is only there in the toy code, not my actual kernel, so I didn't performance-tune it.)