| HN Mirror

There may be tricks that I don't know about. One quick experimental answer I can give: if I change to looping over the sums and rerun Benchmark 3, my time in the aten::sum CUDA kernel increases from 0.779s (before) to 0.840ms (after). So CUDA doesn't seem to automagically handle this.

I will note that these grouped operations occasionally cause a net loss in performance compared to "naive" looping, since it involves calling PyTorch's "x.view(...)" which is usually ~instant but sometimes adds some extra CUDA operations on the backward pass. It always reduces the time spent in aten::add, but adds these extra ops. A really smart vectorizer would use heuristics to decide how/whether to group operations according to the target hardware; my current vectorizer just does the grouping every time.