by bcoates 973 days ago
"For example, what if the parallel sums are of different lengths? On GPUs, fast parallel reductions only work when inputs all have the same length. [...] Vexpr’s vectorizer groups the inputs by length and performs a reduced number of operations—one for each unique length."

I'm surprised this is necessary. I thought modern vectorization on both CPU and GPU handled heterogeneous cases like this handily, with conditional execution (on SIMT GPUs) or mask registers (on SIMD CPUs).
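
For illustration, here is a minimal sketch of what a masked reduction could look like at the PyTorch level: pad every input to the longest length, then mask out the padding before reducing. This is only an analogy for hardware predication or mask registers, not a description of how either works, and it is not taken from Vexpr. The function name `masked_sums` is hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def masked_sums(segments):
    """Sum variable-length 1-D tensors with one padded, masked reduction."""
    lengths = torch.tensor([s.numel() for s in segments])
    # Zero-pad to shape (num_segments, max_len).
    padded = pad_sequence(segments, batch_first=True)
    # mask[i, j] is True where column j holds a real element of segment i.
    mask = torch.arange(padded.shape[1]).unsqueeze(0) < lengths.unsqueeze(1)
    # With zero padding the mask is redundant for a sum; it is kept because
    # other reductions (e.g. a product or max) would need it.
    return (padded * mask).sum(dim=1)

print(masked_sums([torch.tensor([1., 2., 3.]),
                   torch.tensor([4., 5.]),
                   torch.tensor([6.])]))
```

The trade-off is wasted work on the padding, which grows as the lengths become more uneven.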

1 comments

There may be tricks that I don't know about. One quick experimental answer I can give: if I change to looping over the sums and rerun Benchmark 3, my time in the aten::sum CUDA kernel increases from 0.779s (before) to 0.840s (after). So CUDA doesn't seem to automagically handle this.

I will note that these grouped operations occasionally cause a net loss in performance compared to "naive" looping, since they involve calling PyTorch's "x.view(...)", which is usually ~instant but sometimes adds some extra CUDA operations on the backward pass. Grouping always reduces the time spent in aten::sum, but adds these extra ops. A really smart vectorizer would use heuristics to decide how/whether to group operations according to the target hardware; my current vectorizer just does the grouping every time.
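
A rough sketch of the looped and grouped approaches side by side, assuming the sums run over consecutive segments of one flat tensor. The names `segment_offsets`, `looped_sums`, and `grouped_sums` are made up for illustration, and this is not Vexpr's actual code. It only shows the idea of one "x.view(...)" plus one sum per unique length.

```python
from collections import defaultdict
import torch

def segment_offsets(lengths):
    """Start offset of each segment within the flat input."""
    offsets, total = [], 0
    for n in lengths:
        offsets.append(total)
        total += n
    return offsets

def looped_sums(x, lengths):
    """Naive baseline: one sum call per segment."""
    offsets = segment_offsets(lengths)
    return torch.stack([x[o:o + n].sum() for o, n in zip(offsets, lengths)])

def grouped_sums(x, lengths):
    """One sum call per unique segment length (lengths assumed positive)."""
    offsets = segment_offsets(lengths)
    groups = defaultdict(list)  # length -> positions of segments with that length
    for i, n in enumerate(lengths):
        groups[n].append(i)

    out = x.new_zeros(len(lengths))
    for n, seg_ids in groups.items():
        # Flat indices of every element in this group's segments.
        idx = torch.tensor([offsets[i] + j for i in seg_ids for j in range(n)],
                           dtype=torch.long)
        # Gather, view as (num_segments, n), and reduce all rows at once.
        sums = x[idx].view(len(seg_ids), n).sum(dim=1)
        # Write each result back to its segment's original position.
        out[torch.tensor(seg_ids)] = sums
    return out

x = torch.arange(10.0)
lengths = [3, 2, 3, 2]
print(looped_sums(x, lengths))   # four sum calls
print(grouped_sums(x, lengths))  # two sum calls, one per unique length
```

Both functions return the same values. The grouped version trades fewer reduction calls for the extra gather and view, which is where the extra backward-pass operations mentioned above can come from.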