by bcoates
973 days ago
"For example, what if the parallel sums are of different lengths? On GPUs, fast parallel reductions only work when inputs all have the same length. [...] Vexpr’s vectorizer groups the inputs by length and performs a reduced number of operations—one for each unique length." I'm surprised this is necessary. I thought modern vectorization on both CPU and GPU handled heterogeneous cases like this handily, with conditional execution (on SIMT GPUs) or mask registers (on SIMD CPUs).
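
To make the masked alternative concrete, here is a minimal sketch of a padded-and-masked reduction. It uses NumPy rather than PyTorch or real SIMD mask registers, and `masked_sums` is a hypothetical helper name, not anything from Vexpr. It only illustrates the idea of doing one reduction over uneven rows by masking out the padding.

```python
import numpy as np

def masked_sums(rows):
    """Sum variable-length rows by padding to a common length and masking.

    Illustrative only: `masked_sums` is a hypothetical helper, not Vexpr code.
    """
    max_len = max(len(r) for r in rows)
    padded = np.zeros((len(rows), max_len))
    mask = np.zeros((len(rows), max_len), dtype=bool)
    for i, r in enumerate(rows):
        padded[i, :len(r)] = r
        mask[i, :len(r)] = True
    # One reduction over the padded array; masked-out slots contribute 0.
    return np.where(mask, padded, 0.0).sum(axis=1)

rows = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0]]
print(masked_sums(rows))
```

The cost of this approach is the wasted work on padding, which grows when the lengths differ a lot.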
I will note that these grouped operations occasionally cause a net loss in performance compared to "naive" looping, since the grouping involves calling PyTorch's "x.view(...)", which is usually nearly instant but sometimes adds extra CUDA operations on the backward pass. Grouping always reduces the time spent in aten::add, but it adds these extra ops. A really smart vectorizer would use heuristics to decide whether and how to group operations according to the target hardware; my current vectorizer just does the grouping every time.
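
For readers who want to see the shape of the grouping, here is a rough sketch of "one reduction per unique length". It is not Vexpr's actual vectorizer: it uses NumPy instead of PyTorch, gathers segments with index arrays rather than "x.view(...)", and `grouped_sums` is a hypothetical name. It assumes the segments are laid out back to back in a flat array.

```python
from collections import defaultdict
import numpy as np

def grouped_sums(x, lengths):
    """Sum consecutive segments of x, one batched reduction per unique length.

    Illustrative only: `grouped_sums` is a hypothetical helper, not Vexpr code.
    """
    x = np.asarray(x, dtype=float)
    # Record where each segment starts and which output slot it fills.
    starts = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    groups = defaultdict(list)
    for out_idx, (start, n) in enumerate(zip(starts, lengths)):
        groups[n].append((out_idx, start))
    out = np.empty(len(lengths))
    for n, members in groups.items():
        out_idx = [m[0] for m in members]
        # Gather all segments of length n into one (count, n) block.
        idx = np.array([m[1] for m in members])[:, None] + np.arange(n)
        out[out_idx] = x[idx].sum(axis=1)
    return out

# Three segments with lengths 2, 3, 2: two unique lengths, so two reductions.
print(grouped_sums([1, 2, 3, 4, 5, 6, 7], [2, 3, 2]))
```

The gather step here plays the role of the reshaping overhead described above: it shrinks the number of reductions but adds its own work, which is why grouping is not always a win.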