by sjkelly 1531 days ago
This is one of the benefits of a capable JIT for numerical computing. Even including compile time, the overall execution time is way lower. Chris Elrod is an AVX2/512 whisperer.

JIT-compiled code is way faster than linear algebra kernels which are completely vectorized by GCC et al? How does that work?
If you only vectorize the linear algebra, you leave performance on the table. Vectorizing fused operations reduces the number of passes over memory. Also, knowing the sizes (which are chosen at runtime) is necessary to make optimal decisions.
I was responding to a blanket statement alluding to AVX512 (which is rather a can of worms). I don't understand what this is on about, especially with reference to the article. Of course you account for matrix dimensions at run time, like to decide whether or not to do packing in GEMM, whether to call out to GEMM for matmul, etc.
SimpleChains does not call out to GEMM. All layers are currently implemented as naive for loops. It uses LoopVectorization.jl to compile them, which does a good job leveraging AVX512 -- much better than LLVM or GCC.

It doesn't pack, which is why column major A' * B is slow: https://juliasimd.github.io/LoopVectorization.jl/latest/exam... And also why SimpleChains wouldn't scale to large matrix multiplies. But, with 1 MiB L2 cache (e.g. Skylake-X), a 512x256 dense layer of Float32 is still small enough to not really need packing, so I haven't yet needed to implement it (but I will eventually, in a future version of a rewritten LoopVectorization that also actually adds dependency analysis). For an ML library, I'd implement packing by changing the data layout of the parameter vector to tile major, which skips runtime packing altogether (i.e., the data would be pre-packed). Only extremely large arrays benefit from a second packing level, so I don't think it's worthwhile; smaller batch sizes would avoid the need.
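
To make the sizes and the layout idea concrete, here is a rough sketch in plain Python (not Julia, and not how SimpleChains or LoopVectorization.jl store anything). The function names, the 2x2 tile shape, and the tile ordering are all assumptions chosen for illustration. The first function only counts the weight matrix, not the activations or the bias.

```python
def footprint_bytes(rows, cols, bytes_per_elem=4):
    """Size of a dense rows x cols matrix; 4 bytes per element for Float32."""
    return rows * cols * bytes_per_elem


def pack_tile_major(a, rows, cols, tile_rows, tile_cols):
    """Reorder a column-major flat list into a (hypothetical) tile-major order.

    Tiles are visited one column of tiles at a time; inside a tile the
    elements stay column-major. Assumes the tile sizes divide the matrix sizes.
    """
    assert rows % tile_rows == 0 and cols % tile_cols == 0
    packed = []
    for tj in range(0, cols, tile_cols):            # first column of the tile
        for ti in range(0, rows, tile_rows):        # first row of the tile
            for j in range(tj, tj + tile_cols):     # column inside the tile
                for i in range(ti, ti + tile_rows): # row inside the tile
                    packed.append(a[i + j * rows])  # column-major source index
    return packed


def tile_major_index(i, j, rows, tile_rows, tile_cols):
    """Position of element (i, j) in the list produced by pack_tile_major."""
    tiles_per_col = rows // tile_rows
    tile_id = (j // tile_cols) * tiles_per_col + (i // tile_rows)
    within = (j % tile_cols) * tile_rows + (i % tile_rows)
    return tile_id * tile_rows * tile_cols + within


if __name__ == "__main__":
    # 512 * 256 * 4 bytes, i.e. half of a 1 MiB L2 cache
    print(footprint_bytes(512, 256))
    print(pack_tile_major(list(range(16)), 4, 4, 2, 2))
```

If the parameters were stored in an order like this from the start, each tile would already be contiguous in memory, so a kernel could read it directly instead of copying it into a packing buffer at run time.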

The benchmark plots I linked above used dynamically sized arrays. LoopVectorization.jl performed better at small sizes than MKL, and much better than everything else. Beyond what those benchmarks show, LV can also specialize on compile-time sizes, fuse the addition of the bias vector, and fuse the activation function.
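
For anyone unsure what fusing the bias and activation means here, a minimal sketch in plain Python follows. It is illustrative only: the function names, the column-major weight layout, and the choice of ReLU are assumptions, and it shows the loop structure rather than anything about how LoopVectorization.jl generates code.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def dense_unfused(w, x, b, rows, cols):
    """Three passes over the output: matvec, then bias add, then activation."""
    y = [0.0] * rows
    for j in range(cols):
        for i in range(rows):
            y[i] += w[i + j * rows] * x[j]   # w is a column-major flat list
    y = [y[i] + b[i] for i in range(rows)]   # second pass over y
    return [relu(v) for v in y]              # third pass over y


def dense_fused(w, x, b, rows, cols):
    """One pass over the output: accumulate, add bias, apply activation."""
    y = [0.0] * rows
    for i in range(rows):
        acc = b[i]                           # bias folded into the accumulator
        for j in range(cols):
            acc += w[i + j * rows] * x[j]
        y[i] = relu(acc)                     # activation applied before the store
    return y


if __name__ == "__main__":
    w = [1.0, -2.0, 3.0, 4.0]                # 2x2 matrix, column-major
    x = [1.0, 2.0]
    b = [0.5, -10.0]
    print(dense_unfused(w, x, b, 2, 2))
    print(dense_fused(w, x, b, 2, 2))
```

In the fused version each output element is written once, after the bias and activation have been applied, instead of being stored and reloaded between steps. The two versions add terms in a different order, so with general floating-point inputs they can differ in the last bits.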