
SimpleChains does not call out to GEMM. All layers are currently implemented as naive for loops. It uses LoopVectorization.jl to compile them, which does a good job of leveraging AVX512, much better than LLVM or GCC.
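To make "naive for loops" concrete, here is a minimal sketch of a matrix multiply written as a plain loop nest instead of a GEMM call. It is plain Python, not Julia, and it is not SimpleChains' actual code; the name `matmul_naive` and the flat column-major list layout are assumptions made for illustration.

```python
def matmul_naive(A, B, M, K, N):
    """C = A * B with A (M x K) and B (K x N), stored as column-major flat lists."""
    C = [0.0] * (M * N)
    for n in range(N):
        for k in range(K):
            b = B[k + K * n]
            # Innermost loop runs over m, so A and C are read with stride 1.
            for m in range(M):
                C[m + M * n] += A[m + M * k] * b
    return C


# 2x2 example in column-major order:
# A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
A = [1.0, 3.0, 2.0, 4.0]
B = [5.0, 7.0, 6.0, 8.0]
C = matmul_naive(A, B, 2, 2, 2)
```

A loop nest like this is the kind of input a loop vectorizer works on; there is no blocking or packing anywhere in it.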

It doesn't pack, which is why column-major A' * B is slow: https://juliasimd.github.io/LoopVectorization.jl/latest/exam... It's also why SimpleChains wouldn't scale to large matrix multiplies. But with a 1 MiB L2 cache (e.g. Skylake-X), a 512x256 dense layer of Float32 (512 KiB of weights) is still small enough to not really need packing, so I haven't needed to implement it yet. I will eventually, in a future version of a rewritten LoopVectorization that also actually adds dependency analysis. For an ML library, I'd implement packing by changing the data layout of the parameter vector to tile-major, skipping runtime packing altogether (i.e., the data would be pre-packed). Only extremely large arrays benefit from a second packing level, so I don't think it's worthwhile; smaller batch sizes would avoid the need.
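As a rough illustration of the pre-packed, tile-major idea, the sketch below reorders a column-major weight matrix once, up front, so that each row tile is stored contiguously, and then multiplies straight out of that layout. It is plain Python under stated assumptions: the names `pack_tile_major` and `matvec_tile_major` are hypothetical, the tile height is arbitrary, and `M` is assumed to be a multiple of `tile_m`. Nothing here is taken from SimpleChains or LoopVectorization.

```python
def pack_tile_major(W, M, K, tile_m):
    """Reorder a column-major M x K matrix into row tiles of height tile_m.

    Within a tile, entries are stored column by column, so a kernel that
    walks one tile reads memory with stride 1 instead of stride M.
    Assumes M is a multiple of tile_m (an illustrative simplification).
    """
    assert M % tile_m == 0
    P = []
    for t in range(M // tile_m):
        for k in range(K):
            for i in range(tile_m):
                P.append(W[(t * tile_m + i) + M * k])
    return P


def matvec_tile_major(P, x, M, K, tile_m):
    """y = W * x, reading W from the tile-major buffer P front to back."""
    y = [0.0] * M
    idx = 0
    for t in range(M // tile_m):
        for k in range(K):
            xk = x[k]
            for i in range(tile_m):
                y[t * tile_m + i] += P[idx] * xk
                idx += 1
    return y


# 4 x 3 example with W[m, k] = 10*m + k, stored column-major.
M, K, tile_m = 4, 3, 2
W = [float(10 * m + k) for k in range(K) for m in range(M)]
P = pack_tile_major(W, M, K, tile_m)
y = matvec_tile_major(P, [1.0, 2.0, 3.0], M, K, tile_m)
```

If the parameters live in this layout permanently, the reordering happens once when the parameter vector is created rather than on every multiply, which is the point of pre-packing.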

The benchmark plots I linked above used dynamically sized arrays. There, LoopVectorization.jl performed better than MKL at small sizes, and much better than everything else. Compared to that benchmark, in SimpleChains LV can also specialize on compile-time sizes, fuse the addition of the bias vector, and fuse the activation function.
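A minimal sketch of what fusing the bias and activation into the loop nest looks like, compared with running them as separate passes after a standalone matrix multiply. This is plain Python for illustration only: `dense_fused`, `dense_unfused`, and the choice of ReLU are assumptions, and compile-time size specialization is not shown because it has no direct Python analogue.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def dense_fused(W, b, X, M, K, N, act=relu):
    """Y = act(W * X + b) in one pass; all matrices are column-major flat lists."""
    Y = [0.0] * (M * N)
    for n in range(N):
        for m in range(M):
            acc = b[m]  # bias is folded into the accumulator's initial value
            for k in range(K):
                acc += W[m + M * k] * X[k + K * n]
            Y[m + M * n] = act(acc)  # activation applied before the single store
    return Y


def dense_unfused(W, b, X, M, K, N, act=relu):
    """Same result, but as three separate passes over the output."""
    Y = [0.0] * (M * N)
    for n in range(N):  # pass 1: standalone matrix multiply
        for m in range(M):
            for k in range(K):
                Y[m + M * n] += W[m + M * k] * X[k + K * n]
    for n in range(N):  # pass 2: add the bias
        for m in range(M):
            Y[m + M * n] += b[m]
    return [act(v) for v in Y]  # pass 3: apply the activation


# W = [[1, -2], [3, 4]] in column-major order, one input column.
W = [1.0, 3.0, -2.0, 4.0]
b = [0.5, -1.0]
X = [1.0, 2.0]
Y = dense_fused(W, b, X, 2, 2, 1)
```

The fused version writes each output element once; the unfused version reads and rewrites the whole output two extra times, which matters more as the output gets small relative to call overhead.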