by gnufx
1531 days ago
I was responding to a blanket statement alluding to AVX512 (which is rather a can of worms). I don't understand what this is on about, especially with reference to the article. Of course you account for matrix dimensions at run time, for example to decide whether or not to do packing in GEMM, whether to call out to GEMM for matmul, etc.
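To make the run-time decision concrete, here is a minimal sketch of a dispatch on matrix dimensions. The function name, the strategy labels, and both thresholds are hypothetical and chosen only for illustration; real libraries tune such cutoffs per microarchitecture.

```python
# Hypothetical thresholds, for illustration only.
L2_BYTES = 1 << 20   # assumed 1 MiB L2 cache
SMALL_DIM = 16       # assumed cutoff below which a plain loop nest is used


def choose_matmul_strategy(m, k, n, elem_bytes=4, l2_bytes=L2_BYTES):
    """Pick a strategy for an (m x k) * (k x n) product from its dimensions.

    All three labels are made up for this sketch; they do not name
    functions in any real library.
    """
    if max(m, k, n) <= SMALL_DIM:
        # Tiny problem: calling out to GEMM costs more than it saves.
        return "inline-loops"
    a_bytes = m * k * elem_bytes  # footprint of the left operand
    if a_bytes <= l2_bytes:
        # Operand fits in the assumed L2, so skip the packing step.
        return "gemm-no-packing"
    return "gemm-with-packing"


if __name__ == "__main__":
    for dims in [(8, 8, 8), (512, 256, 64), (2048, 2048, 2048)]:
        print(dims, choose_matmul_strategy(*dims))
```

The point is only that the branch costs a few comparisons per call, which is negligible next to the multiply itself.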
It doesn't pack, which is why column major A' * B is slow: https://juliasimd.github.io/LoopVectorization.jl/latest/exam... It is also why SimpleChains wouldn't scale to large matrix multiplies. But with 1 MiB of L2 cache (e.g. Skylake-X), a 512x256 dense layer of Float32 is still small enough to not really need packing, so I haven't yet needed to implement it (but I will eventually, in a future version of a rewritten LoopVectorization that also actually adds dependency analysis). For an ML library, I'd implement packing by changing the data layout of the parameter vector to tile major, which skips any runtime packing altogether (i.e., the data would be pre-packed). Only extremely large arrays benefit from a second packing level, so I don't think it's worthwhile; smaller batch sizes would avoid the need.
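For concreteness: 512 x 256 Float32 values occupy 512 * 256 * 4 = 524,288 bytes (512 KiB), i.e. half of a 1 MiB L2. The pre-packed "tile major" idea can be sketched as below. This is plain Python for illustration, not how LoopVectorization.jl or SimpleChains store anything; the function names, the row-major input, and the tile shape are all assumptions.

```python
def pack_tile_major(a, rows, cols, tr, tc):
    """Reorder a row-major flat list so that each tr x tc tile is contiguous.

    Tiles are emitted left to right, then top to bottom; elements inside a
    tile stay row-major. Assumes tr divides rows and tc divides cols.
    """
    assert rows % tr == 0 and cols % tc == 0
    out = []
    for ti in range(0, rows, tr):          # top row of the current tile
        for tj in range(0, cols, tc):      # left column of the current tile
            for i in range(ti, ti + tr):   # copy the tile one row at a time
                start = i * cols + tj
                out.extend(a[start:start + tc])
    return out


def tile_major_index(i, j, cols, tr, tc):
    """Flat position of logical element (i, j) in the packed layout."""
    tiles_per_row = cols // tc
    tile = (i // tr) * tiles_per_row + (j // tc)
    return tile * tr * tc + (i % tr) * tc + (j % tc)


if __name__ == "__main__":
    # Footprint of the layer size mentioned above (4 bytes per Float32).
    print(512 * 256 * 4, "bytes")

    a = list(range(16))                    # 4 x 4 matrix, row-major
    packed = pack_tile_major(a, 4, 4, 2, 2)
    print(packed)
```

If the parameters are stored in this order once, at initialization, the kernel can read each tile contiguously and no packing happens at run time.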
The benchmark plots I linked above used dynamically sized arrays. LoopVectorization.jl performed better at small sizes than MKL, and much better than everything else. Compared to that application, LV can also specialize on compile-time sizes, fuse the addition of the bias vector, and fuse the activation function.
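As a rough picture of what fusing the bias and activation means, here is a sketch of a dense layer for a single input vector where both happen inside the same loop nest as the multiply, so the output is written exactly once. It is illustrative Python only and not LoopVectorization.jl's API; `dense_fused` and `relu` are hypothetical names, and the row-major weight layout is an assumption.

```python
def relu(v):
    """Example activation; any scalar function would do."""
    return v if v > 0.0 else 0.0


def dense_fused(w, b, x, rows, cols, act=relu):
    """Compute y[i] = act(b[i] + sum_j w[i, j] * x[j]) in one pass.

    w is a row-major flat list of rows * cols weights. The bias seeds the
    accumulator and the activation is applied before the single store, so
    no intermediate W*x or W*x + b vector is ever materialized.
    """
    y = [0.0] * rows
    for i in range(rows):
        acc = b[i]                         # bias fused into the accumulator
        for j in range(cols):
            acc += w[i * cols + j] * x[j]
        y[i] = act(acc)                    # activation fused into the store
    return y


if __name__ == "__main__":
    w = [1.0, 2.0, 3.0, 4.0]               # 2 x 2 weights
    b = [0.5, -20.0]
    x = [1.0, 1.0]
    print(dense_fused(w, b, x, 2, 2))
```

An unfused version would do the same arithmetic in three passes (matmul, add bias, apply activation), reading and writing the output vector each time.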