Somewhat, but I think people vastly overestimate compilers' ability to auto-vectorize.

A common example: with any floating-point accumulation/reduction, compilers will almost entirely fail to generate SIMD unless you use flags like `-funsafe-math-optimizations` (or `-ffast-math`), because floating-point addition is not associative and vectorizing the loop changes the order of the additions. Sum of squares is the classic example (not saying that specific operation is used in NN).
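To make the associativity point concrete, here is a rough sketch in C. The function names are made up for illustration. The first loop is what people usually write; the second spells out by hand the reordering that a 4-lane vectorized reduction would perform, which is exactly what the compiler is not allowed to do on its own under strict floating-point rules.

```c
#include <assert.h>
#include <stddef.h>

/* Strict left-to-right sum. Every add depends on the previous one, and
   the compiler may not reorder the float additions without permission. */
float sum_squares_strict(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * x[i];
    return sum;
}

/* Four independent partial sums: the reassociation a 4-lane SIMD
   reduction performs, written out explicitly in scalar code. */
float sum_squares_reassoc(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; ++k)
            acc[k] += x[i + k] * x[i + k];
    /* Combine the partial sums, then handle the leftover elements. */
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; i < n; ++i)
        sum += x[i] * x[i];
    return sum;
}
```

The two functions can differ in the last bits on general input, which is why the compiler needs a flag before it will turn the first into the second.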

Explicit vectorization (e.g., using intrinsics) is almost always a relatively simple way to get orders of magnitude speedup compared to auto-vectorization, both because of the above and because data layouts usually need to change as well (AoS vs. SoA, etc.), though NN people seem to write decent data layouts.
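A minimal sketch of what that looks like, assuming x86 with SSE (the intrinsics come from `<xmmintrin.h>`; on other targets the code falls back to the scalar loop). The struct and function names are hypothetical. The point of the SoA layout is that four consecutive `x` values are contiguous, so a single load fills a vector register; with AoS they are interleaved with `y` and `z`.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h> /* SSE intrinsics, x86 only */
#endif

/* AoS: the x, y, z of one point are adjacent in memory. */
struct PointAoS { float x, y, z; };

/* SoA: each component lives in its own contiguous array. */
struct PointsSoA { const float *x, *y, *z; size_t n; };

/* Scalar reference over the AoS layout. */
float sum_sq_len_aos(const struct PointAoS *p, size_t n) {
    assert(p != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += p[i].x * p[i].x + p[i].y * p[i].y + p[i].z * p[i].z;
    return sum;
}

/* Explicitly vectorized sum of squared lengths over the SoA layout. */
float sum_sq_len_soa(struct PointsSoA p) {
    size_t i = 0;
    float sum = 0.0f;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (; i + 4 <= p.n; i += 4) {
        /* Unaligned loads: four x, four y, four z values at once. */
        __m128 x = _mm_loadu_ps(p.x + i);
        __m128 y = _mm_loadu_ps(p.y + i);
        __m128 z = _mm_loadu_ps(p.z + i);
        __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                                 _mm_mul_ps(z, z));
        acc = _mm_add_ps(acc, len2);
    }
    /* Horizontal sum of the four lanes. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
#endif
    /* Scalar tail (or the whole loop when SSE is unavailable). */
    for (; i < p.n; ++i)
        sum += p.x[i] * p.x[i] + p.y[i] * p.y[i] + p.z[i] * p.z[i];
    return sum;
}
```

Because you chose the summation order yourself, no unsafe-math flags are needed here.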

I don't have any experience with `#pragma omp`-type approaches, which may be a middle ground.
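For reference, going by the OpenMP spec, the sum-of-squares case would look roughly like this. Treat it as an untested-for-performance sketch; the function name is made up. The `reduction(+:sum)` clause is what gives the compiler permission to keep per-lane partial sums for this one loop, without relaxing floating-point rules for the whole file.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of squares with an OpenMP SIMD hint. Build with -fopenmp-simd
   (GCC/Clang) to honor the pragma; without it the pragma is ignored
   and this is an ordinary scalar loop. */
float sum_squares_omp(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * x[i];
    return sum;
}
```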