
iirc that's only with some processors using AVX2/AVX-512 instructions, which are still faster than scalar code (depending on what you're doing around the math).

AVX and cache weirdness being tricky doesn't change the fact that if you're not vectorizing your arithmetic, you're losing performance.
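
For what it's worth, here's a rough sketch of what "vectorizing your arithmetic" can look like without touching intrinsics. The function names and the 8-lane chunk width are my own assumptions for illustration (8 `f32` lanes lines up with a 256-bit AVX2 register). Whether the compiler actually emits SIMD for either version depends on the optimization level and target flags, so check the generated assembly before trusting it.

```rust
/// Scalar baseline: one multiply-add per loop iteration.
/// (Hypothetical helper, named for this example only.)
fn saxpy_scalar(a: f32, x: &[f32], y: &mut [f32]) {
    let n = x.len().min(y.len());
    for i in 0..n {
        y[i] = a * x[i] + y[i];
    }
}

/// Same arithmetic, restructured into fixed-width chunks.
/// The fixed inner trip count makes it easier for the optimizer to
/// map each chunk onto SIMD registers; it does not guarantee it.
fn saxpy_chunked(a: f32, x: &[f32], y: &mut [f32]) {
    // Assumption: 8 lanes of f32, i.e. one 256-bit register.
    const LANES: usize = 8;

    // Trim both slices to a common length so the chunks line up.
    let n = x.len().min(y.len());
    let (x, y) = (&x[..n], &mut y[..n]);

    let mut xc = x.chunks_exact(LANES);
    let mut yc = y.chunks_exact_mut(LANES);

    // Full chunks: the inner loop has a compile-time-known length.
    for (xs, ys) in (&mut xc).zip(&mut yc) {
        for i in 0..LANES {
            ys[i] = a * xs[i] + ys[i];
        }
    }

    // Leftover elements (fewer than LANES) are handled one by one.
    for (xv, yv) in xc.remainder().iter().zip(yc.into_remainder()) {
        *yv = a * *xv + *yv;
    }
}
```

Both functions do the same per-element math in the same order, so they should give identical results; the only difference is how much help the optimizer gets. Building with something like `-C opt-level=3 -C target-cpu=native` is what lets it use the wider instructions in the first place.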

There's also an argument to be made that if you're writing performance-critical code, you shouldn't optimize it for mid/bottom-end hardware, but not everyone agrees.