It has gotten very tricky on x86 since using the really wide SIMD instructions can more than halve your clock speed for ALL instructions running on that particular core.
iirc that's only on some processors, and only when running AVX2/AVX-512 instructions, which are still faster than scalar code (depending on what you're doing around the math).
Things being tricky with AVX and cache-weirdness doesn't change the fact that if you're not vectorizing your arithmetic you're losing performance.
There's also an argument to be made that if you're writing performance-critical code, you shouldn't optimize it for mid/bottom-end hardware, but not everyone agrees.
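For anyone wondering what "vectorizing your arithmetic" looks like without hand-written intrinsics, here's a minimal sketch in C. The `saxpy` function and its parameter names are just my own illustration, not from anything above. The idea is to write the loop so the compiler is allowed to auto-vectorize it: `restrict` promises that `x` and `y` don't overlap. Whether you actually get SIMD code, and how wide, depends on your compiler and flags (something like `-O3 -march=native` on GCC/Clang), so check the generated assembly rather than taking my word for it.

```c
#include <stddef.h>

/*
 * Hypothetical example: y[i] = a * x[i] + y[i] for i in [0, n).
 * The restrict qualifiers tell the compiler x and y do not alias,
 * which removes one common obstacle to auto-vectorization.
 * This is plain scalar C; any SIMD comes from the optimizer.
 */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Each iteration is independent of the others, which is why a loop like this is usually an easy target for the vectorizer.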