NumPy has a whole dispatch mechanism to deal with the tradeoffs. The main problem is code bloat: how many microarchitectures are you going to support with dispatch at runtime?
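To make the code-bloat point concrete, here is a rough sketch of what runtime dispatch amounts to. It is not NumPy's actual implementation: every name below (`add_baseline`, `add_avx2`, `add_avx512`, `KERNELS`, `select_kernel`) is hypothetical, and the "kernels" are plain Python stand-ins for what would really be separately compiled copies of the same loop.

```python
from typing import Callable, List, Set, Tuple

Kernel = Callable[[List[float], List[float]], List[float]]


def add_baseline(a: List[float], b: List[float]) -> List[float]:
    # Portable fallback that must always exist.
    return [x + y for x, y in zip(a, b)]


def add_avx2(a: List[float], b: List[float]) -> List[float]:
    # Stand-in for a copy of the loop compiled with AVX2 enabled.
    return [x + y for x, y in zip(a, b)]


def add_avx512(a: List[float], b: List[float]) -> List[float]:
    # Stand-in for a copy of the loop compiled with AVX-512 enabled.
    return [x + y for x, y in zip(a, b)]


# Hypothetical dispatch table, ordered from most to least capable.
# Every extra target is another full copy of every kernel in the binary.
KERNELS: List[Tuple[str, Kernel]] = [
    ("avx512f", add_avx512),
    ("avx2", add_avx2),
]


def select_kernel(cpu_features: Set[str]) -> Kernel:
    # Pick the best kernel the running CPU supports, else the baseline.
    for feature, kernel in KERNELS:
        if feature in cpu_features:
            return kernel
    return add_baseline
```

With N supported targets and M kernels you ship roughly N × M compiled loops, which is where the bloat comes from.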
NumPy is interesting in that regard, since its dispatch mechanism adds a lot of per-call overhead. There are plenty of problems where a naive list comprehension is faster, even though SIMD could in principle be used to great effect.
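If you want to check this on your own machine, something like the following would do. It is only a sketch: the helper names (`scale_list`, `scale_numpy`, `compare`) and the chosen sizes are my own, and where the crossover lands, if anywhere, depends on the NumPy version, the CPU and the operation.

```python
import timeit

import numpy as np


def scale_list(values, factor):
    # Naive pure-Python version.
    return [v * factor for v in values]


def scale_numpy(values, factor):
    # Same operation as a single NumPy ufunc call on an ndarray.
    return np.multiply(values, factor)


def compare(n, repeats):
    # Returns (list_seconds, numpy_seconds) for `repeats` calls on n elements.
    data_list = [float(i) for i in range(n)]
    data_arr = np.array(data_list)
    t_list = timeit.timeit(lambda: scale_list(data_list, 2.0), number=repeats)
    t_numpy = timeit.timeit(lambda: scale_numpy(data_arr, 2.0), number=repeats)
    return t_list, t_numpy


if __name__ == "__main__":
    for n in (4, 64, 100_000):
        # Fewer repeats for large inputs so the run stays short.
        repeats = max(10, 200_000 // n)
        t_list, t_numpy = compare(n, repeats)
        print(f"n={n:>7}: list={t_list:.4f}s  numpy={t_numpy:.4f}s")
```

I'd expect the fixed per-call cost to matter most at the small sizes, but I'm deliberately not quoting numbers here; run it and see.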