| Agree. Applying a plain Python for loop to a NumPy array to do simple math is just pure nonsense. I just tested how it would go without the compilation nonsense. ``` a = np.random.random(int(1e6)) %timeit np.average(a) %timeit np.average(a[::16]) ``` My result is that no matter how non-contiguous the access is in memory (here I take every 16th element like they did, and I also tested strides of 2, 4, 8 and 16), we are doing fewer operations, so it always ends up faster. In contrast, their SIMD-compiled code is 10-20X slower in the non-contiguous case. And for a larger array that is 16X the size of the contiguous one, where we only take 1/16 of its elements, the result is about 10X slower, as shown in the article. But I suspect that is purely because you now have a 16X larger array to load from memory, which is slow by nature. ``` b = np.random.random(int(16e6)) %timeit np.average(b[::16]) ``` Which leads me to conclude that people should use NumPy the right way. It is really hard to beat pure NumPy speed. |
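For anyone who wants to rerun the comparison above outside IPython, here is a rough sketch using the standard `timeit` module in place of the `%timeit` magic. The helper name `time_average`, the repeat counts, and the fixed seed are my own choices for illustration, not anything from the article, and the numbers will vary by machine.

```python
import timeit

import numpy as np


def time_average(arr, repeats=5, number=10):
    """Return the best per-call time (seconds) of np.average(arr)."""
    timings = timeit.repeat(lambda: np.average(arr), repeat=repeats, number=number)
    return min(timings) / number


rng = np.random.default_rng(0)  # fixed seed, purely for repeatability
a = rng.random(1_000_000)       # the contiguous baseline array
b = rng.random(16_000_000)      # 16X larger array, about 128 MB of float64

cases = {
    "a, contiguous (1,000,000 elements)": a,
    "a[::16], strided view (62,500 elements)": a[::16],
    "b[::16], strided view (1,000,000 elements)": b[::16],
}

for label, arr in cases.items():
    # Slicing with a step returns a view, so no copy is made before timing.
    print(f"{label}: {time_average(arr) * 1e6:.1f} us per call")
```

The first two cases compare the full array against a view over 1/16 of it; the third averages as many elements as the first, but spread across a 16X larger buffer, which is the case I suspect is limited by memory traffic.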