Agreed. Applying a plain Python for loop to a NumPy array to do simple math is just pure nonsense.
I just tested how it would go without the compile nonsense.
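```
import numpy as np

a = np.random.random(int(1e6))
# IPython/Jupyter line magics; run each one separately
%timeit np.average(a)          # contiguous, 1e6 elements
%timeit np.average(a[::16])    # strided view, every 16th element
```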
My result: no matter how non-contiguous the access is (here I take every 16th element like they did, and I also tested strides of 2, 4, 8, and 16), we are doing fewer operations, so it always ends up faster. By contrast, their SIMD-compiled code is 10-20X slower in the non-contiguous case.
And for a larger array that is 16X the size of the contiguous one, where we only take 1/16 of its elements, the result is about 10X slower, as the article shows. But I suspect that is purely because you now have a 16X larger array to load from memory, which is inherently slow.
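For anyone wondering what the slice actually is in memory, here is a quick check (just a sketch, assuming the default float64 dtype that `np.random.random` returns). It shows that `a[::16]` is a view into the original buffer with a 128-byte stride, not a packed copy:

```python
import numpy as np

a = np.random.random(int(1e6))   # float64, 8 bytes per element
view = a[::16]                   # strided view, no data is copied

print(view.size)                   # 62500 elements (1e6 / 16)
print(view.strides)                # (128,) bytes between elements: 16 * 8
print(view.flags['C_CONTIGUOUS'])  # False: elements are not adjacent
print(np.shares_memory(a, view))   # True: same underlying buffer
```

So the reduction touches 1/16 of the elements, but each one sits on a different 64-byte cache line (assuming the usual cache line size).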
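```
b = np.random.random(int(16e6))
# Same number of elements averaged as a full pass over `a`, spread over a 16X larger buffer
%timeit np.average(b[::16])
```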
The conclusion is that people should use NumPy the right way. It is really hard to beat pure NumPy speed.
But that's precisely what makes this a good exercise: you can see how far you can close the gap between the naive looping implementation and the optimized array implementation.
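If you want to reproduce this outside IPython, something like the following should work as a plain script. It is only a rough sketch: `best_time` is a helper name I made up, the repeat counts are arbitrary, and the numbers will depend heavily on your machine and cache sizes.

```python
import timeit
import numpy as np

def best_time(fn, repeat=5, number=10):
    """Return the best per-call time in seconds over several repeats."""
    return min(timeit.repeat(fn, repeat=repeat, number=number)) / number

small = np.random.random(int(1e6))    # about 8 MB of float64
large = np.random.random(int(16e6))   # about 128 MB of float64

results = {
    "contiguous, 1e6 elements": best_time(lambda: np.average(small)),
    "stride 16 over 1e6 elements": best_time(lambda: np.average(small[::16])),
    "stride 16 over 16e6 elements": best_time(lambda: np.average(large[::16])),
}

for label, seconds in results.items():
    print(f"{label}: {seconds * 1e6:.1f} us per call")
```

The first and third cases average the same number of elements, so any gap between them comes from memory access rather than arithmetic.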