|
BLAS really shines for matrix multiplication. For element-wise operations, the best you can do is add the numbers using SIMD instructions or offload the work to a GPU, and most numeric libraries already do so when possible. The benchmark above seems unrealistic; here are the results from my newest MacBook Pro:

In [2]: import numpy as np
In [3]: X = np.ones(1000000000, dtype=np.int)
In [4]: Y = np.ones(1000000000, dtype=np.int)
In [5]: %time X = X + 2.0 * Y
CPU times: user 10.4 s, sys: 27.1 s, total: 37.5 s
Wall time: 46 s
In [6]: %time X = X + 2 * Y
CPU times: user 8.66 s, sys: 26 s, total: 34.7 s
Wall time: 42.6 s
In [7]: %time X += 2 * Y
CPU times: user 8.58 s, sys: 23.2 s, total: 31.8 s
Wall time: 37.7 s
In [8]: %time np.add(X, Y, out=X); np.add(X, Y, out=X)
CPU times: user 11.3 s, sys: 25.6 s, total: 36.9 s
Wall time: 42.6 s
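
To try the same comparison on a machine with less memory, a scaled-down sketch along these lines may help. The array size `N`, the helper name `time_variants`, and the use of `time.perf_counter` in place of `%time` are my own assumptions, not part of the session above. `np.int64` stands in for `np.int`, which newer NumPy releases no longer provide. Absolute timings will differ by machine.

```python
import time
import numpy as np

N = 1_000_000  # assumption: far smaller than the 10**9 used above, so it fits in RAM


def time_variants(n=N):
    """Time the element-wise update X <- X + 2*Y written three ways."""
    timings = {}
    results = {}
    Y = np.ones(n, dtype=np.int64)

    # Variant 1: allocates a temporary for 2 * Y and a new array for the sum.
    X = np.ones(n, dtype=np.int64)
    t0 = time.perf_counter()
    X = X + 2 * Y
    timings["X = X + 2 * Y"] = time.perf_counter() - t0
    results["X = X + 2 * Y"] = X

    # Variant 2: in-place add, but 2 * Y still creates one temporary.
    X = np.ones(n, dtype=np.int64)
    t0 = time.perf_counter()
    X += 2 * Y
    timings["X += 2 * Y"] = time.perf_counter() - t0
    results["X += 2 * Y"] = X

    # Variant 3: two in-place adds of Y, no temporaries at all.
    X = np.ones(n, dtype=np.int64)
    t0 = time.perf_counter()
    np.add(X, Y, out=X)
    np.add(X, Y, out=X)
    timings["np.add twice"] = time.perf_counter() - t0
    results["np.add twice"] = X

    return timings, results


if __name__ == "__main__":
    timings, _ = time_variants()
    for label, seconds in timings.items():
        print(f"{label:<16} {seconds:.6f} s")
```

All three variants compute the same values; they differ only in how many temporary arrays they allocate along the way.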
Unsurprisingly, Julia gives nearly the same result:

julia> X = ones(Int, 1000000000);
julia> Y = ones(Int, 1000000000);
julia> @btime X .= X .+ 2Y
34.814 s (6 allocations: 7.45 GiB)
UPD: I just noticed the 7.45 GiB of allocations. They come from `2Y`, which materializes a temporary array before the broadcast. We can get rid of them with:

julia> @btime X .= X .+ 2 .* Y
20.464 s (4 allocations: 96 bytes)

or:

julia> @btime X .+= 2 .* Y
20.098 s (4 allocations: 96 bytes)
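
The 7.45 GiB figure is consistent with a single temporary array: assuming Julia's `Int` is 64-bit (as on a 64-bit machine), 10^9 elements occupy 8 × 10^9 bytes. The sketch below checks that arithmetic and shows one possible NumPy analogue of the allocation-free update. The helper names `array_gib` and `axpy_inplace`, and the idea of passing a preallocated scratch buffer, are my own illustration rather than anything from the benchmarks above.

```python
import numpy as np


def array_gib(n_elements, dtype=np.int64):
    """Size in GiB of a dense array with n_elements items of the given dtype."""
    return n_elements * np.dtype(dtype).itemsize / 2**30


def axpy_inplace(X, Y, a, scratch):
    """Compute X <- X + a * Y without allocating new arrays.

    scratch is a preallocated array with the same shape and dtype as Y
    (hypothetical helper; a is assumed to be an integer for integer arrays).
    """
    np.multiply(Y, a, out=scratch)  # scratch <- a * Y
    np.add(X, scratch, out=X)       # X <- X + scratch
    return X


# Size of one array of 10**9 64-bit integers.
print(f"{array_gib(10**9):.2f} GiB")  # prints "7.45 GiB"
```

The scratch buffer trades one up-front allocation for none per update, which only pays off when the update is repeated many times.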
|