by ffriend 3221 days ago
BLAS really shines when you do matrix multiplication. For element-wise operations, the best you can do is add numbers using SIMD instructions or offload the work to the GPU, and most numeric libraries already do so when possible. The benchmark above seems unrealistic; here are results from my newest MacBook Pro:

    In [2]: import numpy as np

    In [3]: X = np.ones(1000000000, dtype=np.int)

    In [4]: Y = np.ones(1000000000, dtype=np.int)

    In [5]: %time X = X + 2.0 * Y
    CPU times: user 10.4 s, sys: 27.1 s, total: 37.5 s
    Wall time: 46 s

    In [6]: %time X = X + 2 * Y
    CPU times: user 8.66 s, sys: 26 s, total: 34.7 s
    Wall time: 42.6 s

    In [7]: %time X += 2 * Y
    CPU times: user 8.58 s, sys: 23.2 s, total: 31.8 s
    Wall time: 37.7 s

    In [8]: %time np.add(X, Y, out=X); np.add(X, Y, out=X)
    CPU times: user 11.3 s, sys: 25.6 s, total: 36.9 s
    Wall time: 42.6 s
Unsurprisingly, Julia gives nearly the same result:

    julia> X = ones(Int, 1000000000);
    julia> Y = ones(Int, 1000000000); 

    julia> @btime X .= X .+ 2Y
      34.814 s (6 allocations: 7.45 GiB)

UPD: Just noticed the 7.45 GiB of allocations. We can get rid of them with:

    julia> @btime X .= X .+ 2 .* Y
      20.464 s (4 allocations: 96 bytes)
or:

    julia> @btime X .+= 2 .* Y
      20.098 s (4 allocations: 96 bytes)
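On the NumPy side, `2 * Y` also materializes a full-size temporary. A rough sketch of one way to bound that, assuming 1-D arrays of the same length and a scale factor of the same kind as the array dtype: process the arrays in chunks with a small scratch buffer. The name `axpy_inplace` and the default chunk size are made up for illustration, and I have not timed this against the runs above.

```python
import numpy as np

def axpy_inplace(x, y, a, chunk=1_000_000):
    """Compute x += a * y in place, one chunk at a time.

    Hypothetical helper: assumes x and y are 1-D arrays of equal length
    and that `a` can be multiplied into x's dtype without an unsafe cast
    (e.g. an int factor for int arrays).
    """
    # One small scratch buffer is reused for every chunk, so the only
    # temporary is `chunk` elements long instead of len(y).
    scratch = np.empty(min(chunk, x.size), dtype=x.dtype)
    for start in range(0, x.size, chunk):
        stop = min(start + chunk, x.size)
        buf = scratch[: stop - start]
        np.multiply(y[start:stop], a, out=buf)         # buf = a * y[chunk]
        np.add(x[start:stop], buf, out=x[start:stop])  # x[chunk] += buf
    return x

# Small demo; the sizes are deliberately tiny so it runs instantly.
X = np.ones(10, dtype=np.int64)
Y = np.ones(10, dtype=np.int64)
axpy_inplace(X, Y, 2, chunk=3)
print(X)
```

The chunk size trades loop overhead against scratch memory; something that fits in cache is probably a reasonable starting point, but that is a guess rather than a measurement.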
2 comments

I may have missed that the previous test was hitting swap, so I repeated it on a Linux box with 1e8 numbers (instead of 1e9). Julia took 100.583 ms while Python took 207 ms (probably because it reads the array twice). So I guess adding 1e9 numbers should take about 1 second on a modern desktop CPU.
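For a rough sanity check of those numbers, here is the arithmetic as a small sketch. It assumes 64-bit integers (what `np.int` and Julia's `Int` resolve to on a 64-bit machine) and that run time scales linearly with array length once everything fits in RAM; `array_gib` is just an illustrative helper name.

```python
import numpy as np

def array_gib(n, dtype=np.int64):
    """Size in GiB of an n-element array of the given dtype."""
    return n * np.dtype(dtype).itemsize / 2**30

n = 1_000_000_000
per_array = array_gib(n)  # one array of 1e9 int64 values
print(f"one array: {per_array:.2f} GiB, X and Y together: {2 * per_array:.2f} GiB")

# Linear extrapolation from the 1e8-element timings quoted above.
# Assumption: 10x the elements costs 10x the time when nothing is swapped out.
timings_ms = {"Julia": 100.583, "Python": 207.0}
for name, ms in timings_ms.items():
    print(f"{name}: ~{ms * 10 / 1000:.2f} s for 1e9 elements")
```

A single array comes to about 7.45 GiB, which matches the allocation Julia reported for the temporary, and two of them plus a temporary can easily exceed the RAM on a laptop, which would explain the large sys times in the first test.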
I think the benchmark was probably done on a supercomputer. But it's really interesting how well Julia did. I did a basic logistic regression ML implementation in it years ago and was impressed, but I stopped following its progress. Might have to keep it on my radar!