Hacker News new | ask | show | jobs
by eigenspace 1831 days ago
The julia implementation was also suboptimal to be fair. I sped it up considerably here:

https://discourse.julialang.org/t/i-just-decided-to-migrate-...

and then even further here:

https://discourse.julialang.org/t/i-just-decided-to-migrate-...

I'd actually be very interested if there's anything other than handwritten assembly that's faster than the second one I posted.

______________

Regarding your more readable version of the Numpy syntax though, I think you have to admit it's still not as readable as

    for i in 1:N
        A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
    end
right?

I'll also note that the author of the post didn't come from the Julia community. He was a heavy user of Python and Fortran, calling Fortran from Python to speed up performance hotspots. He wrote this blogpost as his first introduction to Julia.

So I think if it does cast numpy in a poor light, it was not done so intentionally to make it look bad.

3 comments

Regarding your performance work, you've nerd sniped me into looking for analytical tricks to speed it up ;) We'll see...

Regarding the readability, I suppose it's a matter of taste at this point, but if I swapped `import numpy as np` for `from numpy import newaxis, exp, sqrt`, then IMO, the numpy is more readable:

    A = a[newaxis]
    M = exp(1j * k * sqrt(A**2 + A.T**2)) 
But then my tastes also run towards preferring the handwritten definitions to be in terms of vectors and transposes instead of elements and indices, at least until you get to many indexed tensors.

Side note: any idea why the python and fortran are exp(i k... while the julia is exp((100 +i) i...)? Is it something I overlooked?

> Regarding your performance work, you've nerd sniped me into looking for analytical tricks to speed it up ;) We'll see...

Looking forward to it! These microbenchmarks are very fun to explore.

> Side note: any idea why the python and fortran are exp(i k... while the julia is exp((100 +i) i...)? Is it something I overlooked?

Oh, that is a partially applied edit I guess. When the author posted on the julia forum, people pointed out that since the exponent was pure imaginary, it could be speeded up even more with cis(...) instead of exp(im * ...) but then the author claimed that exp((100 + im) * ...) was more representative of his actual workflow and I guess changed the julia version in his blogpost but not the Python or Fortran versions.

I got bored with trying to find an analytical boost, but I benchmarked a couple IMO super readable python versions (basically what's in my original comment after making the (100+i)i change):

https://colab.research.google.com/drive/1ABrZJlm8pwB6_Sd6ayO...

On my macbook, using XLA's jit in python gave about a 12-15x speedup on CPU over OP's solution, which was pretty cool, but I'm too lazy to figure out how to install and benchmark Julia on my machine. Applying a 12-15x speedup would at least beat the Julia MT solution in OP, and you've got to admit `exp(CONST * sqrt(A**2 + A.T**2))` is a pretty clean way to do it.

Then I ran on whatever GPU colab decided to give me (a P100), and for just adding a decorator, it's a 1000x-1900x speedup (better as n goes up). Hence my current honeymoon period with jax. I love the speed vs readability tradeoff.

> I'm too lazy to figure out how to install and benchmark Julia on my machine.

Assuming you're eager to try once you've found out:

You should be able to simply download and unpack a binary from: https://julialang.org/downloads/ I'd strongly recommend going with the current stable release (1.6.1).

To start the Julia REPL:

  bin/julia

To install packages:

  using Pkg
  Pkg.add("BenchmarkTools")
  Pkg.add("LoopVectorization")
then you can copy/paste code from eigenspaces link to discourse to run Julia benchmarks. E.g., you can copy/paste from https://discourse.julialang.org/t/i-just-decided-to-migrate-... if you define

  using LoopVectorization
  const var"@tvectorize" = var"@tturbo"
(because `@tvectorize` has since been renamed on the latest releases)
Also, note that Julia starts with only a single thread by default. You'd either need to add `-t4` when starting julia, e.g. `julia -t4`, or set the environmental variable `JULIA_NUM_THREADS=4` for 4 threads, for example.
Looks good, but I actually love that Julia gets at least the same performance right out of the box, with full interoperability and the great multiple dispatch logic.

Would be interesting to run your optimized code against the Julia optimized code that was linked to in this thread. Or run a Julia GPU benchmark as well.

As for loops versus vectorization - I personally am used to vectorized code as it has always been a requirement. Sometimes it makes sense to use (e.g. Matrix Algebra stuff).

On the other hand, I have also been in many situations where I could not vectorize my code. More code than I'm happy to admit includes list or dict comprehensions or even loops. Sure, not all performance critical. Nevertheless, the prospect of performance no matter what is pretty exciting.

Finally, threading and multiprocessing in python is a headache compared to Julia imo. Sometimes I just miss the good ol' "parfor" from Matlab.

All in all, Julia is a really appealing value proposition to people doing numeric computing.

> the great multiple dispatch logic

Curiously, as much as I love the Julia language, I cannot stand multiple dispatch. I find it extremely confusing, ugly, and downright disturbing. Why would you ever want two different functions with the same name? And if this is good, why stop here? Why not two different variables with the same name but different types? Depending on the context where they are used, the language could pick one or the other.

But this is what most languages do, except they have _singular_ dispatch. Every OOP language has different behaviour for different input types, except it only works for the _first_ input argument. Julia simply extends this to every argument.

How would you feel about having to write `sqrt_int32(x)`, `sqrt_int64(x)`, `sqrt_float64(x)`, `sqrt_rational_complex_float16(x)`, etc, etc, etc, etc, instead of just `sqrt`, and let dispatch or overloading take care of the different implementations?

You can print out the assembly it generated with code_native [0] and check if it is seems there are superfluous instructions.

To really tune things you'd have to check the instruction timings among other things[1] for your processor and to be fair apply the equivalent setting for your compiler back end as well if it supports tuning for your processor and anything else applicable.

While it is likely there is a possibility for improvement, like most optimization work, it is usually for diminishing returns for your time and may be small in nature, but sometimes compilers, even those like llvm, do something strange, so you never know for sure unless you check.

[0] https://docs.julialang.org/en/v1/stdlib/InteractiveUtils/#In...

[1] https://www.agner.org/optimize/

I only glanced at what you did but if this operating matrices or vectors, there are often crazy optimizations that can be made for CPU cache