Hacker News new | ask | show | jobs
by ianhorn 1830 days ago
This an unfair example of trying to make the numpy readable:

    n = len(a)
    M = np.exp(1j*k*(np.tile(a,(n,1))**2 + np.tile(a.reshape(n,1),(1,n))**2))
This has a common issue in functional programming and there's an easy trick to fix it. Change the big one liner by naming things:

    n = len(a)
    A = np.tile(a,(n,1))
    A_T = np.tile(a.reshape(n,1),(1,n))
    M = np.exp(1j*k*(A**2 + A_T**2))
Much easier to read now, and it even exposes an error (the sqrt is missing!). Before we fix that, let's make the obvious simplification of using the .T for transpose:

    n = len(a)
    A = np.tile(a,(n,1))
    M = np.exp(1j*k*(A**2 + A.T**2))
Then let's use numpy's broadcasting to be more idiomatic:

    A = a[:, np.newaxis]
    M = np.exp(1j*k*(A**2 + A.T**2))
Now let's fix the bug:

    A = a[:, np.newaxis]
    M = np.exp(1j*k*np.sqrt(A**2 + A.T**2))
or even

    A = a[:, np.newaxis]
    root_sum_squares = np.sqrt(A**2 + A.T**2)
    M = np.exp(1j*k*root_sum_squares)
I think the last two of these are fairer comparisons.

--------

And to pitch the hype train for jax: if you want to fuse the loops and/or get gpu/tpu for free, just swap np for jnp:

    A = a[:, jnp.newaxis]
    M = jnp.exp(1j*k*jnp.sqrt(A**2 + A.T**2))
7 comments

The julia implementation was also suboptimal to be fair. I sped it up considerably here:

https://discourse.julialang.org/t/i-just-decided-to-migrate-...

and then even further here:

https://discourse.julialang.org/t/i-just-decided-to-migrate-...

I'd actually be very interested if there's anything other than handwritten assembly that's faster than the second one I posted.

______________

Regarding your more readable version of the Numpy syntax though, I think you have to admit it's still not as readable as

    for i in 1:N
        A[i,j] = exp((100+im)*im*sqrt(a[i]^2 + a[j]^2))
    end
right?

I'll also note that the author of the post didn't come from the Julia community. He was a heavy user of Python and Fortran, calling Fortran from Python to speed up performance hotspots. He wrote this blogpost as his first introduction to Julia.

So I think if it does cast numpy in a poor light, it was not done so intentionally to make it look bad.

Regarding your performance work, you've nerd sniped me into looking for analytical tricks to speed it up ;) We'll see...

Regarding the readability, I suppose it's a matter of taste at this point, but if I swapped `import numpy as np` for `from numpy import newaxis, exp, sqrt`, then IMO, the numpy is more readable:

    A = a[newaxis]
    M = exp(1j * k * sqrt(A**2 + A.T**2)) 
But then my tastes also run towards preferring the handwritten definitions to be in terms of vectors and transposes instead of elements and indices, at least until you get to many indexed tensors.

Side note: any idea why the python and fortran are exp(i k... while the julia is exp((100 +i) i...)? Is it something I overlooked?

> Regarding your performance work, you've nerd sniped me into looking for analytical tricks to speed it up ;) We'll see...

Looking forward to it! These microbenchmarks are very fun to explore.

> Side note: any idea why the python and fortran are exp(i k... while the julia is exp((100 +i) i...)? Is it something I overlooked?

Oh, that is a partially applied edit I guess. When the author posted on the julia forum, people pointed out that since the exponent was pure imaginary, it could be speeded up even more with cis(...) instead of exp(im * ...) but then the author claimed that exp((100 + im) * ...) was more representative of his actual workflow and I guess changed the julia version in his blogpost but not the Python or Fortran versions.

I got bored with trying to find an analytical boost, but I benchmarked a couple IMO super readable python versions (basically what's in my original comment after making the (100+i)i change):

https://colab.research.google.com/drive/1ABrZJlm8pwB6_Sd6ayO...

On my macbook, using XLA's jit in python gave about a 12-15x speedup on CPU over OP's solution, which was pretty cool, but I'm too lazy to figure out how to install and benchmark Julia on my machine. Applying a 12-15x speedup would at least beat the Julia MT solution in OP, and you've got to admit `exp(CONST * sqrt(A**2 + A.T**2))` is a pretty clean way to do it.

Then I ran on whatever GPU colab decided to give me (a P100), and for just adding a decorator, it's a 1000x-1900x speedup (better as n goes up). Hence my current honeymoon period with jax. I love the speed vs readability tradeoff.

> I'm too lazy to figure out how to install and benchmark Julia on my machine.

Assuming you're eager to try once you've found out:

You should be able to simply download and unpack a binary from: https://julialang.org/downloads/ I'd strongly recommend going with the current stable release (1.6.1).

To start the Julia REPL:

  bin/julia

To install packages:

  using Pkg
  Pkg.add("BenchmarkTools")
  Pkg.add("LoopVectorization")
then you can copy/paste code from eigenspaces link to discourse to run Julia benchmarks. E.g., you can copy/paste from https://discourse.julialang.org/t/i-just-decided-to-migrate-... if you define

  using LoopVectorization
  const var"@tvectorize" = var"@tturbo"
(because `@tvectorize` has since been renamed on the latest releases)
Also, note that Julia starts with only a single thread by default. You'd either need to add `-t4` when starting julia, e.g. `julia -t4`, or set the environmental variable `JULIA_NUM_THREADS=4` for 4 threads, for example.
Looks good, but I actually love that Julia gets at least the same performance right out of the box, with full interoperability and the great multiple dispatch logic.

Would be interesting to run your optimized code against the Julia optimized code that was linked to in this thread. Or run a Julia GPU benchmark as well.

As for loops versus vectorization - I personally am used to vectorized code as it has always been a requirement. Sometimes it makes sense to use (e.g. Matrix Algebra stuff).

On the other hand, I have also been in many situations where I could not vectorize my code. More code than I'm happy to admit includes list or dict comprehensions or even loops. Sure, not all performance critical. Nevertheless, the prospect of performance no matter what is pretty exciting.

Finally, threading and multiprocessing in python is a headache compared to Julia imo. Sometimes I just miss the good ol' "parfor" from Matlab.

All in all, Julia is a really appealing value proposition to people doing numeric computing.

> the great multiple dispatch logic

Curiously, as much as I love the Julia language, I cannot stand multiple dispatch. I find it extremely confusing, ugly, and downright disturbing. Why would you ever want two different functions with the same name? And if this is good, why stop here? Why not two different variables with the same name but different types? Depending on the context where they are used, the language could pick one or the other.

You can print out the assembly it generated with code_native [0] and check if it is seems there are superfluous instructions.

To really tune things you'd have to check the instruction timings among other things[1] for your processor and to be fair apply the equivalent setting for your compiler back end as well if it supports tuning for your processor and anything else applicable.

While it is likely there is a possibility for improvement, like most optimization work, it is usually for diminishing returns for your time and may be small in nature, but sometimes compilers, even those like llvm, do something strange, so you never know for sure unless you check.

[0] https://docs.julialang.org/en/v1/stdlib/InteractiveUtils/#In...

[1] https://www.agner.org/optimize/

I only glanced at what you did but if this operating matrices or vectors, there are often crazy optimizations that can be made for CPU cache
>if you want to fuse the loops and get gpu/tpu for free

the work being done at JAX is great and Julia shares many of the goals (Julia would even be the pioneer in some cases). The best part is your library doesn't even need numpy/JAX as dependency. Yet your function that works on AbstractArray will work on arrays that live on GPU/TPU, for free. thanks to multi-dispatch, you don't need to write a ton of boiler plate code for interface and inherent some classes from JAX/numpy.

It seems to break down a bunch in real use cases though. For example, we found a differential equation benchmarks the Jax devs were optimizing, where it was still 6x slower than Julia (https://gist.github.com/ChrisRackauckas/62a063f23cccf3a55a4a...). Then the gradient calculations don't mix well with GPUs, at least for ODEs (https://github.com/google/jax/issues/5006). That lack of composability compared to Julia is the main issue that then creeps up in the bigger problems.
For fun, I ported your 3rd solution to J

  f =: 3 : 0
  'k a' =: y
  n =: #a
  A =: (n, n) $ a
  ^ 0j1 * k * 0.5 ^~ (*: A) + (*: |: A)
  )

I'm curious how the speed compares to Numpy, but I don't have a python environment installed. It ran in 6.6 seconds on my computer for n=10,000, but it used a ridiculous 10 gigabytes of memory. When you consider a 10,000x10,000 matrix has 100M elements, that you need two of them since you add two together and that a short has 4 bytes, you technically should only need 800 MB for the whole thing.
Hi, I'm the post author. As others have pointed out, I'm not a Julia evangelist of any sort, I am just starting to use Julia and I'm very enthusiastic so far.

As for more readable Python alternatives, indeed, the ‘np.tile’ isn’t strictly required, as Python will add a (1,n) matrix and a (n,1) matrix into an (n,n) matrix by default (I would be happier with an error message here, though).

There are many alternative ways to take advantage of this fact, for example what you have suggested

  A = a[:, np.newaxis]
  M = np.exp(1j*k*np.sqrt(A**2 + A.T**2))
Note that A is a (n,1) matrix here, which is not quite obvious at first sight. Equivalently, my new preferred alternative is resorting to two reshapes

  n = len(a)
  M = np.exp(1j*k*sqrt(a.reshape(n,1)**2 + a.reshape(1,n)**2)
which is more explicit about the shapes of the arrays involved. I have updated the post to better reflect this discussion.

In some more complex manual-vectorization cases I have encountered, the np.tile cannot be dropped, and the code looks a lot like my original posting.

Being able to resort to loops, and not even having to think about this manual-vectorization issues is a big plus, if you need to do this very often in your code.

Regards

Hi! Thanks for engaging with the comments.

I believe you've misunderstood the main point of my comment. I don't have too strong opinions about reshape vs newaxis.

The real point was about excessive inlining, which your update does not to fix. A fairer comparison regarding readability would be pulling out the matrix version of a into a named variable:

  n = len(a)
  A = a.reshape(n,1)
  M = exp(1j * k * sqrt(A**2 + A.T**2))
I believe these are much much more readable.

When you need loops, you need loops, of course, and python tends to suck here (though with jax's functional programming constructs the boundary is shifting). But the readability/cleanliness comparison in your post is still an unfair comparison.

If you're going to go through the effort to split things out into separate lines and name them, why the single letter names?
In mathematically oriented code, long variable names make code less readable, not more. Mathematical expressions become long and difficult to parse, and the variables often have no intrinsic meaning, so trying to make up 'meaningful names' is pointless.
Some things are just very common in math notation, enough that people recognize it better than fully named variables. Things like uppercase letter for a matrix, _T for a transpose, etc.
For someone with maths training (be it years in undergraduate engineering or whatever), one does get quite used to one letter names. Note that they always should have limited scope.
It's funny, but I find your example way more illustrative and deserving to be shared on HN, than the original post. Honestly, I don't find microbenchmarks that compelling, because there are too many caveats to take them seriously without really understanding what's happening under the hood. And I wouldn't participate in arguing about manual vs auto-vectorization, because these are just 2 different use-cases, so there's no question for me that auto-vectorization being possible in the language is just better than it being impossible. (I mean, disregarding the programming, writing kind-of-imperative formula when discussing math can serve a purpose, as it is a different perspective, than writing the same thing declaratively. I would prefer "vectorized" declarative form in the most cases, but surely one can find exceptions, where M[i,j] = ... is just much clearer.)

But your comment just vividly shows, how much proficiency in writing clear code actually matters.