Hacker News new | ask | show | jobs
by aschreyer 4788 days ago
It is a bit unfortunate that you have to "unroll" your code to get the most out of Numba. Hopefully at some point it will be able to translate NumPy functions such as np.abs(X - y).sum(axis=1)) efficiently into LLVM. Those extreme performance improvements are misleading though; in my experience the speed up compared to concise, optimal NumPy code is more in the range 50-100%.
2 comments

Numba's original purpose was not to speed up concise, optimal NumPy code. The fact that it can actually do that is an extra bonus. Numba's purpose was and still remains best at allowing you to still write the "hard-to vectorize" algorithms that require for-loops without having to use C/C++ or Fortran.

If you never are resorting to using C/C++ or Fortran, then Numba is not necessarily going to be helpful for you.

I don't think we are trying to mislead. We are simply showing examples that illustrate what Numba can be used for. Of course your code does not have many for-loops in it because you never could write code like that and have it be fast in Python. The big-deal about Numba is that now you can. You don't have to rely on a C/C++ library that either write or get someone else to write for you. You can just use Python.

I agree with you about the "unrolling". I don't recommend adding for-loops when good vectorized expressions work. Keep using them as we will be able to do a better job of parallelizing that code in the future. However, when you can't vectorize and for-loops are all you have, Numba is great.

We have spent some time on optimizing NumPy expressions like you describe and some array expressions can be translated down to equivalent LLVM loops. But most of the effort in that direction has been happening in the Blaze project which is generalizing NumPy so it can deal with much larger data but still produce optimized code from those vector expressions. This actually requires a re-thinking of how NumPy is implemented from the ground up.

I'm sure numba is an impressive product. But I keep seeing examples on the continuum blog with an obscene number of for loops, and correspondingly great speedups.

My only response is: "My code doesn't look like that."

I'd be more impressed if they didn't cherry pick code with a dozen for-loops, and instead showed a moderate speedup from more standard (and vectorized) code.

(I wrote this last blog post)

In general, we're looking for results that will get people excited, and in some communities, we can get fairly silly speedups over native Python, so we are still picking that low-hanging fruit.

However, it's also important to be tied in to real applications. I'd be happy to take a crack at applying numba to your code. Do you have a reasonably small example we could take a look at?

The Ultrafast Shape Recognition (USR) algorithm is a very simple yet interesting application used in drug discovery that I tried speeding up with Numba (the similarity calculation part). The NumPy implementation looks roughly like this:

    def usr(X, y, S=0.9, N=10):
        scores = 1.0 / (1.0 + 1/12.0 * np.abs(X - y).sum(axis=1))
        scores = scores[scores>=S]
        scores.sort()
        
        return scores[-N:][::-1]
Where X.shape could be (2000000, 12) (or more rows) and y.shape (12,). The idea is to retrieve the top N most similar hits above a similarity score of S.
This is a good one :)

The numba code isn't as pretty as it could be because slicing doesn't work for overlapping memory regions or wraparound indexing yet, and we don't have inlining :(

Here's what I get on a 2.6 GhZ Intel Core i7.

I rewrote your code to minimize memory traffic, then jitted it with numba:

  def usr_numba(x, y, S, num_best):
    m, n = x.shape
    best = np.zeros(num_best)
    best_low = 0.0

    for i in xrange(m):
        d = abs(x[i,0]-y[0])
        for j in xrange(1,n):
            d += abs(x[i,j] - y[j])
        d = 1.0 / (1.0 + 1/12.0 *d)
        if d > best_low and d > S:
            k = 0
            for k in xrange(0,num_best):
                if d > best[k]:
                    break
            for l in xrange(num_best-1, k, -1):
                best[l] = best[l-1]
            best[k] = d
            best_low = best[num_best-1]
    return best

  _usr = autojit()(usr_numba)

  In [1]: import numba_usr

  jitted kernel checks out

  N = 1000000
  usr   (s): 0.233645
  numba (s): 0.0115586
  20X speedup

  N = 2000000
  usr   (s): 0.566954
  numba (s): 0.023487
  24X speedup

  N = 4000000
  usr   (s): 1.14992
  numba (s): 0.0472016
  24X speedup

  N = 8000000
  usr   (s): 2.34968
  numba (s): 0.092601
  25X speedup

  N = 10000000
  usr   (s): 2.96395
  numba (s): 0.116032
  26X speedup

  N = 20000000
  usr   (s): 17.4779
  numba (s): 0.236304
  74X speedup
For the case aschreyer is interested in, I see a 24x speedup from half a second to two hundredths of a second. For a really big problem (2 x 10^7), numba is still well under a second and the numpy code is starting to really suffer.

My full code is here: https://gist.github.com/ahmadia/5550933

I'm putting it into a wakari notebook so you can actually check me on this :)

Edit 1 - Made the speedup a little more comprehensible (and fixed gist)

Thanks, that was a good example indeed! Improving the memory traffic was actually really important because the real-life application has N=20M+, up to 200M. Around 85M calculations per seconds is pretty impressive I have to say and the example really helps to understand how to write efficient Numba code.
what's the speedup in english?
I'd love to give you a small code snippet to look at. Can you drop me a line at dansbecker (on the email domain run by google).
Sent.