Hacker News new | ask | show | jobs
by onalark 4790 days ago
(I wrote this last blog post)

In general, we're looking for results that will get people excited, and in some communities, we can get fairly silly speedups over native Python, so we are still picking that low-hanging fruit.

However, it's also important to be tied in to real applications. I'd be happy to take a crack at applying numba to your code. Do you have a reasonably small example we could take a look at?

2 comments

The Ultrafast Shape Recognition (USR) algorithm is a very simple yet interesting application used in drug discovery that I tried speeding up with Numba (the similarity calculation part). The NumPy implementation looks roughly like this:

    def usr(X, y, S=0.9, N=10):
        scores = 1.0 / (1.0 + 1/12.0 * np.abs(X - y).sum(axis=1))
        scores = scores[scores>=S]
        scores.sort()
        
        return scores[-N:][::-1]
Where X.shape could be (2000000, 12) (or more rows) and y.shape (12,). The idea is to retrieve the top N most similar hits above a similarity score of S.
This is a good one :)

The numba code isn't as pretty as it could be because slicing doesn't work for overlapping memory regions or wraparound indexing yet, and we don't have inlining :(

Here's what I get on a 2.6 GhZ Intel Core i7.

I rewrote your code to minimize memory traffic, then jitted it with numba:

  def usr_numba(x, y, S, num_best):
    m, n = x.shape
    best = np.zeros(num_best)
    best_low = 0.0

    for i in xrange(m):
        d = abs(x[i,0]-y[0])
        for j in xrange(1,n):
            d += abs(x[i,j] - y[j])
        d = 1.0 / (1.0 + 1/12.0 *d)
        if d > best_low and d > S:
            k = 0
            for k in xrange(0,num_best):
                if d > best[k]:
                    break
            for l in xrange(num_best-1, k, -1):
                best[l] = best[l-1]
            best[k] = d
            best_low = best[num_best-1]
    return best

  _usr = autojit()(usr_numba)

  In [1]: import numba_usr

  jitted kernel checks out

  N = 1000000
  usr   (s): 0.233645
  numba (s): 0.0115586
  20X speedup

  N = 2000000
  usr   (s): 0.566954
  numba (s): 0.023487
  24X speedup

  N = 4000000
  usr   (s): 1.14992
  numba (s): 0.0472016
  24X speedup

  N = 8000000
  usr   (s): 2.34968
  numba (s): 0.092601
  25X speedup

  N = 10000000
  usr   (s): 2.96395
  numba (s): 0.116032
  26X speedup

  N = 20000000
  usr   (s): 17.4779
  numba (s): 0.236304
  74X speedup
For the case aschreyer is interested in, I see a 24x speedup from half a second to two hundredths of a second. For a really big problem (2 x 10^7), numba is still well under a second and the numpy code is starting to really suffer.

My full code is here: https://gist.github.com/ahmadia/5550933

I'm putting it into a wakari notebook so you can actually check me on this :)

Edit 1 - Made the speedup a little more comprehensible (and fixed gist)

Thanks, that was a good example indeed! Improving the memory traffic was actually really important because the real-life application has N=20M+, up to 200M. Around 85M calculations per seconds is pretty impressive I have to say and the example really helps to understand how to write efficient Numba code.
what's the speedup in english?
I'd love to give you a small code snippet to look at. Can you drop me a line at dansbecker (on the email domain run by google).
Sent.