| HN Mirror

This is a good one :)

The numba code isn't as pretty as it could be because slicing doesn't work for overlapping memory regions or wraparound indexing yet, and we don't have inlining :(

Here's what I get on a 2.6 GhZ Intel Core i7.

I rewrote your code to minimize memory traffic, then jitted it with numba:

  def usr_numba(x, y, S, num_best):
    m, n = x.shape
    best = np.zeros(num_best)
    best_low = 0.0

    for i in xrange(m):
        d = abs(x[i,0]-y[0])
        for j in xrange(1,n):
            d += abs(x[i,j] - y[j])
        d = 1.0 / (1.0 + 1/12.0 *d)
        if d > best_low and d > S:
            k = 0
            for k in xrange(0,num_best):
                if d > best[k]:
                    break
            for l in xrange(num_best-1, k, -1):
                best[l] = best[l-1]
            best[k] = d
            best_low = best[num_best-1]
    return best

  _usr = autojit()(usr_numba)

  In [1]: import numba_usr

  jitted kernel checks out

  N = 1000000
  usr   (s): 0.233645
  numba (s): 0.0115586
  20X speedup

  N = 2000000
  usr   (s): 0.566954
  numba (s): 0.023487
  24X speedup

  N = 4000000
  usr   (s): 1.14992
  numba (s): 0.0472016
  24X speedup

  N = 8000000
  usr   (s): 2.34968
  numba (s): 0.092601
  25X speedup

  N = 10000000
  usr   (s): 2.96395
  numba (s): 0.116032
  26X speedup

  N = 20000000
  usr   (s): 17.4779
  numba (s): 0.236304
  74X speedup

For the case aschreyer is interested in, I see a 24x speedup from half a second to two hundredths of a second. For a really big problem (2 x 10^7), numba is still well under a second and the numpy code is starting to really suffer.

My full code is here: https://gist.github.com/ahmadia/5550933

I'm putting it into a wakari notebook so you can actually check me on this :)

Edit 1 - Made the speedup a little more comprehensible (and fixed gist)