| This is a good one :) The numba code isn't as pretty as it could be because slicing doesn't work for overlapping memory regions or wraparound indexing yet, and we don't have inlining :( Here's what I get on a 2.6 GhZ Intel Core i7. I rewrote your code to minimize memory traffic, then jitted it with numba: def usr_numba(x, y, S, num_best):
m, n = x.shape
best = np.zeros(num_best)
best_low = 0.0
for i in xrange(m):
d = abs(x[i,0]-y[0])
for j in xrange(1,n):
d += abs(x[i,j] - y[j])
d = 1.0 / (1.0 + 1/12.0 *d)
if d > best_low and d > S:
k = 0
for k in xrange(0,num_best):
if d > best[k]:
break
for l in xrange(num_best-1, k, -1):
best[l] = best[l-1]
best[k] = d
best_low = best[num_best-1]
return best
_usr = autojit()(usr_numba)
In [1]: import numba_usr
jitted kernel checks out
N = 1000000
usr (s): 0.233645
numba (s): 0.0115586
20X speedup
N = 2000000
usr (s): 0.566954
numba (s): 0.023487
24X speedup
N = 4000000
usr (s): 1.14992
numba (s): 0.0472016
24X speedup
N = 8000000
usr (s): 2.34968
numba (s): 0.092601
25X speedup
N = 10000000
usr (s): 2.96395
numba (s): 0.116032
26X speedup
N = 20000000
usr (s): 17.4779
numba (s): 0.236304
74X speedup
For the case aschreyer is interested in, I see a 24x speedup from half a second to two hundredths of a second. For a really big problem (2 x 10^7), numba is still well under a second and the numpy code is starting to really suffer.My full code is here: https://gist.github.com/ahmadia/5550933 I'm putting it into a wakari notebook so you can actually check me on this :) Edit 1 - Made the speedup a little more comprehensible (and fixed gist) |