Hacker News new | ask | show | jobs
by emn13 4861 days ago
As a counterpoint; a last project I wrote in college was a machine learning algorithm. By a rough comparison it was on the order of 10000 times faster in C++ than the preexisting matlab implementation. The cause was that the performance bottleneck was not in large matrix operations; instead, there were lots of iterative updates until convergence; this meant small vectors; a C++ template-based matrix library such as Eigen ends up inlining almost all of it into one no-allocation dense bit of math the traditional optimizer can milk for every last bit.

And it's not just about static/dynamic language differences here: practically, JIT might even do better by specializing the algorithm for a particular dimensionality, whereas that's impractical in C++ since you don't know the dimensionality until runtime.

Now, sometimes you can reduce your algorithm to some large-scale eigenvalue decomposition or whatever, and then numpy or similar might provide reasonable performance. But it's not a very general solution because performance on small structures is terrible (and iterative simple updates are common in many algorithms). JITted code relying on some underlying native library (like numpy) could never extract reasonable performance from this type of code; it would be forced to make many, many function calls in the innermost loop.

1 comments

Good points, certainly, but just to clarify: Numba is not particularly dependent on Numpy's built-in vectorized and matrix operations. Instead, it's using the datatype information to do JIT type inference over the functions being called with the matrix/array arguments, and building machine code for them. You can call Numba JITted functions from other Numba JITted functions, and the overhead is the same as C functions calling each other.
I've got no numba experience whatsoever, but if you're doing real function calls and memory allocations for simple things like multiplying small matrices, your code will be at least an order of magnitude slower than optimal, even in C. malloc's a big hit, and function calls often are too - not just because of the call itself (and the CPU cache hit that can involve), but no less significantly because they're opaque to the optimizer - and that means that the wrapping function is often optimized much less well.

It's not a fundamental issue, but I haven't seen a JIT do this particularly well, yet. All that inlining makes compiling slower, so to some extent the run-time nature of the JIT is an inherent limitation here.