by pwang 4863 days ago
> Is it also faster than C? From my limited experience, it seems that people sometimes spend a lot of time on concurrency when faster code would have been easier

It can reach Fortran speeds with the right tools. With Numba (http://numba.pydata.org/), your pure Python code gets compiled down to optimized machine code at call time, if your arguments are NumPy arrays. With NumbaPro (https://store.continuum.io/cshop/numbapro), we automatically parallelize for multi-core CPUs and emit CUDA/PTX for GPUs, exploiting the parallelism in your data and algorithm.
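
A minimal sketch of the call-time compilation described above, assuming a recent Numba release where `njit` is the decorator (the API available when this comment was written may have differed). The function name `sum_of_squares` and the pass-through fallback decorator are illustrative only, so the snippet still runs as plain Python when Numba is not installed.

```python
try:
    import numpy as np
    from numba import njit          # Numba's nopython-mode JIT decorator
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False

    def njit(func):                 # hypothetical stand-in: no compilation
        return func


@njit
def sum_of_squares(values):
    # An explicit loop; with Numba this is compiled to machine code the
    # first time the function is called with a given argument type.
    total = 0.0
    for i in range(len(values)):
        total += values[i] * values[i]
    return total


if HAVE_NUMBA:
    data = np.arange(5, dtype=np.float64)
else:
    data = [0.0, 1.0, 2.0, 3.0, 4.0]

result = sum_of_squares(data)       # first call triggers compilation
print(result)
```

The first call pays the compilation cost; later calls with the same argument types reuse the compiled code.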

The reason "higher-level languages" can be faster than lower-level ones is that the compiler has more information about data parallelism. Typically, "low-level languages" are lower in that their type primitives are smaller, so algorithms written in them have already turned vectorizable array operations into opaque for loops over arbitrary loop variables.

I certainly agree with you that many people now reach for distributed and parallel computing while leaving a lot of single-core and single-node performance on the table, mostly by ignoring the realities of memory bandwidth on modern CPUs. However, that level of efficiency is well within the reach of the Scientific Python stack. (See this blog post for how we're building a persistence format that respects the memory hierarchy: continuum.io/blog/blz-format)


As a counterpoint: the last project I wrote in college was a machine learning algorithm. By a rough comparison, it was on the order of 10,000 times faster in C++ than the preexisting MATLAB implementation. The performance bottleneck was not in large matrix operations. Instead, there were lots of iterative updates until convergence, which meant small vectors. A C++ template-based matrix library such as Eigen ends up inlining almost all of that into one dense, allocation-free block of math that a traditional optimizer can milk for every last bit of performance.

And it's not just about static/dynamic language differences here: in practice, a JIT might even do better by specializing the algorithm for a particular dimensionality, whereas that's impractical in C++ when you don't know the dimensionality until runtime.
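
To make "specializing for a particular dimensionality" concrete, here is a toy sketch in plain Python. It is not how any real JIT works internally; it only shows the idea of generating a fully unrolled routine once the dimension is known at runtime. The helper name `make_unrolled_dot` is hypothetical.

```python
def make_unrolled_dot(dim):
    """Build a dot-product function unrolled for exactly `dim` elements."""
    if dim < 1:
        raise ValueError("dim must be at least 1")
    # Generate source such as: a[0] * b[0] + a[1] * b[1] + a[2] * b[2]
    terms = " + ".join(f"a[{i}] * b[{i}]" for i in range(dim))
    source = f"def dot(a, b):\n    return {terms}\n"
    namespace = {}
    exec(source, namespace)         # compile the specialized function
    return namespace["dot"]


# The dimension is only known here, at runtime.
dot_for_3 = make_unrolled_dot(3)
value = dot_for_3([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(value)
```

A real JIT would emit machine code rather than Python source, but the trade-off is the same: the specialized routine has no loop and no size checks, at the cost of generating it at runtime.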

Now, sometimes you can reduce your algorithm to some large-scale eigenvalue decomposition or whatever, and then NumPy or similar might provide reasonable performance. But it's not a very general solution, because performance on small structures is terrible (and simple iterative updates are common in many algorithms). JITted code relying on some underlying native library (like NumPy) could never extract reasonable performance from this type of code; it would be forced to make many, many function calls in the innermost loop.
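
A rough way to see the small-structure overhead for yourself, as a sketch only: the update rule `v <- 0.5 * v + c` is a made-up stand-in for "iterative updates until convergence", not the algorithm from the project described above. The NumPy half is skipped if NumPy is not installed, and timings will vary by machine.

```python
import timeit

try:
    import numpy as np
except ImportError:
    np = None                       # NumPy half of the comparison is skipped

STEPS = 60


def iterate_scalar():
    # Fixed-point update on a 2-vector, written out with plain floats.
    x, y = 0.0, 0.0
    for _ in range(STEPS):
        x = 0.5 * x + 1.0
        y = 0.5 * y + 2.0
    return x, y


def iterate_numpy():
    # Same update on a 2-element array: every step goes through library
    # calls and allocates temporary arrays for the intermediate results.
    v = np.zeros(2)
    c = np.array([1.0, 2.0])
    for _ in range(STEPS):
        v = 0.5 * v + c
    return v


scalar_result = iterate_scalar()
print("scalar:", timeit.timeit(iterate_scalar, number=200))
if np is not None:
    print("numpy: ", timeit.timeit(iterate_numpy, number=200))
```

Both versions converge to the same values; the only difference being measured is the per-step cost of the calls and allocations on a tiny vector.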

Good points, certainly, but just to clarify: Numba is not particularly dependent on NumPy's built-in vectorized and matrix operations. Instead, it uses the datatype information to do JIT type inference over the functions being called with the matrix/array arguments, and builds machine code for them. You can call Numba JITted functions from other Numba JITted functions, and the overhead is the same as C functions calling each other.
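
A small sketch of one JITted function calling another, assuming a recent Numba release with the `njit` decorator. The function names are hypothetical, and the pass-through fallback only exists so the snippet also runs without Numba installed.

```python
try:
    import numpy as np
    from numba import njit          # Numba's nopython-mode JIT decorator
    HAVE_NUMBA = True
except ImportError:
    HAVE_NUMBA = False

    def njit(func):                 # hypothetical stand-in: no compilation
        return func


@njit
def dot3(a, b):
    # Small helper on 3-vectors: explicit arithmetic, no library calls.
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


@njit
def sum_pairwise_dots(points):
    # Calls the JITted helper from the innermost loop of a JITted function.
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(n):
            total += dot3(points[i], points[j])
    return total


rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
points = np.array(rows) if HAVE_NUMBA else rows
total = sum_pairwise_dots(points)
print(total)
```

Neither function uses NumPy's vectorized operations; the array is only the container whose type drives compilation.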

I've got no Numba experience whatsoever, but if you're doing real function calls and memory allocations for simple things like multiplying small matrices, your code will be at least an order of magnitude slower than optimal, even in C. malloc is a big hit, and function calls often are too, not just because of the call itself (and the CPU cache misses it can involve), but no less significantly because they're opaque to the optimizer, which means the wrapping function is often optimized much less well.

It's not a fundamental issue, but I haven't seen a JIT do this particularly well, yet. All that inlining makes compiling slower, so to some extent the run-time nature of the JIT is an inherent limitation here.