by timtadh 4860 days ago
[Edit: the parent originally had a sentence about not understanding why people like Python for Scientific Computing. This was my response to that. The parent has now removed the sentence.]

We (the people using Python for Scientific Computing) like Python for the following reasons:

1. NumPy+SciPy+matplotlib+cvxopt is a very speedy environment. Its only real competitor for what it provides is MATLAB. I have a colleague who benchmarked Python vs. MATLAB for our workload. Python is faster (often because some of the algorithms used are newer than the equivalents in MATLAB).

2. It is a very productive environment. We do a lot of evolutionary changes and prototyping. Doing this in C would cost us development time. This is academic work, and mostly the code isn't important; the analysis is.

3. We generally know where the "hot loops" are, and those are what we focus on for optimization. This generally involves doing math on paper. Then implementing it. If you turn loops into matrix multiplications and use a good matrix library, you get a great speedup.
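
A minimal sketch of point 3, assuming the hot loop is an all-pairs squared-distance computation (a hypothetical example, not the workload discussed above). The "math on paper" step is the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, which turns the triple loop into one matrix product:

```python
import numpy as np

def pairwise_sq_dists_loop(X):
    """Squared Euclidean distances between all rows of X, with explicit loops."""
    n, d = X.shape
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                s += diff * diff
            D[i, j] = s
    return D

def pairwise_sq_dists_matmul(X):
    """Same result, with the O(n^2 d) work done by a single matrix product."""
    sq = np.sum(X * X, axis=1)               # ||x_i||^2 for every row
    G = X @ X.T                              # all dot products x_i . x_j
    D = sq[:, None] + sq[None, :] - 2.0 * G  # broadcast the row norms
    return np.maximum(D, 0.0)                # clip tiny negatives from rounding
```

Both functions return the same matrix up to floating-point rounding; how much faster the second one is depends on the array sizes and the BLAS library NumPy is linked against, so it is worth timing on your own data.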


Sorry, I decided that it wasn't important before you replied. I am genuinely interested in why people use Python for scientific computing, though.

> I have a colleague who benchmarked Python vs. MATLAB for our workload. Python is faster

Is it also faster than C? From my limited experience, it seems that people sometimes spend a lot of time on concurrency when faster code would have been easier.

> This generally involves doing math on paper. Then implementing it.

Ah, yes, math always wins. This reinforces your point #2.

So, is #2 that much of a win? Do scientific programs spend more time in "development" than "production"?

> Is it also faster than C? From my limited experience, it seems that people sometimes spend a lot of time on concurrency when faster code would have been easier

It can reach FORTRAN speeds with the right tools. With Numba (http://numba.pydata.org/), your pure Python code gets compiled down to optimized machine code at call time if your arguments are NumPy arrays. With NumbaPro (https://store.continuum.io/cshop/numbapro), we automatically parallelize for multi-core CPUs, emit CUDA/PTX for GPUs, and automatically exploit the parallelism in your data and algorithm.
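
As a rough sketch of what that looks like in practice: the function below is a plain Python loop over a NumPy array, decorated so that Numba compiles it on the first call. It assumes the `njit` decorator from current Numba releases (which postdates this thread), and `running_mean` is a made-up example function. The `ImportError` fallback is only there so the sketch still runs, uncompiled, where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # compiles on first call for the argument types seen
except ImportError:
    def njit(func):          # fallback: run as ordinary (slow) Python
        return func

@njit
def running_mean(x, window):
    """Mean of every length-`window` slice of x, using a running sum."""
    out = np.empty(x.size - window + 1)
    acc = 0.0
    for i in range(window):
        acc += x[i]
    out[0] = acc / window
    for i in range(window, x.size):
        acc += x[i] - x[i - window]   # slide the window by one element
        out[i - window + 1] = acc / window
    return out
```

Calling `running_mean(np.arange(10.0), 3)` triggers compilation for a float64 array and an integer window; later calls with the same argument types reuse the compiled code.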

The reason "higher-level languages" can be faster than lower-level ones is that the compiler has more information about data parallelism. Typically, "low-level languages" are lower in that their type primitives are smaller, and hence the algorithms around those have turned vectorizable arrays into opaque for loops over arbitrary loop variables.

I certainly agree with you that many people now reach for distributed and parallel while leaving a lot of single-core and single-node performance on the table, mostly by ignoring the realities of memory bandwidth on modern CPUs. However, that level of efficiency is well within the reach of the Scientific Python stack. (See this blog post for how we're building a persistence format that respects memory hierarchy: continuum.io/blog/blz-format)

As a counterpoint: the last project I wrote in college was a machine learning algorithm. By a rough comparison, it was on the order of 10,000 times faster in C++ than the preexisting MATLAB implementation. The cause was that the performance bottleneck was not in large matrix operations. Instead, there were lots of iterative updates until convergence, which meant small vectors. A C++ template-based matrix library such as Eigen ends up inlining almost all of it into one dense, allocation-free block of math that the traditional optimizer can milk for every last bit.

And it's not just about static/dynamic language differences here: practically, a JIT might even do better by specializing the algorithm for a particular dimensionality, whereas that's impractical in C++ since you don't know the dimensionality until runtime.

Now, sometimes you can reduce your algorithm to some large-scale eigenvalue decomposition or whatever, and then numpy or similar might provide reasonable performance. But it's not a very general solution because performance on small structures is terrible (and iterative simple updates are common in many algorithms). JITted code relying on some underlying native library (like numpy) could never extract reasonable performance from this type of code; it would be forced to make many, many function calls in the innermost loop.
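
A small sketch of the "many function calls in the innermost loop" problem, assuming a hypothetical 2x2 linear update iterated many times (not the algorithm from the project above). The NumPy version pays call and temporary-array overhead on every step; the scalar version does the same arithmetic on plain floats:

```python
import timeit
import numpy as np

A = [[0.5, 0.1], [0.2, 0.3]]   # hypothetical 2x2 update matrix
A_np = np.array(A)

def iterate_numpy(x0, steps):
    """x <- A x + 1, with NumPy calls and new temporary arrays at every step."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = A_np @ x + 1.0
    return x

def iterate_scalar(x0, steps):
    """Same update written out on plain floats, with no array objects."""
    x, y = float(x0[0]), float(x0[1])
    (a, b), (c, d) = A
    for _ in range(steps):
        x, y = a * x + b * y + 1.0, c * x + d * y + 1.0
    return x, y

if __name__ == "__main__":
    for fn in (iterate_numpy, iterate_scalar):
        t = timeit.timeit(lambda: fn((1.0, 2.0), 10_000), number=10)
        print(f"{fn.__name__}: {t / 10:.4f} s per 10k-step run")
```

On typical CPython installs the scalar version tends to win at this size, because the arithmetic is trivial next to the per-call overhead, but the gap is machine-dependent, so measure rather than assume.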

Good points, certainly, but just to clarify: Numba is not particularly dependent on NumPy's built-in vectorized and matrix operations. Instead, it's using the datatype information to do JIT type inference over the functions being called with the matrix/array arguments, and building machine code for them. You can call Numba JITted functions from other Numba JITted functions, and the overhead is the same as C functions calling each other.

I've got no Numba experience whatsoever, but if you're doing real function calls and memory allocations for simple things like multiplying small matrices, your code will be at least an order of magnitude slower than optimal, even in C. malloc's a big hit, and function calls often are too: not just because of the call itself (and the CPU cache penalty that can involve), but no less significantly because they're opaque to the optimizer, and that means that the wrapping function is often optimized much less well.

It's not a fundamental issue, but I haven't seen a JIT do this particularly well, yet. All that inlining makes compiling slower, so to some extent the run-time nature of the JIT is an inherent limitation here.

There is no "production" in scientific programs. It runs once correctly to make the figure... More seriously, the ontology is often a moving target, so the longer it takes to rewrite significant parts of the data structures, the less time there is to do science.

Re: concurrency: I have a script that boots hundreds of IPython workers on hundreds of cores. I then make a client object (in another IPython shell) and map my 1e8 parameter configurations onto the cores, all in under a minute. This is much faster than rewriting in C.
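
The pattern is just "build the parameter grid, map a function over it". A minimal single-machine sketch follows, using the standard library's `concurrent.futures` as a stand-in for the IPython cluster (a swapped-in technique, not the actual script). `simulate` and the parameter names are hypothetical placeholders for one simulation run:

```python
import itertools
from concurrent.futures import ProcessPoolExecutor

def simulate(params):
    """Hypothetical stand-in for one simulation run; returns a scalar summary."""
    coupling, delay = params
    return coupling * coupling + delay

def sweep(grid, executor):
    """Map `simulate` over every parameter configuration, preserving order."""
    return list(executor.map(simulate, grid))

if __name__ == "__main__":
    couplings = [0.1 * i for i in range(5)]
    delays = [1.0, 2.0, 4.0]
    grid = list(itertools.product(couplings, delays))  # every combination
    with ProcessPoolExecutor() as pool:                # one worker per core by default
        results = sweep(grid, pool)
    print(len(results), "configurations evaluated")
```

Because `sweep` only needs an object with a `map` method, the same function body works with a thread pool for testing, and the cluster version has the same shape with the executor swapped for the cluster's own map-capable object.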

I even implemented a special case of the brain simulator we've developed in Python (http://thevirtualbrain.org/) in C w/ unaliased pointer arithmetic etc. It's 50% faster but took more than 50% longer to write; on the other hand, the PyCUDA implementation is 80x faster and didn't take 80x as long to write, maybe 10x. Also a win because PyCUDA takes care of the uglier details.

So #2 is a big win.

> It's 50% faster but took more than 50% longer to write

At this point it's useful to know how long it takes to run, and how long to write. Is a run days long, months long, or years long? Or, put another way, is concurrency more expensive than a C re-programmer?

> Also a win because PyCUDA takes care of the uglier details.

Is there not an analogous C++ library to take care of ugly details?

(I actually like Python a lot, so there's a bit of devil's advocate going on. But my longest-running Python programs take less than an hour.)
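
The run-time versus write-time question above can be put as a break-even count: how many runs before a faster rewrite repays its extra development time. A small sketch, where the function name and all numbers are made up for illustration:

```python
def break_even_runs(extra_dev_hours, run_hours_slow, speedup):
    """Runs needed before a `speedup`-times-faster rewrite pays back its extra dev time."""
    if speedup <= 1.0:
        raise ValueError("the rewrite must be faster than the original")
    saved_per_run = run_hours_slow * (1.0 - 1.0 / speedup)  # hours saved each run
    return extra_dev_hours / saved_per_run

# Hypothetical numbers: 40 extra hours of development, 2-hour runs, 1.5x speedup.
print(break_even_runs(40.0, 2.0, 1.5))
```

With those assumed inputs the rewrite saves two thirds of an hour per run, so it breaks even after roughly 60 runs; with multi-day runs the same 1.5x pays off after only a handful.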

Typical simulations for us take between half a minute and several days, but this varies, because it's typically necessary to do a parameter sweep in several dimensions (leading in extreme cases to runtimes of several months on a cluster).

I believe Thrust (now shipped w/ the CUDA SDK) makes things easier, but (since you know Python) nothing like NumPy exists in C++, and PyCUDA maps NumPy seamlessly onto GPU computing, which is a big win.

If you are getting good results from CUDA and PyCUDA, you might want to take a look at Numba and NumbaPro: http://numba.pydata.org and http://continuum.io/numbapro. They are still in their early stages but work pretty well in a number of cases. Here is an example of what NumbaPro can do: http://docs.continuum.io/numbapro/generalizedufuncs.html#gen...

Numba is completely open source. NumbaPro is not open source, but it is free for academic users.

> Is there not an analogous C++ library to take care of ugly details?

No. In general, there isn't an analogous library in the more static, explicit languages (it doesn't matter much which library you choose). There are libraries that people use when they have similar requirements, but they are rarely analogous.