by c-fe 1537 days ago
I have heard about Cython before but I have never actually used it. I have however used Numpy, Scipy and Numba. Are there any reasons to also consider Cython in combination with those other libraries? E.g. in which cases would Cython be considerably better than Numpy or Numba? My workload consists mostly of data science and statistics, running models and simulations.
3 comments

Cython works great in conjunction with Numpy arrays and you can easily call numpy and scipy methods from within Cython. The big win comes when you have to do some operation to a numpy array that doesn't have a 'fast' path within numpy. If you ever find yourself in a situation where you have to loop over or apply any sort of custom operation to every element in a numpy array then Cython can be a huge win, especially since Cython also makes it possible to parallelise those loops.

The other place it shines is if you ever need to loop over an array of data that cannot easily be represented as numpy arrays, like strings or more complex structs. Here you can get significant speedups compared to python.

The third use of Cython I really like is with C and C++ interop. Sure there are lots of ways of calling C code from Python, but to me Cython is probably the quickest and cleanest.

Compared to Numba, it's harder to say. Numba, when it works, is easily as fast as Cython. However, I find Numba hard to reason about, and it's still a bit of a black box as to when and why it does and doesn't work. The nice thing about Cython is that it is pretty simple, so you can easily reason about what it will do to your code and how it will perform. It's been a long time since Cython 'surprised' me by performing much better or worse than I expected.

If you want to see Cython in action, take a look at the source code of scikit-image or scikit-learn. They implement many of their core algorithms in Cython.

Numpy and Scipy do the heavy lifting in fast compiled C / Fortran, but if you write a for-loop doing these things, it will still be (comparatively) slow.
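To make that concrete, here is a minimal sketch of the same computation written twice: once as an interpreted for-loop and once as a single NumPy call. The function names and the sum-of-squares task are made up for illustration; only the NumPy calls are real.

```python
import numpy as np


def sum_squares_loop(values):
    """Sum of squares with an explicit Python loop (runs in the interpreter)."""
    total = 0.0
    for v in values:  # every iteration pays interpreter overhead
        total += v * v
    return total


def sum_squares_vectorized(values):
    """Same result, but the loop runs inside NumPy's compiled code."""
    return float(np.dot(values, values))


data = np.arange(1000, dtype=np.float64)
# Both versions agree; only the speed differs.
assert np.isclose(sum_squares_loop(data), sum_squares_vectorized(data))
```

Timing both with `timeit` on a larger array should show the gap, though how big it is depends on the machine and array size.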

Numba is a JIT, and only covers some of Numpy. I'd say it's amazing at how well it works, but it "only" covers certain aspects of the language. It's also a bit of an all-or-nothing - if it doesn't cover a certain class of syntax, it just won't JIT.
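As a rough sketch of what that looks like in practice: the loop below carries state from one element to the next, so it has no obvious one-line NumPy equivalent, and it sticks to the subset Numba handles well (plain loops, scalars, NumPy arrays). The function `clipped_cumsum` is a made-up example, and the `try`/`except` fallback is only there so the snippet still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # compiles the function on its first call
except ImportError:
    def njit(func):
        # Fallback (assumption for this sketch): run as plain Python.
        return func


@njit
def clipped_cumsum(values, lo, hi):
    """Running sum that is clamped to [lo, hi] after every step."""
    out = np.empty_like(values)
    total = 0.0
    for i in range(values.shape[0]):
        total = min(max(total + values[i], lo), hi)
        out[i] = total
    return out


steps = np.array([1.0, 2.0, -5.0, 4.0])
result = clipped_cumsum(steps, 0.0, 2.5)
```

The first call includes the compilation time; later calls with the same argument types reuse the compiled version.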

Cython is ahead-of-time compiled, and much more comprehensive. It effectively turns Python into C and compiles it as a Python extension. The possible scope is thus much greater, and although Cython comes with built-in support for Numpy, it is much broader in principle.

So... it's a very different set of trade-offs. Like with Numba, out of the box, with no changes, you will typically see a significant improvement (what's significant? From experience about 2x). You have much more scope for tweaking your code to speed things up - move some of the execution to C, disable bounds checking, outright call C libraries, etc. It comes with a suite of tools for analysing performance bottlenecks. It used to come with a lot of special syntax, which nowadays is done with annotations and decorators - much neater IMO. And of course, no run-time compilation delay, it's moved to, well, compilation time.
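For a feel of the annotation-and-decorator style, here is a small sketch in Cython's pure Python mode. It assumes the `cython` package is installed; `running_max` is a made-up function. Run as ordinary Python, the decorators do nothing and the type hints are ignored, so the file still works uncompiled; the speedup only appears after compiling it (for example with `cythonize -i` on the file).

```python
import cython  # assumes Cython is installed; provides the types and decorators
import numpy as np


@cython.boundscheck(False)  # when compiled: skip index bounds checks
@cython.wraparound(False)   # when compiled: no negative-index handling
def running_max(values: cython.double[:], out: cython.double[:]) -> None:
    """Write the running maximum of `values` into `out` (same length)."""
    i: cython.Py_ssize_t
    best: cython.double = values[0]
    for i in range(values.shape[0]):
        if values[i] > best:
            best = values[i]
        out[i] = best


samples = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
maxima = np.empty_like(samples)
running_max(samples, maxima)
```

With bounds checking turned off, an out-of-range index in compiled code is no longer caught, so this is a trade of safety for speed.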

Numba is better in my opinion for the use case you describe, less hassle.

However, (I think) cython is superior when:

- you want to distribute your code (e.g. as a PyPI package)

- you want to interface with C/C++ code libs

I found that I almost never have to do either of these, and I have not touched Cython since I started using Numba.