by JesseMeyer 1939 days ago
Cython has first-class treatment for Numpy arrays. Can Mypyc generate machine-optimized code for chomping Numpy arrays element-wise?

I don’t think I want my toolchain to have first class knowledge of specific libraries...
Python is married to Numpy for scientific computing.
In my opinion it's this sort of short-sighted thinking that has cursed the Python project. "Everyone uses CPython" leads to "let's just let third party packages depend on any part of CPython" which leads to "Can't optimize CPython because it might break a dependency" which leads to "CPython is too slow, the ecosystem needs to invest heavily in c-extensions [including numpy]" which leads to "Can't create alternate Python implementations because the ecosystem depends concretely on CPython"[^1] and probably also the mess that is Python package management.

I'm not sure that the Numpy/Pandas hegemony over Python scientific computing will last. Eventually the ecosystem might move toward Arrow or something else. In this case it's probably not such a big deal because Arrow's mainstream debut will probably predate any serious adoption of Cython, but if it didn't then the latter would effectively preclude the former--Arrow becomes infeasible because everyone is using Cython/Numpy and Cython/Arrow performance is too poor to make the move, and since no one is making the move it's not worth investing in an Arrow special case in Cython and now no one gets the benefits that Arrow confers over Numpy/Pandas.

[^1]: Yes, Pypy exists and its maintainers have done yeoman's work in striving for compatibility with the ecosystem, and still (last I checked) you couldn't do such exotic things as "talking to a Postgres database via a production-ready (read: 'maintained, performant, secure, tested, stable, etc') package".

You are mixing up "how things are implemented" with "stuff that data scientists interact with."

Arrow is a low-level implementation detail, like BLAS. "Using" Arrow in data science in Python would mean implementing an Arrow-backed Pandas (or Pandas-like) DataFrame.

Your rank-and-file data scientist doesn't even know that Arrow exists, let alone that you can theoretically implement arrays, matrices, and data frames backed by it.

If you want to break the hegemony of Numpy, you will have to reimplement Numpy using CFFI instead of the CPython C API. There is no other way, unless you get everyone to switch to Julia.

Scientists are typically not trained computer scientists. They neither care about nor appreciate these technical arguments. They have two datasets, A and B, and want their sum, expressed in a neat, tidy form.

C = A + B

Python with Numpy perfectly serves just that need. We all have our grief with the status quo, but Python needs data processing acceleration from somewhere. In my view, Python needs to implement a JIT to alleviate 95% of the need for Numpy.
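
To make the "neat tidy form" point concrete, here is a minimal stdlib-only sketch of what happens without Numpy. The values in A and B are made up for illustration. On plain Python lists, `+` concatenates rather than adding element-wise, so the sum has to be spelled out by hand, whereas Numpy arrays define `+` as element-wise addition.

```python
# Illustrative data only; these values are not from the thread.
A = [1.0, 2.0, 3.0]
B = [10.0, 20.0, 30.0]

# On built-in lists, "+" means concatenation, not arithmetic.
concatenated = A + B

# The element-wise sum has to be written out explicitly.
C = [a + b for a, b in zip(A, B)]

print(len(concatenated))  # length of the concatenated list
print(C)                  # element-wise sums
```

With Numpy arrays in place of the lists, the comprehension collapses to the one-liner quoted above, which is the whole appeal for someone who just wants the sum.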

Scientists aren't the only players at the scientific computing table these days. There's increasing demand to bring data science applications to market, which implies engineering requirements in addition to the science requirements.

> In my view, Python needs to implement a JIT to alleviate 95% of the need for Numpy.

Numpy is just statically typed arrays. This seems like the best case for AOT compilers, no? I'm all for JIT as well, but I don't have faith in the Python community to get us there.

JIT works great here too. It would see iteration and the associated mathematical calculations as a hotspot, and optimize only those parts, which is easy since the arrays are statically typed and sized.
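
As a rough sketch of the kind of hotspot being described, here is an element-wise loop over statically typed arrays, using the stdlib `array` module as a stand-in for Numpy. The function name `add_elementwise` is hypothetical, and nothing here claims that any particular JIT or AOT compiler actually optimizes this code; it only shows the shape of the loop (fixed element type, known length, simple arithmetic) that such a compiler would have to recognize.

```python
from array import array


def add_elementwise(a: array, b: array) -> array:
    """Return a new array of C doubles where out[i] = a[i] + b[i]."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    # Preallocate the output: typecode "d" means every element is a C double.
    out = array("d", [0.0]) * len(a)
    # This loop is the hotspot: same operation, same types, every iteration.
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out


result = add_elementwise(array("d", [1.0, 2.0]), array("d", [0.5, 0.25]))
print(result.tolist())
```

In stock CPython every `a[i] + b[i]` still boxes and unboxes a float object, which is the per-element overhead that a JIT or an AOT compiler would aim to remove.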

I say this as a Computer Scientist at NASA that tends to re-write the scientific code in straight C. But for many workloads, a JIT would make my team more productive, basically for free as a user.

https://pythoncapi.readthedocs.io/roadmap.html

The hope is to create a new C API which doesn't expose CPython interpreter details, is easily exposed by interpreters other than CPython, and then port C-based APIs to it. Sadly it seems they aren't making much progress in 2020/2021. And I don't think it will eliminate Cython/Numpy overhead entirely, so Cython adding Numpy-specific features will still improve performance.

Also Pypy now has a compatibility shim for CPython extension modules. But last time I checked, it was slower than CPython for running one of my Numpy-based programs (corrscope), due to interfacing overhead.