
by lalaland1125 685 days ago

Optimizing Python extensions is becoming increasingly important as Python is used in more and more compute intensive environments.

The key to optimizing a Python extension is to minimize the number of times you have to interact with Python.

A couple of other tips in addition to what this article provides:

1. Object pooling is quite useful as it can significantly cut down on the number of allocations.

2. Be very careful about tools like pybind11 that make it easier to write extensions for Python. They come with a significant amount of overhead. For critical hotspots, always use the raw Python C extension API.

3. Use numpy arrays whenever possible when returning large lists to Python. A python list of python integers is amazingly inefficient compared to a numpy array of integers.
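To put a rough number on point 3, here is a minimal sketch that compares a list of Python integers with a packed buffer. It uses the standard library's `array` module as a stand-in for a numpy array (an assumption made only to avoid the dependency; both store the integers inline rather than as separate objects), and `list_footprint` is a hypothetical helper, not part of any library.

```python
import sys
from array import array


def list_footprint(values):
    """Approximate bytes used by a list plus the int objects it points to."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


n = 100_000
as_list = list(range(n))
# "q" is a signed long long (8 bytes on common platforms), stored inline.
as_packed = array("q", as_list)

list_bytes = list_footprint(as_list)
packed_bytes = sys.getsizeof(as_packed)  # includes the element buffer

print(f"list of ints: {list_bytes / n:.1f} bytes per element")
print(f"packed array: {packed_bytes / n:.1f} bytes per element")
```

The exact figures depend on the interpreter version and platform, so treat the output as an order-of-magnitude comparison rather than a benchmark.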

8 comments

RhysU 685 days ago

> Optimizing Python extensions is becoming increasingly important as Python is used in more and more compute intensive environments.

I have always loved how the trick to making Python better eventually comes down to not writing Python.

Spivak 685 days ago

Is this not expected? You're never going to have any language with the kind of dynamism that Python/Ruby/JS have while also having performant number crunching, simply because Python has to do significantly more work for the same line of code. You could envision a world where a JIT could recognize cases where all that dynamism falls away and generate code similar to what you would get in the equivalent C, but that's just a fancy way to not write Python again. You would be writing in an informal, not very well-defined restricted subset of Python that JITs cleanly.

The problem since time immemorial is language complexity vs. the ability to hint to your compiler that it can make much stronger assumptions about your code than it has to assume naturally, which is where we got __slots__. And there are lots of easy wins you could get in Python that eliminate a significant amount of dynamism: you could tell your compiler that you'll never shadow names, for example, that this list is fixed size, or that you don't want number promotion. But they all require adding stuff to the language to indicate that.
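As a small illustration of the `__slots__` hint mentioned above, the sketch below contrasts an ordinary class with one that declares its attributes up front. The class names are hypothetical and exist only for this example.

```python
class PointDynamic:
    """Ordinary class: every instance carries a per-instance __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Declares a fixed attribute set, so instances have no __dict__."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


dynamic = PointDynamic(1, 2)
dynamic.z = 3  # allowed: attributes can be added at any time

slotted = PointSlots(1, 2)
try:
    slotted.z = 3  # rejected: "z" is not a declared slot
    accepted_new_attribute = True
except AttributeError:
    accepted_new_attribute = False

print(hasattr(dynamic, "__dict__"), hasattr(slotted, "__dict__"))
print(accepted_new_attribute)
```

The trade is the one described in the comment: the class gives up some dynamism (no ad hoc attributes) in exchange for a layout the runtime can rely on.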

When you're looking from the bottom up you end up making different trade-offs. While you get nice primitives that generate very tight assembly, when you need that dynamism you end up having this object model that exists in the abstract that you orchestrate and poke at but don't really see, like GObject. Ironically, HN's love-to-hate language C++ gives you both simultaneously but at the cost of a very complicated language.

rurban 685 days ago

> You're never going to have any language with the kind of dynamism that Python/Ruby/JS have while also having performant number crunching simply because Python has to do significantly more work for the same line of code.

Wrong. With strong types you do have the ability to tell the compiler that most of the dynamic checks and hooks can be omitted, and that values can stay unboxed. Python, Ruby, and Perl have chosen to ignore types so far. And JavaScript and PHP at least did dynamic optimizations with their type hints.

halfcat 684 days ago

>> You're never going to have any language with the kind of dynamism that Python/Ruby/JS have while also having performant number crunching

> Wrong

Can you give examples of languages that achieve both? Or is this all just on a spectrum? Like if we say C# is performant, it’s still the case that (eventually) the way to make it faster is “stop writing C#”.

neonsunset 684 days ago

> Like if we say C# is performant, it’s still the case that (eventually) the way to make it faster is “stop writing C#”.

That stopped being true a few years ago, and in some cases was never true.

The way to a faster C# codebase is writing faster C#. .NET CoreLib including all performance-sensitive paths like memmove is written in pure C#, and the VM (by that I mean all reflection bits, TypeLoader, etc.) itself, excluding GC, is also pure C# when you are using NativeAOT.

The optimization techniques to achieve this include, but are not limited to, using monomorphized struct generics, stack buffers, arenas and memory pooling, using the SIMD API (which has the same performance characteristics as intrinsics in C/C++), not allocating by using structs or making object lifetimes GC-friendly if allocations cannot be avoided, making otherwise safe code friendly to bounds-check elision, reducing indirection, etc. Many of these are exactly the same as what you would do in languages like C, C++ or Rust.

As a result, the use of FFI to call into C/C++/Rust/ObjC/Swift(upcoming native Swift ABI support)/etc. today is predominantly relegated to accessing necessary OS APIs and libraries.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Of course most of these optimizations are at odds with "dynamism" and yield the speed-up by making the dispatch static and inlining-friendly, and giving the compiler more information it can prove and make use of. Not to mention C# (together with Swift) sits the closest to the metal among otherwise high-level languages by virtue of what .NET is and what its ILC and JIT compile IL to.

halfcat 684 days ago

So, no examples? :)

That’s my gut feeling, that writing fast C# basically ends up looking like C++/Rust/etc where the niceties of C# are no longer present.

Which is the same as rewriting Python in cython. It’s way faster, and it doesn’t look like Python, or have the benefits of Python, and now just looks like weird C.

stabbles 685 days ago

Why not go all the way and limit the times you have to interact with Python to zero ;)

mkoubaa 685 days ago

Because if your users want python, you have to convince them they don't want it. If you fail, you'll have made optimized code that nobody uses.

Another strategy is to actually serve your users

coldtea 685 days ago

Because then you have the warts of the new language, and the pain of migrating code, which could be 100s or millions of lines, to worry about...

rbanffy 684 days ago

People have been trying to move away from COBOL since I was a kid and I’m not even young.

0x63_Problems 685 days ago

Totally agree, keeping the interface with the extension as thin as possible makes sense.

I hadn't considered object pooling in this context, it might be more involved since each node has distinct data but for my use case it might still be a performance win.
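For what pooling nodes with distinct data could look like, here is a minimal free-list sketch. It is written in Python purely for illustration (in an extension the same idea would live on the native side), and `Node` and `NodePool` are hypothetical names, not anything from the article.

```python
class Node:
    """A tree node whose payload differs per use."""

    def __init__(self):
        self.kind = None
        self.children = []


class NodePool:
    """Hands out recycled Node objects instead of allocating new ones."""

    def __init__(self):
        self._free = []
        self.allocations = 0  # how many Nodes were actually constructed

    def acquire(self, kind):
        if self._free:
            node = self._free.pop()
        else:
            node = Node()
            self.allocations += 1
        node.kind = kind
        return node

    def release(self, node):
        # Reset per-node data so stale state never leaks into the next user.
        node.kind = None
        node.children.clear()
        self._free.append(node)


pool = NodePool()
first = pool.acquire("call")
first.children.append("arg")
pool.release(first)
second = pool.acquire("name")  # reuses the object released above
print(second is first, pool.allocations)
```

The reset step in `release` is the part that gets more involved when each node carries distinct data: every field has to be cleared or overwritten before reuse.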

Have you ever used pyo3 for rust bindings? I haven't measured the overhead but I have been assuming that it's worth the tradeoff vs. rolling my own.

(I'm the author)

hansvm 685 days ago

My last workplace used pyo3 for a project. It was slower than vanilla Python, and you picked up all the normal compiled-language problems like slow builds and cross-compilation toolchains.

I wouldn't take away from that observation that pyo3 is slow (it was just a poor fit; FFI for minuscule amounts of work), but the fact that the binding costs were higher than vanilla Python computations suggests that the overhead is (was?) meaningful. I don't know how it compares to a hand-written extension.

lifthrasiir 685 days ago

That's pretty surprising, because I have also used PyO3 extensively in my daily job and it was quite performant. Your comment does seem to suggest that you were also using the `numpy` crate or similar in addition to `pyo3`, whose performance might be more variable than I would expect for PyO3 though. (I personally minimized the use of `numpy` for that reason, but didn't have a particular performance issue with it anyway.)

ameliaquining 685 days ago

I'd definitely be curious to know what specific runtime operations PyO3 inserts that you don't have to do with the C API. Naively it doesn't seem like there should be any, since Rust has zero-overhead C FFI.

hansvm 684 days ago

Sorry, "FFI" was a shorthand for "mixing and matching two languages' GC expectations, memory layouts, ..." and all the overhead associated with merging something opinionated, like Rust, with something dynamic, like Python. You almost certainly _can_ reduce that overhead further, but unless somebody has gone out of their way to do so, the default expectation for cross-language calls like that should be that somebody opted for maintainable code that has actually shipped instead of shaving off every last theoretical bit of overhead.

It's been a few years, so I really can't tell you exactly what the problem was (other than the general observation that you should try to do nontrivial amounts of work in your python extensions rather than trivial amounts), but PyO3 agrees with the general sentiment [0] [1], or at least did at roughly the same time I was working there.

[0] https://github.com/PyO3/pyo3/issues/679

[1] https://github.com/PyO3/pyo3/issues/1470

ameliaquining 681 days ago

You'd also run into this if you wrote your native extension in C, right?

tomjakubowski 685 days ago

re: 3, Python has a native numeric array type https://docs.python.org/3/library/array.html

raymondh 685 days ago

We should probably get rid of that. It is old (predating numpy) and has limited functionality. In almost every case I can think of, you would be better off with numpy.

jph00 685 days ago

If you don't want to add a dep on numpy (which is a big complex module) then it's nice to have a stdlib option. So there are certainly at least some cases where you're not better off with numpy.

coldtea 685 days ago

Even better if Python adds a mainline pandas/numpy-like C-based table structure, with a very small subset of the pandas/numpy functionality, that's also convertible to pandas/numpy/etc.

ameliaquining 685 days ago

What kind of subset would you have in mind? I think that any kind of numeric operation would be off the table, for the reasons given in PEP 465:

"Providing a quality implementation of matrix multiplication is highly non-trivial. Naive nested loop implementations are very slow and shipping such an implementation in CPython would just create a trap for users. But the alternative – providing a modern, competitive matrix multiply – would require that CPython link to a BLAS library, which brings a set of new complications. In particular, several popular BLAS libraries (including the one that ships by default on OS X) currently break the use of multiprocessing."

im3w1l 685 days ago

Numpy is incredibly widespread and basically a standard so I would propose: It should have exactly the same layout in memory as a numpy array. It's fine if it has a very limited set of operations out-of-the-box. Maybe something like get, set, elementwise-arithmetic. Work with numpy project to make it possible to cast it to numpy array to help the common case where someone is fine with a dep on numpy and wants the full set of numpy operations.

coldtea 684 days ago

The best they can do without BLAS. Doesn't have to be as fast as numpy, just faster and more memory efficient than doing it in native Python, without the dependency.

KptMarchewa 684 days ago

It just should be native support for Apache Arrow.

dumah 685 days ago

A performant table data structure with powerful and convenient syntax for interaction is one great feature Q has that Python lacks.

f33d5173 685 days ago

array is for serializing to/from binary data. It isn't useful for returning from a library because the only way a python programmer can consume it is by converting into python objects, at which point there is no efficiency benefit. numpy has a library of functions for operating directly on the referenced data, as well as a cottage industry of libraries that will take a numpy array as input. Obviously someone might end up casting it to a list anyways, but there is at least the opportunity for them to not do that.
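The binary round trip described above can be sketched with the standard library alone. The values are arbitrary and chosen only for illustration.

```python
from array import array

# "q" is a signed long long; the elements are stored as raw machine values.
values = array("q", [1, 2, 3, 2**40])

payload = values.tobytes()  # raw bytes in native byte order

restored = array("q")
restored.frombytes(payload)  # rebuild the array from the raw bytes

print(len(payload), values.itemsize)
print(restored == values)

# Consuming the data element by element turns each value back into a
# Python int object, which is where the efficiency benefit is lost.
as_objects = list(restored)
```

Note that `tobytes` uses the machine's native byte order, so data exchanged between different platforms may need `array.byteswap()` or an explicit format.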

sgarland 684 days ago

multiprocessing.shared_memory.ShareableList can be useful in some circumstances, even if you don’t intend on sharing it across processes. It allows direct access to the data, elements are mutable (to an extent; you can’t increase the size of either the overall list or its elements once built), and since the underlying shm is exposed, you can get memoryviews for zero-copy.
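A minimal sketch of those properties, assuming a platform where shared memory is available. `demo_shareable_list` is a hypothetical helper name, and the block is cleaned up explicitly because nothing else owns it.

```python
from multiprocessing.shared_memory import ShareableList


def demo_shareable_list():
    sl = ShareableList([1, 2.5, "abc"])
    try:
        sl[0] = 42     # elements can be mutated in place
        sl[2] = "xyz"  # a string that fits the existing storage is fine
        try:
            sl[2] = "x" * 64  # too large for the space reserved at creation
            grew = True
        except ValueError:
            grew = False
        # The underlying shared memory block exposes a memoryview.
        exposes_memoryview = isinstance(sl.shm.buf, memoryview)
        return list(sl), grew, exposes_memoryview
    finally:
        sl.shm.close()   # drop this process's mapping
        sl.shm.unlink()  # free the shared memory block


contents, grew, exposes_memoryview = demo_shareable_list()
print(contents, grew, exposes_memoryview)
```

Any slice taken from `sl.shm.buf` has to be released before `close()` is called, otherwise closing the mapping fails while the exported view is still alive.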

The downside is they’re on the more esoteric side of Python, so people may not be as familiar with them as other structures.

dumah 685 days ago

Cython implements a C api for accessing the underlying data structure.

Arrays implement the buffer interface so they can be used efficiently with tools like numpy.

sgarland 684 days ago

Re: 3, you can also use Python’s array.array in some circumstances. If you have homogeneous types, don’t need multiple dimensions, and don’t need Fortran memory layout, they’re a good choice IMO, and one that doesn’t require pulling in 3rd party packages.

jmkr 684 days ago

I have thought Python's arrays have been overlooked for years. So much so that people call a list an array.

alkh 685 days ago

Re: 2, is there any good repo that uses the raw Python C API and can serve as a reference for someone who is not too proficient in C? I took a look at numpy but it seems too complicated for me.

maxmorlocke 685 days ago

I've found rapidfuzz to be a good, digestible C/Python integration. It's especially nice as the algorithms implemented in C frequently have good pseudocode or other language representations, so you can reference really well. The docs are in reasonable shape as well:

https://github.com/rapidfuzz/RapidFuzz

jay-barronville 685 days ago

You mind elaborating on what exactly you’re looking for? Maybe I can help point you in the right direction, but right now, it’s not clear given your description.

lifthrasiir 685 days ago

> 2. Be very careful about tools like pybind11 that make it easier to write extensions for Python. They come with a significant amount of overhead. For critical hotspots, always use the raw Python C extension API.

Agreed on the broader point (and I don't like pybind11 that much anyway), but the raw Python C extension API is often hard to use correctly. I would suggest that you should at least have a rough idea about how higher-level libraries like pybind11 would translate to the C API, so that you can recognize performance pitfalls in advance.

> 3. Use numpy arrays whenever possible when returning large lists to Python. A python list of python integers is amazingly inefficient compared to a numpy array of integers.

Or use the `array` module in the standard library if that matters. numpy is not a small library and has quite a large impact on the initial startup time. (Libraries like PyTorch are much worse, to be fair.)

dumah 685 days ago

3. The memoryview interface is often a good solution.

lalaland1125 685 days ago

I think you mean the buffer interface?

I think the buffer interface is too complex to provide directly to users. I think an API that returns numpy arrays is simpler and easier to understand.

sgarland 684 days ago

Memoryviews abstract the buffer interface as an object, so perhaps that’s what was meant.
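To make that concrete, here is a small sketch of a memoryview wrapping two buffer-providing objects from the standard library. The data is arbitrary and only for illustration.

```python
from array import array

# A memoryview over a bytearray: slicing creates a window, not a copy.
data = bytearray(b"0123456789")
view = memoryview(data)
window = view[2:6]
window[0] = ord("X")  # writes through to the original bytearray
print(data)

# array.array also implements the buffer interface, so the same object
# type gives typed, zero-copy access to its elements.
numbers = array("i", [10, 20, 30, 40])
num_view = memoryview(numbers)
num_view[1] = 99  # modifies the array in place
print(numbers.tolist(), num_view.format, num_view.itemsize == numbers.itemsize)
```

A consumer only needs the memoryview object; it never has to deal with the underlying buffer protocol directly.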

I disagree with the inclination to jump to numpy. I much prefer minimizing 3rd party libraries, especially if the performance is equivalent or nearly so.

jmkr 684 days ago

I discovered memoryview when looking at the JACK Python library. It is pretty neat. But also one of those things I wouldn't have known to look for.