by logicchains 680 days ago
The GIL causes a huge performance hit in data processing/ML by forcing the use of multiprocessing, which leads to a lot of unnecessary copying of memory between processes unless you put in the effort to share memory explicitly. So in some cases the savings will be gigantic, simply from no longer copying huge dataframes between processes.
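A rough sketch of where that copy comes from, using only the standard library. The list `data` is a hypothetical stand-in for a large dataframe, and the pickle round trip imitates what happens to an argument sent to a worker process; it is an illustration, not a benchmark.

```python
import pickle
import threading

# Hypothetical stand-in for a large dataframe: one million integers.
data = list(range(1_000_000))

# Sending an argument to a worker process means serialising it on one side
# and rebuilding it on the other, which yields an independent copy.
payload = pickle.dumps(data)
copy_in_worker = pickle.loads(payload)

# A thread just receives a reference to the same object.
seen_by_thread = []
worker = threading.Thread(target=seen_by_thread.append, args=(data,))
worker.start()
worker.join()

print(copy_in_worker is data)     # False: a second copy now exists
print(seen_by_thread[0] is data)  # True: nothing was copied
print(len(payload))               # bytes serialised just to hand the data over
```

With tens of gigabytes instead of a million integers, that serialise-and-rebuild step is the overhead being described.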
3 comments

But usually, in spaces where you need speed, Python is just an orchestrator or glue between pipelines, and the actual calculations are done by a database or some C/C++/Fortran library.
Yes, pandas/numpy call C++ to do calculations efficiently, but the "glue" can still introduce a significant slowdown relative to that when it's copying tens of gigabytes of dataframe unnecessarily between processes. Of course that slow part could itself be moved to C++, but that's much more effort than just parallel mapping over the dataset in Python with no copying/multiprocessing, as will be possible with no-gil.
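A minimal sketch of what "parallel mapping over the dataset with no copying" could look like with threads. The helper names (`chunk_bounds`, `partial_sum`, `parallel_sum_of_squares`) are made up for illustration, and a plain list stands in for the dataframe. On a build with the GIL the threads take turns, so this pure-Python loop gains nothing; the point is only that every worker reads the same object in place.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n_rows, n_chunks):
    """Split range(n_rows) into contiguous (start, stop) pairs."""
    step = max(1, -(-n_rows // n_chunks))  # ceiling division, never zero
    return [(start, min(start + step, n_rows)) for start in range(0, n_rows, step)]


def partial_sum(rows, start, stop):
    """Sum of squares over rows[start:stop], reading the shared list in place."""
    total = 0
    for i in range(start, stop):
        total += rows[i] * rows[i]
    return total


def parallel_sum_of_squares(rows, n_workers=4):
    """Map partial_sum over index ranges; only the indices are passed around."""
    bounds = chunk_bounds(len(rows), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(partial_sum, rows, start, stop) for start, stop in bounds]
        return sum(f.result() for f in futures)


rows = list(range(100_000))  # hypothetical stand-in for the dataset
result = parallel_sum_of_squares(rows)
print(result == sum(x * x for x in rows))  # True
```

Each task receives the list by reference plus two integers, so no part of the data is serialised or duplicated.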
Bad code/quick hacks will always be slow (but can be great for prototypes), and sometimes it's worth planning how you're going to process something rather than piling on multiprocessing. Once you reach the point of multigigabyte IPC, it's worth spending the time doing it right.
Building libraries on a GIL-less Python would enable people to access that power without them all building it from scratch themselves.
GIL-less Python isn't magic pixie dust. The same group of users who have slow, poorly structured code will at best run into deadlocks. GIL-less Python can be used by well-designed libraries to achieve speedups, but that's not code written by the aforementioned pandas users. Speaking from experience, there's a lot more room for order-of-magnitude speedups from fixing quick hacks than from running things in parallel, and it's usually a lot easier than managing multithreaded code.
> GIL-less Python can be used by well-designed libraries to achieve speedups, but that's not code written by the aforementioned pandas users

Yes, that's why having something like Pandas use it would be better than getting all users to write their own version.

If the libraries are thread safe, can they not release the GIL to avoid copying?

I am pretty sure you are going to say there is a reason this cannot be done; I would just like to know what it is!
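For reference, the pattern in question looks roughly like this with a library that does release the GIL around its native work. The sketch uses `hashlib` from the standard library, which is documented to release the GIL while hashing large buffers; the buffer contents and the four-way split are arbitrary choices for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# One large buffer visible to every thread (about 10 MB).
buffer = bytes(range(256)) * 40_000
view = memoryview(buffer)


def digest_slice(bounds):
    start, stop = bounds
    # Slicing a memoryview does not copy; hashlib reads the bytes in place.
    return hashlib.sha256(view[start:stop]).hexdigest()


chunk = len(buffer) // 4
slices = [(i * chunk, (i + 1) * chunk) for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(digest_slice, slices))

print(len(digests))  # 4
```

This only helps while execution stays inside the native call; any Python-level code between such calls still runs under the GIL.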

What libraries? If you're writing some pandas code and want to parallelise some part of your data pipeline, as far as I'm aware pandas doesn't have much support for that; you need to manually use multiprocessing to process different parts of the dataframe in different processes. Yes, there are pandas alternatives that claim to be a drop-in replacement with better parallelism support, but the more pandas features you use, the more likely you are to depend on something they don't support, meaning you need to rewrite some code to switch to them.
But that's such a small fraction of total Python use that it cannot serve as justification for making it the default.
It is a fraction of usage that is commercially important to people who fund a lot of Python development.
Aka a power grab for short-term gain.