| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by logicchains 680 days ago
	The GIL causes a huge performance hit in data processing/ML by forcing the use of multi-process, which leads to a bunch of unnecessary copying of memory between processes unless you put in a bunch of effort to explicitly share memory. So in some cases the savings will be gigantic, from no longer unnecessarily copying huge dataframes between processes.

3 comments

antupis 680 days ago

But usually, in spaces where you need speed Python is just an orchestrator or glue between pipelines, and actually, calculations are done by db or some c/c++/fortran library.

link

logicchains 680 days ago

Yes pandas/numpy calls C++ to do calculations efficiently, but the "glue" can still introduce significant slowdown relative to that when it's copying tens of gigabytes of dataframe unnecessarily between processes. Of course that slow part itself could also be moved to C++, but that's much more effort then just parallel mapping over the dataset in Python with no copying/multiprocessing, as will be possible with no-gil.

link

aragilar 680 days ago

Bad code/quick hacks will always be slow (but can be great for prototypes), and sometimes it's worth planning how you're going to process something rather than piling on multiprocessing. Once you reach the point of multigigabyte IPC, it's worth spending the time doing it right.

link

robertlagrant 680 days ago

Building libraries on a GIL-less Python would enable people to access that power without them all building it from scratch themselves.

link

aragilar 680 days ago

GIL-less Python isn't magic pixie dust, the same group of users who have slow, poorly structured code are at best run into deadlocks. GIL-less Python can be used by well-designed libraries to achieve speedups, but that's not code written by the aforementioned pandas users, and speaking from experience, there's a lot more room for order of magnitude speedups from fixing quick hacks than running things in parallel, and usually it's a lot easier than managing multithreaded code.

link

robertlagrant 680 days ago

> GIL-less Python can be used by well-designed libraries to achieve speedups, but that's not code written by the aforementioned pandas users

Yes, that's why having something like Pandas use it would be better than getting all users to write their own version.

link

graemep 680 days ago

If the libraries are thread safe can they not release the GIL to avoid copying.

I am pretty sure you are going to say there is a reason this cannot be done, would just like to know what it is!

link

logicchains 680 days ago

What libraries? If you're writing some pandas code and want to parallelise some part of your data pipeline, as far as I'm aware Pandas doesn't have much support for that, you need to manually use multiprocessing to process different parts of the dataframe on different threads. Yes there are pandas alternatives that claim to be a drop-in replacement with better parallelism support, but the more pandas features you use, the more likely you are to depend on something they don't support, meaning you need to rewrite some code to switch to them.

link

tgv 680 days ago

But that's such a small fraction of total Python use, that it cannot serve as a validation to make it the default.

link

graemep 680 days ago

It is a fraction of usage that is commercially important to people who fund a lot of Python development.

link

tgv 679 days ago

Aka a power grab for short-term gain.

link