Hacker News new | ask | show | jobs
by rfoo 598 days ago
> Python is never really going to be 'fast' no matter what is done to it because its semantics make most important optimizations impossible

Scientific computing community have a bunch of code calling numpy or whatever stuff. They are pretty fast because, well, numpy isn't written in Python. However, there is a scalability issue: they can only drive so many threads (not 1, but not many) in a process due to GIL.

Okay, you may ask, why not just use a lot of processes and message-passing? That's how historically people work around the GIL issue. However, you need to either swallow the cost of serializing data over and over again (pickle is quite slow, even it's not, it's wasting precious memory bandwidth), or do very complicated dance with shared memory.

It's not for web app bois, who may just write TypeScript.

2 comments

This is misleading. Most of the compute intensive work in Numpy releases the GIL, and you can use traditional multithreading. That is the case for many other compute intensive compiled extensions as well.
No. Like the siblings said, say, you have a program which spends 10% time in Python code between these numpy calls. The code is still not scalable, because you can run at most 10 such threads in a Python process before you hit the hard limit imposed by GIL.

There is no need to eliminate the 10% or make it 5% or whatever, people happily pay 10% overhead for convenience, but being limited to 10 threads is a showstopper.

Python has no threads or processes hard limit, and the pure Python code in between calls into C extensions is irrelevant because you would not apply multithreading to it. Even if you did, the optimal number of threads would vary based on workload and compute. No idea where you got 10.
> No idea where you got 10.

Because of GIL, there may be at most one thread in a Python process running pure Python code. If I have a computation which takes 10% time in pure Python, 90% time in C extensions, I can only launch at most 10 threads, because 10 * 10% = 100%, and expect mostly linear scalability.

> the pure Python code in between calls into C extensions is irrelevant because you would not apply multithreading to it

No. There is a very important use case where the entire computation, driven by Python, is embarrassingly parallel and you'd want to parallize that, instead of having internal parallization in each your C extensions call. So the pure Python code in between calls into C extensions MUST BE SCALABLE. C extensions code may not launch thread at all.

This is your original comment, which as stated is simply incorrect

> numpy isn't written in Python. However, there is a scalability issue: they can only drive so many threads (not 1, but not many) in a process due to GIL.

Now you have concocted this arbitrary example of why you can't use multithreading that has nothing to do with your original comment or my response.

> instead of having internal parallelization in each your C extensions call ... C extensions code may not launch thread at all.

I don't think you understood my comment - or maybe you don't understand Python multithreading. If a C extension is single threaded but releases the GIL, you can use multithreading to parallelize it in Python. e.g. `ThreadPool(processes=100)` will create 100 threads within the current Python process and it will soak all the CPUs you have -- without additional Python processes. I have done this many times with numpy, numba, vector indexes, etc.

Even for your workload, using multithreading for the GIL-free code in a hierarchical parallelization scheme would be far more efficient than naive multiprocessing.

> This is your original comment, which as stated is simply incorrect

I apologize if I can't make you understand what I said. But I still believe I said it clearly and it is simply correct.

Anyway, let me try to mansplain it again, "numpy isn't written in Python" - and numpy releases GIL. So as long as a Python code calls into numpy, the thread on which the numpy call runs can go without GIL. The GIL could be taken by another thread to run Python code. I have easily have 100 threads running in numpy without any GIL issue. BUT, they eventually needs to return to Python and retake GIL. Say, for each such thread and for every 1 second there is 0.1 seconds they need to run pure Python code (and must hold GIL). Please tell me how to scale this to >10 threads.

> Now you have concocted this arbitrary example of why you can't use multithreading that has nothing to do with your original comment or my response.

The example is not arbitrary at all. This is exactly the problem people are facing TODAY in ANY DL training written in PyTorch.

I have a Python thread, driving GPUs in an async way, it barely runs any Python code, all good. No problem.

Then, I need to load and preprocess data [1] for the GPUs to consume. I need very high velocity changing this code, so it looks like a stupid script, reads data from storage, and then does some transformation using numpy / again whatever shit I decided to call. Unfortunately, as dealing with the data is largely where magic happens in today's DL, the code spend non-trivial time (say, 10%) in pure Python code manipulating bullshit dicts in between all numpy calls.

Compared to what happens on GPUs this is pretty lightweight, this is not latency sensitive and I just need good enough throughput, there's always enough CPU cores alongside GPUs, so, ideally, I just dial up the concurrency. And then I hit the GIL wall.

> I don't think you understood my comment - or maybe you don't understand Python multithreading. If a C extension is single threaded but releases the GIL, you can use multithreading to parallelize it in Python. e.g. `ThreadPool(processes=100)` will create 100 threads within the current Python process and it will soak all the CPUs you have -- without additional Python processes. I have done this many times with numpy, numba, vector indexes, etc.

I don't think you understood what you did before and you already wasted a lot of CPUs.

I never talked about doing any SINGLE computation. Maybe you are one of those HPC gurus who care and only care about solving single very big problem instances? Otherwise I have no idea why you are even talking about hierarchical parallelization after I already said that a lot of problem is embarrassingly parallel and they are important.

[1] Why don't I simply do it once and store the result? Because that's where actual research is happening and is what a lot of experiments are about. Yeah, not model architectures, not hyper-parameters. Just how you massage your data.

It’s an Amdahl’s law sort of thing, you can extract some of the parallelism with scikit-learn but what’s left is serialized. Particularly for those interactive jobs where you might write plain ordinary Python snippets that could get a 12x speedup (string parsing for a small ‘data lake’)

In so far as it is all threaded for C and Python you can parallelize it all with one paradigm that also makes a mean dynamic web server.

Numpy is not fast enough for actual performance sensitive scientific computing. Yes threading can help, but at the end of the day the single threaded perf isn't where it needs to be, and is held back too much by the python glue between Numpy calls. This makes interproceedural optimizations impossible.

Accellerated sub-languages like Numba, Jax, Pytorch, etc. or just whole new languages are really the only way forward here unless massive semantic changes are made to Python.

These "accelerated sub-languages" are still driven by, well, Python glue. That's why we need free-threading and faster Python. We want the glue to be faster because it's currently the most accessible glue to the community.

In fact, Sam, the man behind free-threading, works on PyTorch. From my understanding he decided to explore nogil because GIL is holding DL trainings written in PyTorch back. Namely, the PyTorch DataLoader code itself and almost all data loading pipelines in real training codebases are hopeless bloody mess just because all of the IPC/SHM nonsense.