by alfalfasprout 1061 days ago
Nowadays multiprocessing is rarely the answer. Between all the gotchas (memory usage can be horrific, you have to be careful what you modify, etc.), it's almost never the right tool.

Nowadays numba is usually a better solution for when you want to run some computationally expensive python code that itself calls numpy, etc.

For the parent commenter's use case though that wouldn't be a great solution either. In general, Python does not have an optimal way of operating on a shared data structure across OS threads and certainly not in a way that doesn't require forking the interpreter.

You have to be much more careful about what you modify when using multithreading, so I'm not sure what you mean by that.

A lot of people here mention that sharing data is much easier with multithreading, but doing this without races is not easy.

You can't just use values from different threads like you would in normal code. You need to synchronize access with locks, which can be difficult to do correctly and can harm performance in a lot of cases.
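To make the locking point concrete, here is a minimal sketch of several threads incrementing one shared counter. The function name `run_counter` and its parameters are made up for illustration. The `with lock:` block is what keeps the read-modify-write from interleaving between threads.

```python
import threading


def run_counter(n_threads=4, increments=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # `counter += 1` is a read-modify-write, not an atomic step.
            # Holding the lock makes each increment happen as one unit.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_counter())
```

Every increment now pays for a lock acquire and release, which is the performance cost mentioned above. Dropping the lock makes the result depend on how the threads happen to interleave.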

I think a lot of the people who complain about the GIL are going to become acutely aware of why it was useful when they attempt to use GIL-less multithreading, and realize that removing it wasn't as great as it sounded at first!

In my experience, most problems are inherently synchronous with lots of mutable state and complex data dependencies, or inherently parallel with lots of tasks that can run independently. Problems that can be easily parallelized already work fine with multiprocessing! Problems that can't be easily parallelized are not something you can just slap some threading on to get more performance, and will require a lot of work to keep state synced!

This is just my opinion though and I'm sure there are plenty of domains that I don't have experience with that will benefit from no-GIL python!

> Problems that can be easily parallelized already work fine with multiprocessing!

Yeah, except afaik you pay more in context switches, and sharing is more cumbersome. Also, each process's language runtime is likely working with less information, and you end up using more memory on multiple runtime instances.

Frankly I'd just use Java or Go at that point and not even bother

Multithreading is hard but once you have been doing it a while, it becomes easy and most importantly, it’s stable.

When you have to deal with processes, there are a lot of external factors out of your control, because processes are much more visible and carry a lot of extra baggage.

Hard multithreading problems are fun. Hard multi-process problems are just tedious.

As I understand it, on Linux processes and threads are implemented in almost the same way, just that threads share memory. I've heard it said several times that the idea that processes are "heavier" is a bit of a myth. I guess they need to allocate heap space and threads don't. I'm not an expert, just mentioning it because it sounded like you might be believing something which is at odds with what people say about processes and threads on Linux.

I'm not a Linux kernel dev but I think this is true! Not sure what's up with the downvotes.

You can create a process/thread chimera with certain system calls, and get something that is in-between a thread and process if you want, which is neat but maybe not that useful.

Creating processes on Linux is actually much faster than people seem to realize. I can spawn at least a few thousand a second from a quick test of spawning bash instances.
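For anyone who wants to try this kind of quick test, here is a rough sketch. The helper name `spawn_rate` is made up, and it starts each process and waits for it to exit before starting the next, so it measures sequential spawn-and-exit time rather than peak throughput. It assumes `bash` is on the PATH and falls back to the Python interpreter if it is not, which will be noticeably slower to start.

```python
import shutil
import subprocess
import sys
import time


def spawn_rate(cmd, n=200):
    """Run `cmd` n times, one after another, and return processes per second."""
    start = time.perf_counter()
    for _ in range(n):
        # check=True raises if the command exits with a non-zero status.
        subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    return n / elapsed


if __name__ == "__main__":
    if shutil.which("bash"):
        cmd = ["bash", "-c", "exit 0"]
    else:
        # Fallback only so the script still runs without bash.
        cmd = [sys.executable, "-c", "pass"]
    print(f"{spawn_rate(cmd):.0f} processes/second")
```

The number depends heavily on the machine and on what is being started, so treat it as a ballpark rather than a benchmark.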

Not sure why this is directed at my comment-- I didn't touch on synchronization.

Yes, locks like mutexes and semaphores, and approaches like atomics and lock-free data structures, come into play when writing multithreaded code. There's no getting around that.

> In my experience, most problems are inherently synchronous with lots of mutable state and complex data dependencies, or inherently parallel with lots of tasks that can run independently. Problems that can be easily parallelized already work fine with multiprocessing!

This is a hot take though-- most problems that are truly embarrassingly parallel don't work as well as you'd think w/ multiprocessing. There's a ton of overhead there, and when you do need synchronization steps (e.g., in reductions) it can get pretty messy.
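As a small illustration of where that overhead comes from, here is a sketch of a map-then-reduce with `multiprocessing`. The function names are made up, and it explicitly asks for the `fork` start method, which assumes a POSIX system such as Linux. Each chunk is pickled and sent to a worker, each partial result is pickled and sent back, and the final reduction happens in the parent.

```python
import multiprocessing as mp


def partial_sum_of_squares(chunk):
    """Work done inside a worker process on one chunk of the input."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=2):
    """Split data into chunks, map over a process pool, reduce in the parent."""
    # Ceiling division so every element lands in some chunk.
    size = max(1, -(-len(data) // workers))
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    # Assumption: "fork" is available (POSIX only, not Windows).
    ctx = mp.get_context("fork")
    with ctx.Pool(workers) as pool:
        # Chunks are pickled and copied to the workers; results come back
        # the same way. None of this copying exists in a threaded version.
        partials = pool.map(partial_sum_of_squares, chunks)

    # The reduction step: combine the per-chunk results in the parent.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1_000_000))))
```

For cheap per-element work like this, the copying can easily cost more than the computation it parallelizes. Anything more involved than a single final sum, such as workers needing each other's intermediate results, needs extra plumbing on top.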