by wmwmwm 1061 days ago
Historically I’ve written several services that load up some big datastructure (10s or 100s of GB), then expose an HTTP API on top of it. Every time I’ve done a quick implementation in Python of a service that then became popular (within a firm, so 100s or 1000s of clients) I’ve often ended up having to rewrite in Java so I can throw more threads at servicing the requests (often CPU heavy). I may have missed something but I couldn’t figure out how to get the multi-threaded performance out of Python but of course no-GIL looks interesting for this!
18 comments

I would consider the following optimizations first before attempting to rewrite an HTTP API since you already did the hard part:

1. For multiple processes, use `gunicorn` [1]. It runs your app across multiple processes without you having to touch your code much. It's the same as having n instances of the same backend app, where n is the number of CPU cores you're willing to throw at it. One backend process per core, full isolation.

2. For multiple threads, use `gunicorn` + `gevent` workers [2]. This provides multiprocessing plus multithreaded functionality out of the box if your workload is IO-intensive. It's not perfect but works very well in some situations.

3. Lastly, if CPU is where you have a bottleneck, that means you have some memory to spare (even if it's not much). Throw some LRU cache or cachetools [3] over functions that return the same result or functions that do expensive I/O.

[1]: https://www.joelsleppy.com/blog/gunicorn-sync-workers/

[2]: https://www.joelsleppy.com/blog/gunicorn-async-workers-with-...

[3]: https://pypi.org/project/cachetools/

These don't really apply to the parent commenter's scenario.

1) gunicorn or any solution with multiple processes is going to just multiply the RAM usage. Using 10-100GB of RAM per effective thread makes this sort of problem very RAM bound, to the point that it can be hard to find hardware or VM support.

2) This isn't I/O bound.

3) If your service is fundamentally just looking up data in a huge in-memory data store, adding LRU caching around that is unlikely to make much of a difference because you're a) still doing a lookup in memory, just for the cache rather than the real data, and b) you're still subject to the GIL for those cache lookups.

I've also written services like this, we only loaded ~5GB of data, but it was sufficient to be difficult to manage in a few ways like this. The GIL-ectomy will probably have a significant impact on these sorts of use cases.

For #1, would copy on write help? Or does python store the counters on the objects?

Ha! Yes! Unfortunately I know this because of terrible reasons. Python is reference counted so copy-on-write doesn't work for this with Python objects (note: if your Python object is actually just a reference to a native object in a library all bets are off, may work or may not).

We had an issue with the service I mentioned above where VMs with ~6GB RAM weren't working, because at the point that gunicorn forked there was instantaneously >10GB RAM usage because everything got copied. We had to make sure that the data file was only loaded after the daemon fork, which unfortunately limits the benefits of that fork, part of the idea is that you do all your setup before forking so that you know you've started cleanly.
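To illustrate why reference counting interferes with copy-on-write, here is a minimal sketch, assuming CPython, where the reference count is stored in the object itself. `Record` and `refcount_delta_from_reading` are hypothetical names made up for this example. Merely pointing a second name at an object changes its stored count, which is a write to the memory page holding that object.

```python
import sys


class Record:
    """Hypothetical stand-in for one entry of a large in-memory dataset."""

    def __init__(self, value):
        self.value = value


def refcount_delta_from_reading(record):
    """Return how much the refcount grows while one extra name refers to record."""
    before = sys.getrefcount(record)
    alias = record  # a plain reference; the data itself is not modified
    during = sys.getrefcount(record)
    del alias
    return during - before
```

On CPython the delta is normally positive even though no field of the object was changed, which is the effect described above: "read-only" access from a forked worker can still dirty shared pages.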

> 1. For multiples processes use `gunicorn`

This will load up multiple processes like you say. OP loads a large dataset and gUnicorn would copy that dataset in each process. I have never figured out shared memory with gUnicorn.

> gUnicorn would copy that dataset in each process

Assuming you're on Linux/BSD/MacOS, sharing read-only memory is easy with Gunicorn (as opposed to actual POSIX shared memory, for which there are multiprocessing wrappers, but they're much harder to use).

To share memory in copy-on-write mode, add a call to load your dataset into something global (i.e. a global or class variable or an lru_cache of a free/class/static method) in gunicorn's "when_ready" config function[1].

This will load your dataset once on server start, before any processes are forked. After processes are forked, they'll gain access to that dataset in copy-on-write mode (this behavior is not specific to python/gunicorn; rather, it's a core behavior of fork(2)). If those processes do need to mutate the dataset, they'll only mutate their copy-on-write copies of it, so their mutations won't be visible to other parallel Gunicorn workers. In other words, if one request in a parallel=2 gunicorn mutates the dataset, a subsequent request has only a 50% likelihood of observing that mutation.

If you do need mutable shared memory, you could either check out databases/caches as other commenters have mentioned (Redislite[2] is a good way to embed Redis as a per-application cache into Python without having to run or configure a separate server at all; you can launch it in gunicorn's "when_ready" as well), or try true shared memory[3][4]

1. https://docs.gunicorn.org/en/stable/settings.html#when-ready 2. https://pypi.org/project/redislite/ 3. https://docs.python.org/3/library/multiprocessing.html#share... 4. https://docs.python.org/3/library/multiprocessing.shared_mem...
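A rough sketch of the "when_ready" approach, written as a gunicorn config file (which is plain Python). `get_dataset` and its placeholder contents are assumptions for illustration; in a real project the loader would presumably live in a module that your app also imports, so that workers see the already-populated cache after the fork.

```python
# gunicorn.conf.py (sketch, not a drop-in config)
import functools

# Number of worker processes; pick a value that fits your core count.
workers = 2


@functools.lru_cache(maxsize=1)
def get_dataset():
    """Hypothetical loader: replace the body with the real (slow) load."""
    return {"answer": 42}


def when_ready(server):
    """gunicorn server hook: called just after the server is started.

    Warming the cache here means the data is loaded once in the parent
    process; workers forked afterwards share it copy-on-write.
    """
    get_dataset()
```

Request handlers would then call `get_dataset()` and get the cached object rather than reloading it.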

One way to achieve similar performance is redis or memcached running on the same node. It really depends on the workload too. If it is lookups by key without much post-processing, that architecture will probably work well. If it's a lot of scanning, or a lot of post-processing, in-process caching might be the way to go, maybe with some kind of request affinity so that the cache isn't duplicated across each process.
> I may have missed something but I couldn’t figure out how to get the multi-threaded performance out of Python

Multiprocessing. The answer is to use the python multiprocessing module, or to spin up multiple processes behind wsgi or whatever.

> Historically I’ve written several services that load up some big datastructure (10s or 100s of GB), then expose an HTTP API on top of it.

Use the python multiprocessing module. If you've already written it with the multithreading module, it is a drop in replacement. Your data structure will live in shared memory and can be accessed by all processes concurrently without incurring the wrath of the GIL.

Obviously this does not fix the issue of Python just being super slow in general. It just lets you max out all your CPU cores instead of having just one core at 100% all the time.

Multiprocessing is not a real solution, it’s a break-glass procedure when you just need to throw some cores at something without any hope for reliability. Unless something has changed since I used python, it is essentially a wrapper on Fork.

This means you need to deal with stuck/dead processes. I’ve used multiprocessing extensively and once you hit a certain amount of usage, even in a pool, you just get hangs and unresponsive processes.

I’ve also written a huge amount of Cython wrapped c++ code which releases the GIL. This never hangs and I can multithread there all I want without issue.

Why would they get stuck/dead and why wouldn't that happen with threads which might be even worse as they're more tightly bound? At least with zombies or inactive processes you can detect and kill them externally - if needs be.

Haven't played with multiprocess at scale, so am genuinely interested.

If subprocesses die (segfault maybe) it isn't uncommon for them to not be cleaned up and/or cause the parent process to hang while it waits for the zombie to respond. That's one I experienced last week on Python 3.9. A thread that experienced that would likely kill the parent process or maybe even exit with a stacktrace. Way easier to debug, and doesn't require me to search through running tasks and manually kill them after each debug cycle.

My impression is that the multiprocessing module is a heroic effort, but unfortunately making the whole system work transparently across multiple OSs and architectures is a nearly insurmountable problem.

You may be interested in the concurrent.futures library, available for over a decade now. It keeps you from shooting yourself in the foot like that.

https://docs.python.org/3/library/concurrent.futures.html

Why do you think it would help?

It provides a nice interface, but it uses multiprocessing or multithreading under the hood, depending on which executor you use:

> The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.

Always setting a timeout on every IPC or network operation helps immensely. IIRC multiprocessing module allows that everywhere, but defaults to waiting forever in a couple of places.
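A minimal sketch of the "always set a timeout" advice, assuming a POSIX system where the "fork" start method is available. `run_with_timeout` and `_child` are hypothetical helpers, and the sleep stands in for real work that might hang.

```python
import multiprocessing
import time


def _child(conn, delay):
    """Pretend to work for `delay` seconds, then report back."""
    time.sleep(delay)
    conn.send("done")
    conn.close()


def run_with_timeout(delay, timeout):
    """Run _child in a subprocess; return its message, or None on timeout."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    reader, writer = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(writer, delay))
    proc.start()
    writer.close()  # the parent only reads
    result = None
    try:
        # poll() with a timeout instead of a bare recv(), which waits forever.
        if reader.poll(timeout):
            result = reader.recv()
    except EOFError:
        result = None  # the child died before sending anything
    finally:
        if result is None and proc.is_alive():
            proc.terminate()  # do not leave a stuck child behind
        proc.join(timeout=5)
        reader.close()
    return result
```

The same idea applies to `Queue.get(timeout=...)`, `Process.join(timeout=...)` and `Future.result(timeout=...)`.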
Zombies don't respond, they merely have to be wait()'d for. Which should take microseconds at most.

I've seen orphaned processes sometimes idle, sometimes busy doing god knows what. But Zombies OTOH are rarely a problem, and should be able to be dealt with easily.

Perhaps Python's desire to be Windows-compatible militates against a design more suitable for Unix.

Yep, multiprocessing is a cope.

If processes were a universal substitute for threads we wouldn't have threads. That reasoning only gets stronger when you apply python's heavy limitations, but it gets the most strength when you experience the awkwardness of multiprocessing firsthand.

There isn't much difference on Linux between threads and processes that share memory. Multiprocessing is fine, it's just slightly more isolated threads.

That's why I took special care to mention how python's multiprocessing module was particularly poor.

multiprocessing is a very good solution for scatter-and-gather (or map/reduce) type workloads: for example, ssh to 1000 machines, run some commands, grab the output, analyze it, do some action based on it, etc.

if you are managing a fleet of machines and have some tasks to do on each machine, then multiprocessing is the life saver.

There is a "fork" mode and a "spawn" mode. Fork (the default) tends to result in broken process pools as you say, spawn seems to work a lot better but the performance is worse.
I’m not a huge fan of Cython and the like. It seems to be more natural to open a tcp connection to a c/c++ program and let that do the heavy lifting. Anything else seems like not a proper UNIX style solution.
That's not natural at all. Eg pybind11 is more natural.
I want to warn people against multiprocessing in python though.

If you're thinking about parallelizing your Python process, chances are your Python code is CPU-bound. That's when you should stop and think, is Python really the right tool for this job?

From experience, translating a Python program into C++ or Rust often gives a speed-up of around 100x, without introducing threads. Go probably has a similar level of speed-up. So while you can throw a lot of time fighting Python to get it to consume 16x the compute resources for a 10x speed-up, you could often instead spend a similar amount of time rewriting the program for a 100x speed-up with the same compute resources. And then you could parallelize your Go/Rust/C++ program for another 10x, if necessary.

Of course, this is highly dependent on what you're actually doing. Maybe your Python code isn't the bottleneck, maybe your code spends 99% of its time in datastructure operations implemented in C and you need to parallelize it. Or maybe your use-case is one where you could use pypy and get the required speed-up. I just recognize from my own experience the temptation of parallelizing some Python code because it's slow, only to find that the parallelized version isn't that much faster (my computer is just hotter and louder), and then giving in and rewriting the code in C++.

The first thing you should do is profile the code (py-spy is my preferred option) and see if there are any obvious hotspots. Then I'd actually look at the code, and understand what the structure is. For example, are you making lots of unnecessary copies of data? Are you recomputing something expensive you can store (functools.cache is one line and can make things much faster at the cost of memory)?
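To show what the "one line" looks like, here is a small sketch using `functools.cache` (Python 3.9+). `expensive_score` and the `CALLS` counter are made up for the example; the counter only exists to show that the body runs once per distinct argument.

```python
import functools

CALLS = {"count": 0}  # illustration only: counts real executions


@functools.cache  # the one-line change; trades memory for speed
def expensive_score(n):
    """Hypothetical expensive computation."""
    CALLS["count"] += 1
    return sum(i * i for i in range(n))
```

Note that the cache is unbounded, so with many distinct arguments `functools.lru_cache(maxsize=...)` may be the safer choice.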

Once you've done that, you should be familiar enough with the code to know which bits are worth using multiprocessing on (i.e. the large, embarrassingly parallel bits), which, if they are a significant part of your code, should scale near-linearly.

The other thing to check is which libraries you are using (and what your dependencies are using). numpy now includes openblas (though mkl may be faster for your use case), but sometimes you can achieve large speedups through choosing a different library, or by ensuring that optional compiled speedups are actually being built.

Is there a better resource than the py-spy docs for figuring out how to use it?
>Use the python multiprocessing module. If you've already written it with the multithreading module, it is a drop in replacement. Your data structure will live in shared memory

Only if it can be immutable. So it can't be shared and changed by multiple processes as needed (with synchronization).

And even if you can have it mostly immutable, if you need to refresh it (e.g. after some time read a newer large file from disk to load into your data structure), you can't without restarting the whole server and processes.

So, it could work for this case, but it's hardly a general solution for the problem.

For this use case it would be better to put the data in a shared SQLite database than relying on multiprocessing CoW.

Even accessing objects from the shared memory would cause the reference counter to increment and the data would be copied, causing a memory usage explosion.

>For this use case it would be better to put the data in a shared SQLite database than relying on multiprocessing CoW

In Python yes. In Java you could take advantage of shared memory and get spared the overhead of SQLite.

Nowadays multiprocessing is rarely the answer. Between all the gotchas (memory usage can be horrific, have to be careful what you modify, etc.) it's almost never the right answer.

Nowadays numba is usually a better solution for when you want to run some computationally expensive python code that itself calls numpy, etc.

For the parent commenter's use case though that wouldn't be a great solution either. In general, Python does not have an optimal way of operating on a shared data structure across OS threads and certainly not in a way that doesn't require forking the interpreter.

You have to be much more careful about what you modify when using multithreading, so I'm not sure what you mean by that.

A lot of people here mention that sharing data is much easier with multithreading, but doing this without races is not easy.

You can't just use values from different threads like you would in normal code; you need to synchronize access with locks, which can be difficult to do correctly and can harm performance in a lot of cases.
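For a concrete picture of what "synchronize access with locks" means, here is a minimal sketch with the standard `threading` module. `Counter` and `hammer` are hypothetical names; the lock makes the read-modify-write in `increment` atomic with respect to other threads.

```python
import threading


class Counter:
    """A shared counter whose updates are guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        # Without the lock, `self.value += 1` is a read-modify-write
        # that two threads can interleave, losing updates.
        with self._lock:
            self.value += 1


def hammer(counter, n_threads=4, n_increments=10_000):
    """Increment the counter from several threads and return the total."""

    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

The cost is visible here too: every increment now acquires and releases a lock, which is the performance trade-off mentioned above.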

I think a lot of the people who complain about the GIL are going to become acutely aware of why it was useful when they attempt to use GIL-less multithreading, and realize that removing it wasn't as great as it sounded at first!

In my experience, most problems are inherently synchronous with lots of mutable state and complex data dependencies, or inherently parallel with lots of tasks that can run independently. Problems that can be easily parallelized already work fine with multiprocessing! Problems that can't be easily parallelized are not something you can just slap some threading on to get more performance, and will require a lot of work to keep state synced!

This is just my opinion though and I'm sure there are plenty of domains that I don't have experience with that will benefit from no-GIL python!

> Problems that can be easily parallelized already work fine with multiprocessing!

Yeah, except afaik you pay more in context switches, and sharing is more cumbersome. Also, each process's language runtime is likely working with less information, and you end up using more memory on multiple language runtime instances.

Frankly I'd just use Java or Go at that point and not even bother

Multithreading is hard but once you have been doing it a while, it becomes easy and most importantly, it’s stable.

When you have to deal with processes, there’s a lot of external factors out of your control because processes are much more visible and carry a lot of extra baggage.

Hard multithreading problems are fun. Hard multi-process problems are just tedious.

As I understand it on Linux processes and threads are implemented in almost the same way, just that threads share memory. I've heard it said several times that the idea that processes are "heavier" is a bit of a myth. I guess they need to allocate heap space and threads don't. I'm not an expert, just mentioning because it sounded like you might be believing something which is at odds with what people say about processes and threads on Linux.
I'm not a Linux kernel dev but I think this is true! Not sure what's up with the downvotes.

You can create a process/thread chimera with certain system calls, and get something that is in-between a thread and process if you want, which is neat but maybe not that useful.

Creating processes on Linux is actually much faster than people seem to realize. I can spawn at least a few thousand a second from a quick test of spawning bash instances.

Not sure why this is directed at my comment-- I didn't touch on synchronization.

Yes, locks like mutexes, semaphores, etc. and approaches like atomics, lockfree datastructures come into play when writing multithreaded code. There's no getting around that.

> In my experience, most problems are inherently synchronous with lots of mutable state and complex data dependencies, or inherently parallel with lots of tasks that can run independently. Problems that can be easily parallelized already work fine with multiprocessing!

This is a hot take though-- most problems that are truly embarrassingly parallel don't work as well as you'd think w/ multiprocessing. There's a ton of overhead there and when you do need synchronization steps (eg; in reductions) it can get pretty messy.

Over quite some time I've become convinced multiprocessing module is better than an optional GIL removal.

It may leave many useful bits on the table (compared to pure multithreaded coding, like C++/pthreads) but I've still been able to get it to scale my application performance (CPU-bound, large-memory) to the number of cores of even large boxes (96+ vCPUs). IIRC the future/concurrent library was key to being productive.

20 years ago I would have said differently, as at the time IronPython demonstrated a real alternative to CPython that was faster and fully multithreaded (including the container classes).

Sure, with multiprocessing you can get 96 python processes running at 100% CPU while sharing a large dataset.

Only problem is that 99% of that CPU usage is for serializing/deserializing IPC messages and total throughput would have been higher using a single process.

There are use-cases for multiprocessing. As long as data sharing between processes is insignificant, it can be quite performant. Just like using a bash-wrapper script that orchestrates a bunch of python (or other) processes.

Whatever happened to ironpython? I used to do a lot of C# development and remember dabbling with ironpython back in the day. It seemed like it was important to Microsoft, .Net added the whole concept of dynamic data types mostly to support ironpython and ironruby. But I never really used python much until recently, so of course when I finally needed to do python I looked for ironpython and it doesn’t appear to be a thing anymore.
It looks like Microsoft abandoned these dynamic language implementations in 2010. Maintaining parallel implementations of two complex, mature scripting languages is a huge feat. It would take some very expensive talent. That said, IronPython was loved by those who used it, which means it captured them in the DotNet ecosystem. Perhaps that win was not enough for Microsoft to continue the project. Ideally, Python foundation should "own" (and fund) Jython and IronPython development, but that takes (a lot of) money. (Sorry, I'm much less familiar with Ruby and IronRuby.)
It is still a thing, but it's open source now instead of maintained by Microsoft. There was a release that finally supports Python 3 in December last year.

I don't know how useful it is really, if you really want performance then you probably shouldn't choose Python to begin with, or you use the libraries which may not be compatible with IronPython. These days it barely takes me longer to build a simple script in C# than in Python either.

It's so-so. Python's core value is its huge stack of libs, and the most important ones fall down with IronPython because they use C and so on.

When we needed Python/C# interop, it was better to use python.net and integrate that way. Annoying to set up, but when it works you can get both to work seamlessly.

I don't really partake in programming "wars", but the idea of launching a set of separate processes instead of separate threads to do a bunch of IOs has always seemed weird to me. Yes, I have built software using Python. Yes, I have done things as you suggest. Now I use asyncio, since the syntax has matured and I finally understand coroutines, runners, tasks etc. Let's see where the GIL-less Python takes us.
I'm confused. If you're doing a "bunch of IOs" then that's the situation where people use threads in Python, not processes. The argument for processes in Python is CPU-bound workloads.
Yup. I work at the Space Telescope Science Institute, where we maintain pipelines for astronomical data that move petabytes, among other things. All of the heavy lifting is done in Python.
Loading 100GB into RAM and then calling fork() is just painting a giant OOM Killer target on your back. It'll work until something breaks the CoWs or the parent gets restarted while some forks still linger or other fun things like that.

Threads make it transparent to the OS that this memory really must be shared between compute tasks.

While that does sometimes happen, I find the risk to be overstated. Most simple "allocate a large, complex data structure (e.g. dict of vectors of dataclasses) before creating a multiprocessing.Pool/Process/concurrent.futures.ProcessPoolExecutor and then refer to parts of it in the executor's jobs" work that deals in GBs of data does not suffer from copy-on-write-induced OOM issues in my experience. If the data in the shared memory isn't mutated in python, the refcount mutations are rarely enough to dirty more than a fraction of a percent of pages (though there are pathological allocation/reference schemes where that's not true).

If you do have memory issues, calling 'gc.freeze()' right before creating your multiprocessing.Pool/Process/concurrent.futures.ProcessPoolExecutor is sufficient to mitigate refcount-related page dirtying in the vast majority of cases. In the small remaining minority of cases, 'gc.disable()' as suggested by the freeze docs[1] may help. If that still doesn't do it, or if your page-dirtying is due to actual mutations of data (not just refcounts), it may be time to reach for actual shared memory instead[2][3].

1. https://docs.python.org/3/library/gc.html#gc.freeze 2. https://docs.python.org/3/library/multiprocessing.html#share... 3. https://docs.python.org/3/library/multiprocessing.shared_mem...
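Roughly, the pattern described above could look like the sketch below. It assumes a POSIX system with the "fork" start method; `DATASET`, `load_dataset` and `lookup_in_child` are hypothetical names, and the tiny dict stands in for the real multi-GB structure.

```python
import gc
import multiprocessing

DATASET = {}  # stand-in for the large structure loaded before forking


def load_dataset(n):
    """Populate the global dataset, then freeze it ahead of any fork."""
    DATASET.clear()
    DATASET.update({i: str(i) for i in range(n)})
    # Move everything currently tracked into the permanent generation so
    # later collections do not touch (and dirty) those objects' pages.
    gc.freeze()


def _lookup_worker(conn, key):
    """Runs in the child: reads the fork-inherited global, no pickling of it."""
    conn.send(DATASET.get(key))
    conn.close()


def lookup_in_child(key, timeout=10.0):
    """Fork a child that looks up `key` in the inherited dataset."""
    ctx = multiprocessing.get_context("fork")  # assumption: POSIX only
    reader, writer = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_lookup_worker, args=(writer, key))
    proc.start()
    writer.close()
    try:
        return reader.recv() if reader.poll(timeout) else None
    finally:
        proc.join(timeout)
        reader.close()
```

Only the small lookup result crosses the pipe; the dataset itself is never serialized. This sketch does not address refcount updates, which can still dirty pages when the child touches objects.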

This exists, but one of two things happen, which still significantly slows things down. Either 1) you generate multiple python instances or 2) you push the code to a different language. Both are cumbersome and have significant effects. The latter is more common in computational libraries like numpy or pytorch, but in this respect it is more akin to python being a wrapper for C/C++/Cuda. Your performance is directly related to the percentage of time your code spends within those computation blocks otherwise you get hammered by IO operations.
You have to manually set up shared memory with its own API that has its own limitations, right? I thought some seamless integration was a new feature, but AFAICT, transfers between multiprocesses still leads to things being pickled and copied. Am I wrong?
> Am I wrong?

Only partially. When you send things to a multiprocessing.Pool/concurrent.futures.ProcessPoolExecutor, they're pickled and copied. "Sending" happens when passing arguments to e.g. "multiprocessing.Pool.apply_async()", "multiprocessing.Queue.put()" or "concurrent.futures.ProcessPoolExecutor.submit()".

However, there are two other ways to share data into your multiprocessing processes:

1. Copy-on-write via fork(2). In this mode, globally-visible data structures in Python that were created before your Pool/ProcessPoolExecutor are made accessible to code in child processes for (nearly) free, with no pickling, and no copying unless they are mutated in the child process. Two caveats here, which I've discussed in other comments on this thread: mutation may occur via garbage collection even if you don't explicitly change fork-shared data in Python[1]; and fork(2) is not used by default in multiprocessing on MacOS or Windows[2].

2. Using explicit shared memory data structures provided by Multiprocessing[3][4]. These do not incur the overhead (in CPU or copied memory) that pickle-based IPC does, but they are not without complexity or cost.

Unfortunately, truly "seamless integration" is not really possible with multiprocessing, so users will have to use one or more of the above strategies according to their application needs.

1. https://news.ycombinator.com/item?id=36940118 2. https://news.ycombinator.com/item?id=36941791 3. https://docs.python.org/3/library/multiprocessing.html#share... 4. https://docs.python.org/3/library/multiprocessing.shared_mem...
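As a small illustration of option 2, here is a sketch using `multiprocessing.shared_memory` (Python 3.8+). `roundtrip` is a hypothetical helper; for brevity the second handle is attached by name within the same process, whereas in practice the attach would happen in another process that was given the block's name.

```python
from multiprocessing import shared_memory


def roundtrip(payload: bytes) -> bytes:
    """Write payload into a shared block, then read it via a second handle."""
    # payload must be non-empty: a shared block needs a positive size.
    owner = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        owner.buf[: len(payload)] = payload
        # Attaching by name is what another process would do.
        reader = shared_memory.SharedMemory(name=owner.name)
        try:
            return bytes(reader.buf[: len(payload)])
        finally:
            reader.close()
    finally:
        owner.close()
        owner.unlink()  # free the block once nobody needs it
```

The complexity mentioned above shows up even here: the block holds raw bytes rather than Python objects, and someone has to own the `close()`/`unlink()` lifecycle.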

If you have a non trivial application, multiprocessing just takes a lot of memory. Every child process that you create duplicates the parent memory. There are some interesting hacks like gc.freeze that exploit the copy-on-write feature of forks to reduce memory, but ultimately you can only create a few hundred processes, compared to thousands of threads, because of memory consumption.
>If you have a non trivial application, multiprocessing just takes a lot of memory. Every child process that you create duplicates the parent memory.

Not really, unless you want to alter it. The OS uses copy on write behind the scenes for forked processes, so will use the same memory locations already loaded until/if you modify that. So parent memory isn't really duplicated.

As for any new memory allocated by each child process, that's its own.

Unfortunately, python garbage collector messes up copy on write. Here’s a blog from instagram on how they fixed it - https://instagram-engineering.com/copy-on-write-friendly-pyt...
Unfortunately the generational GC modifies bits all over the heap, so you have to use some tricks to really leverage copy on write (as the commenter alludes to).
Fork's copy on write does not mix well with garbage collection.
The situation is a bit more complicated than this. While it's usually not the case that child processes always duplicate parent memory, that does happen on certain platforms (MacOS and Windows) on some Pythons. Additionally, the situation regarding unexpected page dirtying of copy-on-write memory is nuanced as well, which some of the sibling comments allude to.

I'll copy the tl;dr from another comment I've made nearby:

There are three main ways to share data into your multiprocessing processes:

1. By sending that data to them with IPC/pickling/copying, e.g. via "multiprocessing.Pool.apply_async()", "multiprocessing.Queue.put()" or "concurrent.futures.ProcessPoolExecutor.submit()".

2. Copy-on-write via fork(2). In this mode, globally-visible data structures in Python that were created before your Pool/ProcessPoolExecutor are made accessible to code in child processes for (nearly) free, with no pickling, and no copying unless they are mutated in the child process. Two caveats here, which I've discussed in other comments on this thread: mutation may occur via garbage collection even if you don't explicitly change fork-shared data in Python[1]; and fork(2) is not used by default in multiprocessing on MacOS or Windows[2].

3. Using explicit shared memory data structures provided by Multiprocessing[3][4].

1. https://news.ycombinator.com/item?id=36940118 2. https://news.ycombinator.com/item?id=36941791 3. https://docs.python.org/3/library/multiprocessing.html#share... 4. https://docs.python.org/3/library/multiprocessing.shared_mem...

Multiprocessing is great. But then every process keeps its own copy of hundreds of gigabytes of stuff. May be okay, depending on how many processes you spawn.

If the bulk of the data is immutable (or at least never mutated), it can be safely shared though, via shared memory.

> every process keeps its own copy of hundreds of gigabytes of stuff. May be okay, depending on how many processes you spawn

That depends on how you're using multiprocessing. If you're using the "spawn" multiprocessing-start method (which was set to the default on MacOS a few years ago[1], unfortunately), then every process re-starts python from the beginning of your program and does indeed have its own copy of anything not explicitly shared.

However, the "fork" and "forkserver" start methods make everything available in python before your multiprocessing.Pool/Process/concurrent.futures.ProcessPoolExecutor was created accessible for "free" (really: via fork(2)'s copy-on-write semantics) in the child processes without any added memory overhead. "fork" is the default startup mode on everything other than MacOS/Windows[2].

I find that those differing defaults are responsible for a lot of FUD around memory management regarding multiprocessing (some of which can be found in these comments!); folks who are watching memory while using multiprocessing on MacOS or Windows observe massively different memory consumption behavior than folks on Linux/BSD (which includes folks validating in Docker on MacOS/Windows). There's an additional source of FUD among folks who used Python on MacOS before the default was changed from "fork" to "spawn" and who assume the prior behavior still exists when it does not.

This sometimes results in the humorously counterintuitive situation of someone testing some Python code in Docker on MacOS/Windows observing far better performance inside Docker (and its accompanying virtual machine) than they observe when running that same code natively directly on the host operating system.

If you're on MacOS (not Windows) and wish to use the "fork" or "forkserver" behaviors of multiprocessing for memory sharing, do "export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES" in your shell before starting Python (modifying os.environ or calling os.putenv() in Python will not work), and then call "multiprocessing.set_start_method("fork", force=True)" in your entry point. Per the linked GitHub issue below, this can occasionally cause issues, but in my experience it does so rarely if ever.

1. https://github.com/python/cpython/issues/77906

2. https://docs.python.org/3/library/multiprocessing.html#conte...
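One way to avoid depending on the platform default is to ask for a start method explicitly. A minimal sketch, where `context_for` is a hypothetical helper and falling back to "spawn" is my assumption about a sensible default (it is the one method available on every platform):

```python
import multiprocessing


def context_for(preferred="fork"):
    """Return a multiprocessing context for `preferred`, else fall back to spawn."""
    available = multiprocessing.get_all_start_methods()
    method = preferred if preferred in available else "spawn"
    return multiprocessing.get_context(method)
```

Creating pools and processes from the returned context (`ctx.Pool(...)`, `ctx.Process(...)`) keeps the choice local, unlike `set_start_method`, which changes the process-wide default.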

Is what you're describing only true of the "Framework" Python build on MacOS? It sounds like that's the case from a quick read of the issue you linked. I would say that people should basically never use the "Framework" Python on MacOS. (There's some insanity IIRC where matplotlib wants you to use the Framework build? But that's matplotlib)
> Is what you're describing only true of the "Framework" Python build on MacOS?

No. This behavior is present on any Python 3.8 or greater running on MacOS, enforced via "platform == darwin" runtime check: https://github.com/python/cpython/pull/13626/files#diff-6836...

You can check the default process-start method of your Python's multiprocessing by running this command: "python -c 'import multiprocessing; print(multiprocessing.get_start_method())'"

Python is also going to get a JIT eventually, so they’re fixing that too! One of the concerns with no gil was that it would make certain optimisations harder for the JIT, but it’s very cool to see both being worked on.
Or just use a language that was actually designed to be something other than a scripting language?
> Multiprocessing. The answer is to use the python multiprocessing module, or to spin up multiple processes behind wsgi or whatever.

I assume mod_wsgi under apache was not the answer here due to memory constraints. That being said, why not serve from disk and use redis for a cache. This should work well unless the queries had high cardinality.

Serve what from disk? If they are using Python, they are almost certainly writing an API server, not static files.
No, that’s about right.

The response, which isn’t technically wrong, is “unless you’re CPU bound, your application should be parallelized with a WSGI server. You shouldn’t be loading all that up in memory so it shouldn’t matter that you run 5 Python processes that each handle many many concurrent I/O bound requests.”

And this is kinda true… I’ve done it a lot. But it’s very inflexible. I hate programming architectures/patterns/whatnot where the answer is “no you’re doing it wrong. You shouldn’t be needing gigs of memory for your web server. Go learn task queues or whatever.” They’re not always wrong, but very regularly it’s the wrong time to worry about such “anti patterns.”

Yes, this is even more the case in languages that are popular with more "applied" programming audiences, like scientific computing. Telling them "no you should be using this complicated DBMS" (or whatever other acronym) is not productive.

It tends to get them exceptionally mad because their concern isn't the ideal way to write the code and architect the system. They simply want to write just enough code to continue their research, and even if they did care about proper architecture, they don't have the time or interest to learn and test a new library for every little thing. They'd rather put that time into reading up on their field of research.

This stance always rubbed me the wrong way a bit. Effectively, code is one of the tools a researcher uses to do their work. As soon as their work interacts with other people, for example when publishing a purportedly reproducible study or supplying novel algorithms to developers, they have a responsibility to deliver proper work that can be used and understood by other people. This is something we expect of every other profession, yet scientists appear to somehow have no concern for such lowly ambitions.

To be clear, I’m not advocating for data scientists to write production-grade webapps. But I absolutely think they should be bothered to write code that fulfills minimal requirements, is reproducible, documented, and mostly bug-free.

I think data scientists tend to have a lot of overlap with computer people so expectations for them may be a bit higher, my experience comes mainly from physicists.

Reproducible, documented and bug free is fine, they care plenty about those things too, the issue is the "no you're doing it the wrong way, use this entirely different technology instead" being based almost entirely on ideological reasons.

If we take C multithreading as an example, with my supervising scientist, multithreading is fine. He's willing to put some time into learning how it works because it's valuable and has had a stable interface backed by a reliable body for a while now. But if tomorrow you came up to him and insisted that doing multithreading was wrong without a solid technical reason (e.g. actual bugs and an explanation of how the only way to fix them is to dump the existing code and spend a few months redesigning) you'd get shot down.

Well, it's like showing your plan for painting a room, and asking "I seem to get stuck here after painting all but the corner, how do I get out of the corner?". The answer actually is "don't leave the corner for last".

Or like the martial arts student asking the master "how do I fight a guy 100m away with a rifle?" - "don't be there".

You have a single big data structure that can't be shared easily between multiple processes. Can't you use multiprocessing with that? Maybe mapping the data structure to a file and mmapping that in multiple processes? Maybe wrapping the whole thing in database instead of just using one huge nested dictionary? To me multi-threading sounds so much less painful than all the alternatives that I could imagine. Just adding multi-threading could give you >10x improvement on current hardware without much extra work if your data structure plays nice.
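As a rough illustration of the "map the data structure to a file and mmap it" idea, here is a minimal sketch. It assumes the data can be flattened into fixed-width records (here one 64-bit integer each), which is a big assumption for a nested dictionary; `write_table`, `read_record` and `demo` are hypothetical helpers, not part of any library.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<q")  # one little-endian signed 64-bit int per record


def write_table(path, values):
    """Serialize values as fixed-width records so offsets are computable."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_record(path, index):
    """Map the file read-only and decode a single record by index."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            (value,) = RECORD.unpack_from(mm, index * RECORD.size)
            return value


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_table(path, [n * 10 for n in range(1000)])
        return read_record(path, 42)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

In a real service each worker process would open the mapping once and keep it for its lifetime; the pages are then shared between processes through the OS page cache rather than duplicated per worker.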
> You have a single big data structure that can't be shared easily between multiple processes. Can't you use multiprocessing with that? Maybe mapping the data structure to a file and mmapping that in multiple processes? Maybe wrapping the whole thing in database instead of just using one huge nested dictionary?

That's a ton of additional complexity, not worth it for many use cases, and anything along the lines of "using multiple processes or threads to increase Python performance" does have (or at least did have) quite a few additional footguns in Python.

In that context, porting a very trivial ad-hoc application to Java (or C# or Rust, depending on what know-how exists in the team) would be faster, or at least not much slower, to do. It would also be easier to estimate reliably, because it reduces the chance of unexpected issues such as lower performance than expected.

Basically the moment "use mmap" or "use multi-processing" is a reasonable recommendation for something ad-hocish there is something really wrong with the tools you use IMHO.

How good is support for numpy / scipy / pandas or equivalents, if they exist, outside Python?

Actually the resulting structure should of course be dumped into an RDBMS or a graph DB and served from there more readily. Doing that takes skill and time though, which often are worth applying elsewhere.

The use case I'm thinking about is very simple: One big data structure that is mostly read from and sometimes written to. Use a single mutex with a shared lock for reading and an exclusive lock for writing. Then the readers are safe and would only block during updates when one writer is active. Everything else beside the data structure can be per-thread and wouldn't interfere.
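The shared/exclusive locking scheme described above could look roughly like this. The standard library has no reader-writer lock, so this sketch builds one on `threading.Condition`; `ReadWriteLock` and `SharedTable` are hypothetical names, and this simple version gives writers no priority, so a steady stream of readers can starve them.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers or one exclusive writer (no writer priority)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # wake a waiting writer, if any

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()  # wake waiting readers and writers


class SharedTable:
    """One big mostly-read structure guarded by the lock above."""

    def __init__(self):
        self._lock = ReadWriteLock()
        self._data = {}

    def get(self, key, default=None):
        self._lock.acquire_read()
        try:
            return self._data.get(key, default)
        finally:
            self._lock.release_read()

    def put(self, key, value):
        self._lock.acquire_write()
        try:
            self._data[key] = value
        finally:
            self._lock.release_write()
```

With the GIL in place the readers still take turns executing Python code; the pattern only pays off in parallel once threads can actually run simultaneously.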

The problem why we wouldn't want to port this application to another language is 100k lines of existing code that is best written in Python and no resources to rewrite all that.

> Basically the moment "use mmap" or "use multi-processing" is a reasonable recommendation for something ad-hocish there is something really wrong with the tools you use IMHO.

Hmm. So you're saying only languages which bury lock and mutex over shared data are appropriate to use for async parallelism over shared data? Because calling explicit lock() and release() isn't that hard. However it does incur a function call overhead. I suppose some explicit in-language support could minimise that partially.

no I never said that
One annoying part with multiprocessing in Python is that you could abuse the COW mechanism to save on loading time when forking. But Python stores ref counters together with objects so every single read will bust your COW cache.

Now, you wanted it simple, but got to fight with the memory model of a language that wasn't designed with performance in mind, for programs whose focus wasn't performance.

There's gc.freeze for that now https://docs.python.org/3/library/gc.html#gc.freeze

If you load something big before forking workers, there's no CoW issue with that big structure anymore.

gc.freeze prevents considering the objects in gc, but doesn’t disable reference counting so you’ll still have CoW issues. PEP 683 introduces a way to make an object immortal which disables reference counting, which will address that issue.
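A small sketch of the two mechanisms in this exchange, using only `sys.getrefcount`, `gc.freeze`, `gc.get_freeze_count` and `gc.unfreeze` from the standard library. The function names are hypothetical. The first function shows why reads dirty pages (taking a reference changes the count stored in the object); the second shows the freeze step, which keeps the collector away from existing objects but leaves reference counting untouched.

```python
import gc
import sys


def refcount_delta_from_one_read():
    """Show that merely holding a new reference writes to the object."""
    data = [object() for _ in range(1000)]
    before = sys.getrefcount(data[0])
    alias = data[0]  # a plain "read" that stores a reference
    after = sys.getrefcount(data[0])
    del alias
    return after - before


def freeze_before_fork():
    """Move everything currently tracked into the permanent generation."""
    gc.freeze()  # future collections ignore these objects
    frozen = gc.get_freeze_count()
    gc.unfreeze()  # undone here only so the demo leaves the process unchanged
    return frozen


if __name__ == "__main__":
    print(refcount_delta_from_one_read())
    print(freeze_before_fork())
```

In a pre-fork server the `gc.freeze()` call would sit right before the workers are forked and would not be undone.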
I'd go for a db, yeah, or if that's a really painful mapping, this, erm, is actually the sort of thing Go is pretty good at, and it's not too hard to write a fairly simple program that will traverse your data structure and communicate via a JSON API or something. That's a useful technique in general - separate the big heavy awkward thing from your main web processes.

While I hate how verbose and inexpressive it is, Go does hit a sweet spot of fairly good performance, even multi-core, while still being GCed so it's not nearly as foreign for a native python user.

It sounds I/O heavy, but you mention it being CPU-heavy in which case I’d say Python is just not the right tool for the job although you may be able to cope with multiprocessing.
Similar experience. Even with multi process and threads python is slow, very slow. Java, Go and .NET all provide a very performant out of box experience.
Python is both an interpreter, and quite dynamic. Both of these lead to lower performance when compared to less dynamic, compiled solutions. All of Java, Go, and .NET are compiled and (much) less dynamic.

This is absolutely an expected outcome.

These days even elisp can be compiled. I think Python needs to be dragged kicking and screaming into cutting edge '80s dynamic compilation technology.
I'm sure skilled volunteers would be very welcome.

There are numerous active, moderately serious efforts to both optimize and/or JIT Python bytecode. I think AOT compilation is mostly out-of-scope for 100% compatibility, but again, there's lots of different efforts to compile either subset languages or subsets of programs.

"Kicking and screaming" suggests some reluctance to embrace this, but I think that's probably unfair: it's just hard.

It isn't as if PyPy doesn't exist. Embracing it during the 16 years of its existence is another matter.
"absolutely an expected outcome."

Good day. Is it the right time to talk to you about Common Lisp?

To be fair, if you use CL in a similarly dynamic way as Python (don't compile anything, don't add any declarations etc) it won't be that much faster. You'll get some boost out of the stdlib stuff being compiled already, but otherwise it will incur similar performance penalties.
We can add Smalltalk, SELF, Dylan, JavaScript into the discussion then.
And maybe Strongtalk
Always a good time.
Node is pretty performant for anything IO related, not compiled and reasonably dynamic.
I think it's worth the clarification that Javascript is usually JITed; (C)Python isn't.

And that CPython's I/O isn't really the problem: some of its async event loop implementations are fairly competitive with Node.

But still ... yes.

Javascript has benefited from two decades of intensive, well-funded work by the best people in the business, with clear focus on performance as a high priority goal. Not to take away from those who work on Python, but I think it's fair to say the effort has had orders of magnitude difference.

I don't have a deep enough understanding to say whether the nature of Python or Javascript makes one better suited for performance optimization than the other. Python is perhaps able to benefit from seeing what's been done with Javascript, although of course Javascript has stood on the shoulders of its own giants.

3.11 and on should be comparable to Java for most use cases with multiprocessing (set up correctly of course)
How do you mean? 3.11 is something like 10-20% faster than earlier Python releases. Why should that make it comparable to Java? Typically Java is still several times faster than Python, and this is totally natural since Java performance benefits from static type declarations and the language is generally less dynamic than Python.

That said I still use Python for CPU intensive tasks since in my experience Numpy/Scipy/Numba etc does a good job speeding up the CPU intensive parts of Python code.

Static type declarations don't make Java fast. The compiler does. Dynamically typed languages with no type declarations can be very fast if the compiler can infer the types.

That's not to say that Python will ever get there. My understanding is that the design of the language and leaky implementation details make generally compiling Python to fast machine code nearly impossible.

Well, we already have a mature, real-world Python JIT in PyPy, with impressive performance.

I dunno if Python is ever gonna be as fast as Java or C#, but we know it can be much better.

I can't find any benchmarks of PyPy vs OpenJDK or GraalVM, but unless I'm mistaken it's still more than 100% difference, and maybe much, much more for pure-Python vs. Java.
My tip for this is Node.js and some stream processing lib like Highland. You can get ridiculous IO parallelism with a very little code and a nice API.

Python just scales terribly, no matter if you use multi-process or not. Java can get pretty good perf, but you'll need some libs or quite a bit of code to get nonblocking IO sending working well, or you're going to eat huge amounts of resources for moderate returns.

Node really excels at this use case. You can saturate the lines pretty easily.

0_o

Did I miss something? Does Node/Highland have good shared memory semantics these days?

I've always felt the best analogy to python concurrency was (node)js, but I admittedly haven't kept up all that well.

Wouldn't Elixir or Go be better for this use case? Node still blocks on compute heavy tasks.
I think they mentioned CPU intensive work, which I'm taking to imply that it's more CPU bound than I/O bound. So unless you're suggesting they use Node's web workers implementation for parallelism, the default single threaded async concurrency model probably won't serve them well.
Isn't Node single threaded, just like Python?
Python is technically multithreaded, but the GIL means only one thread can execute interpreter code at a time. If you use libraries written in C/C++, the library code can run in multiple threads simultaneously if they release the GIL.

I vaguely recall Node used to run multiple threads under the hood for disk I/O, but it might use kqueue/epoll these days.

Node is essentially a single-threaded API to a very capable multithreaded engine.

https://youtu.be/ztspvPYybIY

I am not too deeply experienced with Python so forgive my ignorance.

But I am curious to understand why you were not able to utilize the concurrency tools provided in Python.

A quick google search gave me these relevant resources

1. An intro to threading in Python (https://realpython.com/intro-to-python-threading/#conclusion...)

2. Speed Up Your Python Program With Concurrency (https://realpython.com/python-concurrency/)

3. Async IO in Python: A Complete Walkthrough (https://realpython.com/async-io-python/)

Forgive me for my naivety. This topic has been bothering me for quite a while.

Several people complain about the lack of threading in Python but I run into plenty of blogs and books on concurrency in Python.

Clearly there is a lack in my understanding of things.

Re (3): asyncio does not give you a boost for CPU bound tasks. It's a single-threaded, cooperative multi-tasking system that can (if you're IO bound) give you a performance boost.
Ehhh I mean you're not wrong, but I wouldn't say you're fully right either.

You can absolutely send stuff to a thread pool executor or process pool executor and then never await the returned value (or never have it return until interrupted), but the issues with shared memory (or really, the lack thereof compared to e.g. C) are still present, to my understanding.

Then again, I mean you can always spin up a SQLite server or something on the same machine, but that's stupid heavy and more of a workaround than a solution. Super excited for nogil.

https://docs.python.org/3/library/concurrent.futures.html#co...
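For reference, handing work from asyncio to an executor looks roughly like the sketch below. `cpu_heavy` and `handle_requests` are hypothetical names, and the thread pool is chosen only to keep the example self-contained; it is not a claim that threads speed up CPU-bound work.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def cpu_heavy(n):
    """Hypothetical stand-in for a CPU-bound query."""
    return sum(i * i for i in range(n))


async def handle_requests(sizes):
    loop = asyncio.get_running_loop()
    # Threads keep the event loop responsive but, under the GIL, do not run
    # pure-Python code in parallel. A ProcessPoolExecutor would, at the cost
    # of pickling arguments and results between processes.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [loop.run_in_executor(pool, cpu_heavy, n) for n in sizes]
        return await asyncio.gather(*futures)


if __name__ == "__main__":
    print(asyncio.run(handle_requests([10, 100, 1000])))
```

Either executor leaves the shared-memory question open: with processes, the big structure has to exist in (or be shared with) every worker.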

Not sure why you mention "thread pool executor", which of course does not get you concurrency due to the gil.
Pedantic nerd nitpick: it gives you concurrency but not parallelism. (Concurrent threads can be time sliced on one core)
It was clear from the context that he meant concurrently running not concurrently in progress. I wish nerds would give up on this parallelism/concurrency pedantry or at least choose some new nomenclature that didn't conflict so massively with the English meaning of "concurrent".

I mean it's not even right. Most parallel/concurrent pedants would consider multithreaded code to be "parallel" even if it is running on a single core.

I think the best thing is to talk about threads, because then you can distinguish e.g. OS threads and hardware threads.

You can throw python threads at it, but if each request traverses the big old datastructure using python code and serialises a result then you’re stuck with only one live thread at a time (due to the GIL). In Java it’s so much easier especially if the datastructure is read only or is updated periodically in an atomic fashion. Every attempt to do something like this in python has led me to having to abandon nice pythonic datastructures, fiddle around with shared memory binary formats, before sighing and reaching for java! Especially annoying if the service makes use of handy libraries like numpy/pandas/scipy etc!
The whole point of the GIL is that even if you use Python's threading or asyncio, you don't get any benefits from scaling beyond a single CPU core, because all of your threads (or coroutines) are competing for a single lock. They run "concurrently", but not actually in parallel. The pages you linked explain this in more detail.

In theory, multiprocessing could allow you to distribute the workload, but in a situation like OP describes -- just serving API requests based on a data structure -- the overhead of dispatching requests would likely be bigger than the cost of just handling the request in the first place. And your main server process is still a bottleneck for actually parsing the incoming requests and sending responses. So you're unlikely to see a significant benefit.

Threading in Python is fine if your threads are IO bound or spend their time in a C extension which releases the GIL. If you are CPU bound, then the GIL means effectively one thread can run at a time and you gain no advantage from multiple threads.
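One concrete case of a C extension releasing the GIL is `hashlib`, which is documented to release it while hashing buffers larger than 2047 bytes. A minimal sketch, with `digest` and `digest_all` as hypothetical helper names:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(blob):
    # hashlib releases the GIL for inputs larger than 2047 bytes, so several
    # of these calls can make progress on different cores at once.
    return hashlib.sha256(blob).hexdigest()


def digest_all(blobs, workers=4):
    """Hash every blob on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(digest, blobs))


if __name__ == "__main__":
    payloads = [bytes([i]) * 1_000_000 for i in range(4)]
    for hexdigest in digest_all(payloads):
        print(hexdigest)
```

Replace `digest` with a pure-Python loop and the same pool gives no speedup, which is the situation the original poster describes.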
I had this misunderstanding for a long time until I saw Go explain the difference: https://go.dev/blog/waza-talk

The confusion here is parallelism vs concurrency. Parallelism is executing multiple tasks at once and concurrency is the composition of multiple tasks.

For example, imagine there is a woodshop with multiple people and there is only one hammer. The people would be working on their projects such as a chair, a table, etc. Everyone needs to use the hammer to continue their project.

If someone needed a hammer, they would take the single hammer and use it. There are still other projects going on but everyone else would have to wait until the hammer is free. This is concurrency but not parallelism.

If there are multiple hammers, then multiple people could use the hammer at the same time and their project continues. This is parallelism and concurrency.

The hammer here is the CPU and the multiple projects are threads. When you have Python concurrency, you are sharing the hammer across different projects, but it's still one hammer. This is useful for dealing with blocking I/O but not computing bottlenecks.

Let's say that one of the projects needs wood from another place. There is no point in this project to hold on to the hammer when waiting for wood. This is what those Python concurrency libraries are solving for. In real life, you have tasks waiting on other services such as getting customer info from a database. You don't want the task to be wasting the CPU cycles doing nothing, so we can pass the CPU to another task.

But this doesn't mean that we are using more of the CPU. We are still stuck with a single core. If we have a compute bottleneck such as calculating a lot of numbers, then the concurrency libraries don't help.
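The "waiting for wood" case can be sketched with threads and a blocking call. Here `time.sleep` stands in for a network or database wait, and `fake_io` and `run_threaded` are hypothetical names. Four waits of 0.2 seconds overlap instead of adding up, because a sleeping thread is not holding the hammer.

```python
import threading
import time


def fake_io(delay):
    time.sleep(delay)  # stands in for a blocking network or database call


def run_threaded(n_tasks, delay):
    """Start n_tasks blocking waits at once and return the wall-clock time."""
    threads = [
        threading.Thread(target=fake_io, args=(delay,)) for _ in range(n_tasks)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_threaded(4, 0.2)
    print(f"4 waits of 0.2s took {elapsed:.2f}s in total")
```

Swap the sleep for a number-crunching loop and the total goes back to roughly the sum of the individual times.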

You might be wondering why Python only allows for a single hammer/CPU core. It's because it's very hard to get parallelism properly working, you can end up with your program stalling easily if you don't do it correctly. The underlying data structures of Python were never designed with that in mind because it was meant to be a scripting language where performance wasn't key. Python grew massive and people started to apply Python to areas where performance was key. It's amazing that Python got so far even with GIL IMO.

As an aside, you might read about "multiprocessing" Python where you can use multiple CPU cores. This is true but there are heavy overhead costs to this. This is like building brand-new workshops with single hammers to handle more projects. This post would get even longer if I explained what is a "process" but to put it shortly, it is how the OS, such as Windows or Linux, manages tasks. There is a lot of overhead with it because it is meant to work with all sorts of different programs written in different languages.

That’s right.

In the past, for read-only data, I’ve used a disk file and relied on the OS page cache to keep it performant.

For read-write, using a raw file safely gets risky quickly. And alternative languages with parallelism run rings around Python.

So getting rid of the GIL and allowing parallelism will be a big boon.

> I may have missed something

You did not miss anything. The GIL prevents parallel multi threading.

This is actually one of the reasons I was drawn to Ruby over Python. Ruby also has the GIL but jRuby is an excellent option when needed.
I wonder what led to JRuby attracting support while Jython did not? I know the Jython creator went on to other things (was it e.g. IronPython for dotnet?). I suppose it was the inverse with dotnet - e.g. IronPython surviving while IronRuby seems dead.

Is it just down to corporate sponsorship?

JRuby has been pretty actively maintained for about 15 years and had a big release this year.

It’s an impressive project.

I looked into it a long time ago (~10-12 years?), and was disappointed JRuby could not use extensions written in C. It's not surprising in retrospect, for obvious reasons, but has there been some progress in this area?
Twitter used JRuby and invested heavily for a time.
May I ask why you didn't consider writing that quick implementation in Java in the first place?
I don't think that Python was designed for this. I found it largely unsuited for such work. It is much easier to saturate IO with (in random order) F#, Rust or Java (which I have used in the scenarios you mentioned).
If your data doesn't change, you can leverage HTTP caching and lift a huge burden off of your service.
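A minimal sketch of what leaning on HTTP caching could mean here, assuming responses are deterministic for a given URL. `conditional_get` is a hypothetical helper that derives an `ETag` from the body and answers a matching `If-None-Match` with 304; the one-hour `max-age` is an arbitrary illustrative value.

```python
import hashlib


def conditional_get(body, if_none_match=None):
    """Return (status, headers, payload) for a cacheable GET response."""
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    headers = {"ETag": etag, "Cache-Control": "public, max-age=3600"}
    if if_none_match == etag:
        # The client (or an intermediate cache) already has this version.
        return 304, headers, b""
    return 200, headers, body


if __name__ == "__main__":
    status, headers, _ = conditional_get(b'{"answer": 42}')
    print(status, headers["ETag"])
    print(conditional_get(b'{"answer": 42}', headers["ETag"])[0])
```

Note that this version still computes the body before hashing it; the bigger win comes from a cache or reverse proxy in front of the service answering repeat requests within `max-age` without touching the service at all.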
Spin up as many processes as you need, map connections 1:1 to processes if possible.
You could have just used gunicorn and spawned multiple workers, maybe.
Why not load the data into a SQLite DB and let the clients query that? Is there a reason you're loading 10s/100s of GB into memory?
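A rough sketch of the SQLite suggestion using the standard-library `sqlite3` module. The `items` table, its columns, and the `build_db` and `lookup` helpers are hypothetical; whether the real structure maps onto tables this cleanly is exactly the open question.

```python
import sqlite3


def build_db(path, rows):
    """Create (or reuse) a key/value table and load rows into it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, value INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO items VALUES (?, ?)", rows)
    conn.commit()
    return conn


def lookup(conn, key):
    """Return the value for key, or None if it is absent."""
    row = conn.execute(
        "SELECT value FROM items WHERE key = ?", (key,)
    ).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    conn = build_db(":memory:", [("a", 1), ("b", 2)])
    print(lookup(conn, "b"))
```

With a file path instead of ":memory:", several worker processes can open the same database for reading, so the data is not duplicated per process.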
Are you just reading from this data structure? If so I wouldn't do any locking or threading, I'd just use asyncio to serve up read requests to the data and it should scale quite well. Multithreading/processing is best for CPU limited workloads but this sounds like you're really just IO-bound (limited by the very high IO of reading from that data structure in memory).

If you're allowing writes to the shared data structure... I'd ask myself am I using the right tool for the job. A proper database server like postgres will handle concurrent writers much, much better than you could code up hastily. And it will handle failures, backups, storage, security, configuration, etc. far better than an ad hoc solution.

> I'd just use asyncio to serve up read requests to the data and it should scale quite well.

Quoting GP:

>> often CPU heavy

We have to take their word for it that it's actually CPU heavy work, but if they're not lying and not mistaken then asyncio would do nothing for them.

Reading from memory is really not IO. Perhaps you're suggesting doing something like mmapping a file to memory, putting the data structure in that memory, and then using asyncio on the file to serve things, but this would only work if you can compute byte ranges inside the file to serve ahead of time, in which case there are much simpler solutions anyway. Most likely, when receiving a query they need to actually search through the datastructure based on the query, and it's very likely that this is the bottleneck, not just reading some memory.