| I would consider the following optimizations first before attempting to rewrite an HTTP API since you already did the hard part: 1. For multiples processes use `gunicorn` [1]. Runs your app across multiple processes without you having to touch your code much. It's the same as having the n instances of the same backend app where n being the number of CPU cores you're willing to throw at it. One backend process per core, full isolation. 2. For multiple threads use `gunicorn` + `gevent` workers [2]. Provides multiprocessing + multithreaded functionality out of the box if you have IO intensive. It's not perfect but works very well in some situations. 3. Lastly, if CPU is where you have a bottleneck, that means you have some memory to spare (even if it's not much). Throw some LRU cache or cachetools [3] over functions that return the same result or functions that do expensive I/O. [1]: https://www.joelsleppy.com/blog/gunicorn-sync-workers/ [2]: https://www.joelsleppy.com/blog/gunicorn-async-workers-with-... [3]: https://pypi.org/project/cachetools/ |
1) gunicorn or any solution with multiple processes is going to just multiply the RAM usage. Using 10-100GB of RAM per effective thread makes this sort of problem very RAM bound, to the point that it can be hard to find hardware or VM support.
2) This isn't I/O bound.
3) If your service is fundamentally just looking up data in a huge in-memory data store, adding LRU caching around that is unlikely to make much of a difference because you're a) still doing a lookup in memory, just for the cache rather than the real data, and b) you're still subject to the GIL for those cache lookups.
I've also written services like this, we only loaded ~5GB of data, but it was sufficient to be difficult to manage in a few ways like this. The GIL-ectomy will probably have a significant impact on these sorts of use cases.