by hexa00 2053 days ago
I work in an ML ecosystem at the moment, and concurrency is a major problem in Python:

  - Threads can't be used efficiently because of the GIL

  - multiprocessing has to serialize everything in a single thread, often killing performance. (Unless you use shared-memory techniques, but that's less than ideal compared to threads.)

  - You can't use multiprocessing while inside a multiprocessing executor. This makes building things on top of frameworks/libs that use multiprocessing a nightmare... e.g. try to use a web server over something like Keras...
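To make the serialization point concrete, here is a minimal sketch, assuming the standard `pickle` module, which `multiprocessing` uses by default to move arguments and results between processes. The `round_trip` helper and the `batch` data are hypothetical and only for illustration:

```python
import pickle


def round_trip(obj):
    """Serialize and deserialize obj, as happens when it crosses a process boundary."""
    payload = pickle.dumps(obj)      # bytes that would be sent to the other process
    restored = pickle.loads(payload)  # a fresh copy rebuilt on the receiving side
    return payload, restored


# Hypothetical stand-in for a batch of features handed to a worker process.
batch = [float(i) for i in range(10_000)]
payload, restored = round_trip(batch)

print(len(payload), "bytes serialized")
print("equal:", restored == batch, "| same object:", restored is batch)
```

Every object that crosses a process boundary pays for the dump, the transfer of `payload`, and the load; threads sharing one address space would skip all three.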
Those are the top reasons I don't like Python, but if you've got an appetite for more:

  - The dependency ecosystem is a pita, between Python versions, package versions pinned or unpinned, requirements.txt, pipenv, poetry, conda... pick one and you're still sure to run into issues with other tools needing one system or another, or packages working a bit differently in conda, etc. (I use poetry, with conda or pyenv)

  - The culture of "let's write code easily" is good to start with, but it becomes a problem as people, especially maybe in DS, don't go further than that... and you end up with bad practices all over the place: untestable code (the test systems are also a pain to navigate), copy-and-pasted blobs, etc. Reading the code of some major libraries doesn't inspire confidence, especially compared to, say, Java, C++, Go...
And a last note: I've seen way better Emacs setups for Python and presentations. It's OK as it is, but I would not call it a Jimi Hendrix of Python like a comment said...

I wish ML/DS would switch to Julia

3 comments

Could you give examples of where exactly in the ML process/lifecycle you're hitting these issues?

For example: "When training a [type] model with X characteristics, the GIL causes Y, which makes it impossible to do Z".

We're building our machine learning platform[0] to solve problems we have faced shipping ML products to enterprise, and are interested in your problems as well.

For example, we've faced the environment/dependencies/"runs on my machine" problems and have addressed these with Docker images. Our users can spin up a notebook server with near real-time collaboration to work with others, and no setup because the environment is there.

The same goes for training jobs: they can click a button and schedule a long-running notebook that runs against a specific environment, to avoid "just yesterday I had X accuracy on my machine". Runs, models, parameters, and metrics are tracked automatically, because if we rely on a notebook author to do it, they might forget or have to context switch, which adds cognitive load.

Some problems we faced were during deployment, too, where a "data scientist" writes a notebook to train a model and then we had to deploy that model by reading their notebook or digging into its dependencies. Now they can click a button and deploy whichever model they want. It really was hindering us because they had to ask for help from someone else, who may have been working on something else.

- [0]: https://iko.ai

In 2020, with core counts going up and up, I reach for Elixir where I might have used Python in the past, for these reasons.
I've been building a program that heavily uses multiprocessing for the past few months. It works quite well, but it did take me a little bit to figure out the best way to work with it.

> - Threads can't be used efficiently because of the GIL

Python's "threads" are actually fibers. Once you shift your thought process toward that, it's easy enough to work with them. Async is a better solution, though, because "threads" aren't smart when switching between themselves. Async makes concurrency smart.

But if you want to use real threads, multiprocessing's "processes" are actually system threads.

> - multiprocessing has to serialize everything in a single thread, often killing performance. (Unless you use shared-memory techniques, but that's less than ideal compared to threads.)

I'm not quite sure what you mean. Multiprocessing's processes have their own GIL and are "single-threaded", but you can still spawn fibers and more processes from them, as well as use async.

Or are you talking about using the Manager and namespaces to communicate between processes? That is a little slow, yes. High speed code should probably use something else. Most programs will be fine with it, but it is way slower than rolling your own solution. However, it does work easily, so that's something to be said about it. Shared memory space techniques do work, too, but they are a little obtuse. Personally, I rolled my own data structures using the multiprocessing primitives. You have to set them up ahead of time, but they're insanely fast. Or you can use redis pubsub for IPC. Or write to a memory-mapped file.
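A minimal sketch of the shared-memory route mentioned above, assuming Python 3.8+ and the standard `multiprocessing.shared_memory` module. For brevity both handles live in one process here; a second process would attach by name in the same way. `write_then_read` is a hypothetical helper, not anything from the program described above:

```python
from multiprocessing import shared_memory


def write_then_read(values):
    """Write bytes through one handle and read them back through a second one."""
    n = len(values)
    owner = shared_memory.SharedMemory(create=True, size=n)
    try:
        # The block may be rounded up to a page size, so slice explicitly.
        owner.buf[:n] = bytes(values)
        # A worker process would attach like this, knowing only the name.
        peer = shared_memory.SharedMemory(name=owner.name)
        try:
            return list(bytes(peer.buf[:n]))
        finally:
            peer.close()
    finally:
        owner.close()
        owner.unlink()  # free the block once every handle is closed


result = write_then_read([1, 2, 3, 4])
print(result)
```

Nothing is pickled here: both handles see the same bytes, which is where the speed comes from, at the cost of managing layout and cleanup by hand.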

> - You can't use multiprocessing while inside a multiprocessing executor. This makes building things on top of frameworks/libs that use multiprocessing a nightmare... e.g. try to use a web server over something like Keras...

I'm not sure what you mean. Multiprocessing simply spawns other Python processes. You can spawn processes from processes, so I don't know why you would have issues. Perhaps communication is an issue?

> - The dependency ecosystem is a pita

Yes, absolutely.

> Python's "threads" are actually fibers.

They’re actually not. They are native threads with high lock contention.
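One way to check this, as a sketch assuming Python 3.8+ (for `threading.get_native_id`): every thread reports the same process ID but its own OS-level thread ID. `collect_thread_ids` is a hypothetical helper name:

```python
import os
import threading


def collect_thread_ids(n_threads=4):
    """Return (pid, native thread id) pairs recorded by n_threads live threads."""
    barrier = threading.Barrier(n_threads)
    lock = threading.Lock()
    seen = []

    def worker():
        barrier.wait()  # all threads are alive at once, so no id can be reused
        with lock:
            seen.append((os.getpid(), threading.get_native_id()))
        barrier.wait()  # nobody exits until everyone has recorded its ids

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


ids = collect_thread_ids()
print("process ids:", {pid for pid, _ in ids})
print("native thread ids:", {tid for _, tid in ids})
```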

Async is arguably fibers, as are greenthreads in libraries like gevent or eventlet.
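A small sketch of that fiber-like behavior, assuming only the standard `asyncio` module: coroutines switch only at an explicit `await`, so the interleaving is cooperative rather than preemptive. The `worker` and `main` names are hypothetical:

```python
import asyncio


async def worker(name, log, steps=3):
    for i in range(steps):
        log.append((name, i))
        await asyncio.sleep(0)  # explicit yield point back to the event loop


async def main():
    log = []
    # Both coroutines run on one thread and take turns at each await.
    await asyncio.gather(worker("a", log), worker("b", log))
    return log


order = asyncio.run(main())
print(order)
```

The entries alternate between "a" and "b" because each coroutine hands control back at every `await`; without that yield point, one would run to completion before the other started.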

> But if you want to use real threads, multiprocessing's "processes" are actually system threads.

They’re system threads running in separate memory spaces. Also known as… processes.

You're right. To me they just feel like fibers because they can't run in parallel.
If you use numba (or cython, c extensions, etc) you can make them run without requiring that they hold the GIL, and they can run in parallel. Here's an example that should keep a CPU pegged at 100% utilization for a while:

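  import numba as nb
  from concurrent.futures import ThreadPoolExecutor
  from multiprocessing import cpu_count

  # nogil=True releases the GIL while the compiled function runs, so several
  # threads can execute it in parallel. It only applies in nopython mode.
  @nb.jit(nopython=True, nogil=True)
  def slow_calculation(x):
      out = 0.0
      for i in range(x):
          out += i**0.01
      return out

  # One worker thread per core, each running a long CPU-bound loop.
  ex = ThreadPoolExecutor(max_workers=cpu_count())
  futures = [ex.submit(slow_calculation, 100_000_000_000+i) for i in range(cpu_count())]

> and they can run in parallel.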

Even without requiring the GIL, these are still child threads of the main process, correct? And because of that, wouldn't the OS keep them all on the same core? And if that's the case, would ProcessPoolExecutor solve that problem?

I had no idea that existed, thank you!
No man. Python threads are not fibers. This is factually wrong. Please read: https://wiki.python.org/moin/GlobalInterpreterLock