by jillesvangurp 2863 days ago
Did they ever fix the global interpreter lock? It's sort of a showstopper for doing stuff concurrently in Python. I've done a bit of batch processing using the multiprocessing module, which uses processes instead of threads. This works, but it is a bit of a kludge if you are used to languages that support concurrency properly.

Concurrency and parallelism are two different things. Python is fine for concurrency.
And the article is about "Parallel Programming with Python", in order to "...take advantage of the processing power of multicore processors".
I believe that since the advent of ZeroMQ, parallelism is possible in almost any language, including Python.

My library lets you do parallelism in a unique way, where you do message passing parallelism without being explicit about it.

https://github.com/pycampers/zproc/

You make some extremely large claims about ZProc. What advantages does it have over every other message-passing library for every other language ever built (including the other zeromq bindings)?

TBH, your claims sound like you've just "discovered" message passing, which many, many languages, runtimes and operating systems have been using for years, if not decades. (https://en.wikipedia.org/wiki/Message_passing)

In other words... it's not a revolution.

ZProc seems to be a simple library that pickles data structures through a central (pubsub?) server.

This is not the way to get remotely close to "high performance". What you've created here is pretty much what multiprocessing gives you already in a more performant solution (i.e. no zeromq involved).

> What you've created here is pretty much what multiprocessing gives you already in a more performant solution (i.e. no zeromq involved)

Minor point of pedantry which I'll state because it's an often-overlooked timesaver for folks developing on multiprocessing: not only is MP potentially faster for transferring data between processes compared to this solution, but it can also be way, way faster in situations where you have all your data before creating your processes/pool and just want to farm it out to your MP processes without waiting for it all to be chunked/pickled/unpickled.

Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant* time, if the data's already present in e.g. a global when children are created.

This pattern can be used to totally bypass all considerations of performance/CPU/etc. for pickling/unpickling data and lends a massive speed boost in certain situations--e.g. a massive dataset is read into memory at startup, and then ranges of that dataset are processed in parallel by a pool of MP processes, each of which will return a relatively small result-set back to the parent, or each of which will write its processed (think: data scrubbing) range to a separate file which could be `cat`ed together, or written in parallel with careful `seek` bookkeeping.

Unix-ish OSes only, though (unless the fork() emulation in WSL works for this--I have not tested that).

* Technically it's O(N) for the size of data you have in memory at process pool start, because fork() can take time, but the multiplier is small enough in practice compared to sending data to/from MP processes via queues or whatever that it might as well be constant.

> Because of copy-on-write fork magic

Note that this works for big objects, but not for small objects. E.g. if you fork-share a large list of integers or dicts or something like that, then you don't get any memory usage benefits, because every access will cause a refcount-write and that will copy the whole page containing the object.
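
A minimal sketch of why this happens, assuming CPython: the reference count is stored in the object's own header, so merely binding another name to an object writes to the memory page that holds it. The variable names are illustrative only.

```python
import sys

# CPython keeps each object's reference count inside the object itself,
# so taking a new reference is a write to the object's memory page.
item = object()
before = sys.getrefcount(item)

alias = item  # no data is modified, yet the object's header is written
after = sys.getrefcount(item)

print(before, after)  # on CPython the second number is one higher
```

In a forked child, that header write is enough to make the kernel copy the whole page, which is why a page packed with many small objects gets copied almost immediately.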

> * Technically it's O(N) for the size of data you have in memory at process pool start

It's not quite that simple; sharing n pages can take very little time or a bit more time; it depends on how the pages are mapped; sharing a large mapping doesn't take longer than a small mapping.

> this works for big objects, but not for small objects

Very true; I went into some more detail about my typical use case above. Using MP for lots of small objects that you've already extracted from raw data/IO/whatever is a game of diminishing returns. It's in situations like that where traditional shared-memory starts looking more and more attractive. When I get to that point, while multiprocessing and some other packages provide a few nice abstractions over shmem, I start looking for other platforms than Python.
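
For reference, one of the shmem abstractions that multiprocessing provides can be sketched like this. It assumes a Unix-like OS (the "fork" start method is requested explicitly), and `scale_range` and `run_demo` are hypothetical names used for illustration.

```python
import multiprocessing as mp


def scale_range(shared, start, end, factor):
    # Writes land directly in the shared block; nothing is pickled
    # or sent back to the parent.
    for i in range(start, end):
        shared[i] *= factor


def run_demo():
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    # RawArray allocates a flat array of C doubles in shared memory.
    # It has no lock, so each worker must stick to its own range.
    shared = ctx.RawArray("d", [1.0, 2.0, 3.0, 4.0])
    workers = [
        ctx.Process(target=scale_range, args=(shared, 0, 2, 10.0)),
        ctx.Process(target=scale_range, args=(shared, 2, 4, 10.0)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return list(shared)


if __name__ == "__main__":
    print(run_demo())
```

This only handles flat arrays of C types; anything richer than that is where the abstraction starts to strain.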

> It's not quite that simple; sharing n pages can take very little time or a bit more time

Definitely; I was simplifying in order to compare the overhead of fork with the overhead of pickling/shipping/unpickling data. Sharing large pieces of data with even very slow fork()ing is, in my experience, so much faster than the [de]serialize approach that it is effectively constant in comparison, but I didn't mean to discount the complexities of what make certain forking situations faster/slower than others.

> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant* time, if the data's already present in e.g. a global when children are created.

Have you tried this or gotten it working? The fly in the ointment is the reference count. Add a reference and BOOM, you suddenly have a huge copy. It can be made to work efficiently in certain cases but takes a lot of care.

In practice, I find reference-count related issues with this pattern to be minor.

Most of the situations where I care enough about memory and/or pickling overhead fall into the "take a giant block of binary/string data and process ranges of it in parallel" family, in which case there aren't too many references until the subprocesses get to work. If I had more complex structures of data I'd probably get a little less performance bang for my buck, but even then I suspect it would be much faster than multiprocessing's strategy: pickling and sending data between processes via pipes is many times slower than moving the equivalent amount of data by dirty-writing pages into a forked child.

That's not meant to discount anything y'all are saying, though: refcounts are definitely a very important thing to be mindful of in this situation. A child comment suggests gc.freeze, which can help, but can't entirely save you from thinking about this stuff.

It's also very important to be mindful of what happens with your program at shutdown: if you have a big set of references shared via fork(), and all your children shut down around the same time, your memory usage can shoot up as each child tries to de-refcount all objects in scope. This applies even if each child was only operating on a subset of the references shared to it. If you're processing, say, 1GB of data from the parent in 8 children on a 4 core system (doing M>N(cpu) because e.g. children spend some time writing results out to the FS/network), a near-simultaneous shutdown could allocate 9GB of memory in the very worst case, which can cause OOM or unexpected swapping behavior. Throttled shutdowns using a semaphore or equivalent are the way to go in that case.

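You can call gc.freeze() (Python 3.7+) before forking. It moves every object the collector currently tracks into a permanent generation that later collections ignore, so GC passes in the children stop dirtying those pages. It does not pin reference counts, so ordinary refcount updates can still trigger copies.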
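
A rough sketch of the gc.freeze() pattern, following the order suggested in the gc module documentation (disable in the parent, freeze right before fork, enable in the child). It assumes Python 3.7+ and a Unix-like OS; `fork_with_frozen_heap` is a hypothetical helper name.

```python
import gc
import os


def fork_with_frozen_heap(data):
    gc.disable()  # avoid a collection punching holes in memory before fork
    gc.freeze()   # move all tracked objects to the permanent generation
    frozen = gc.get_freeze_count()

    pid = os.fork()
    if pid == 0:
        # Child: collections from here on skip the frozen objects.
        gc.enable()
        os._exit(0 if data else 1)

    _, status = os.waitpid(pid, 0)
    gc.unfreeze()  # parent: return the objects to normal collection
    gc.enable()
    return frozen, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(fork_with_frozen_heap(["x"] * 1000))
```

Freezing keeps the garbage collector from touching the shared pages; reads in the child still update reference counts as usual.
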
Performance doesn't equal Better software.

In fact, I think Performance centric development is a lesser known evil.

> have all your data before creating your processes/pool

Zproc exposes the required API for this (Nothing new, just the python API) :)

https://zproc.readthedocs.io/en/latest/api.html#zproc.Proces... (args and kwargs)

> a massive dataset

Wouldn't you be better off using a Database for that kind of work?

> Because of copy-on-write fork magic, many multiprocessing configurations (including the default) can "send" that data to child processes in constant time

Any resources on how to implement that?

> Any resources on how to implement that?

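  import multiprocessing

  # Load the data in the parent *before* the pool is created, so that
  # forked children see it copy-on-write instead of receiving a pickle.
  big_data = read_huge_binary_or_string()

  def process_range(rng):
    start, end = rng
    do_something(big_data[start:end])

  pool = multiprocessing.Pool(2)
  pool.map(process_range, [
    (0, 10000),
    (10000, len(big_data)),  # slices exclude `end`, so resume at 10000
  ])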
> "high performance"

I never claimed it to be performant!

"Above all, ZProc is written for safety and the ease of use."

(Read here - https://github.com/pycampers/zproc?files=1#faq)

> It's not a revolution

I totally agree. It's just a better way of doing things zmq already perfected. Like, tell me if you've ever seen a python object that has a `dict` API, but does message passing in the background.

> central (pubsub?) server.

Central server, yes. It uses PUB-SUB for state watching and REQ-REP for everything else.

> you've just "discovered" message-passing

Guess you're right? 2 years is a peanut on the time scale...

P.S. Thanks for all the feedback, I've been dying to hear something for a while now.

I would suggest you don't make dramatic claims for a subject that has decades of theory behind it with a huge amount of nuance depending on the exact workload and characteristics of the machines in question.

Don't get me wrong, message-passing has some advantages, but they certainly aren't that it 'solves' parallelism. If you wish to know more, investigate:

- Smalltalk and Erlang (for message passing languages).

- QNX (for a message-passing OS)

- mpi4py (for a message-passing Python library; MPI is the grandfather of message-passing libraries and runs everywhere).

- Occam & the transputer for an example of a hardware-mp implementation (actually it's Communicating Sequential Processes, but for your purposes it would be enlightening).

- golang for a modern-day implementation of CSP.

- Python implementation of CSP (https://github.com/futurecore/python-csp)

- Discussion about MP (http://wiki.c2.com/?MessagePassingConcurrency, for more just google it)

Basically, it's great that you want to learn about concurrency & parallelism, but you've come to a gun fight with a blunt butter knife.

HN comment section shouldn't be a gun fight.
> I would suggest you don't make dramatic claims

If you could point out some stuff from ZProc's page, that would be nice!

> mpi is the grandfather of message passing libraries

Never heard of it before, but a simple Google search suggests that it _might_ be more performant than zmq, but not as fault-tolerant or flexible. It really looks like a niche thing, judging from this comment by Pieter Hintjens:

> Why smart cloud builders are betting everything on 0MQ. In detail, compare to the alternatives. Hand-rolling your own TCP stack is insane. Using any broker-based product won't scale. Buying licenses from IBM or TIBCO would eat up your capital. Supercomputing products like MPI aren't designed for this scale. There is literally no alternative.

(http://zeromq.org/docs:the-ten-minute-talk)

> Don't get me wrong, message-passing has some advantages, but they certainly aren't that it 'solves' parallelism.

Doesn't it? (For most people)

---

I can't believe I'm hearing words against zmq on HN, it's weird.

Even the guys over at Dask settled on ZMQ over anything - https://github.com/dask/distributed/issues/776

P.S. Seems like you know quite a lot about this topic. Do you have any projects of your own that I can see?

Bottom line, I think most people would be happy doing message passing parallelism in the real world. Sure, it doesn't look that good in theory, but it works damn well in practice.

> My library lets you do parallelism in a unique way

That's a big claim which you don't really back up as much as you need to. Unique is an extremely high bar in this very busy field.

There are several other similar red flags on the linked GitHub; I think your enthusiasm is running away from you a little. You might want to dial the ten-dollar language back a bit – it made me immediately suspicious ("utterly perfect", for example, is another danger phrase).

It's the combination of grandiose language + solution-in-search-of-a-problem which leads to that.

If you're going to sell hard, what I would want to see is a large, complex, high-traffic system which makes extensive use of this; if you compare and contrast with Ray, which I've also only just encountered in this thread, there's a real problem (distributed hyperparameter optimization) which they've built a solution for with the library, and that immediately lends it credibility; I know the system can be used for something because it has been.

'utterly perfect' are not my words

http://zguide.zeromq.org/page:all#Multithreading-with-ZeroMQ

Thought linking it there would make it better, but I'll just remove it...

And you do make a good point. It doesn't really solve anything technically. But would you agree that it exposes a better API for doing much of the same stuff?

I wouldn't know without using it. That's where "software using this library" is a really useful bit of social proof. Think of Django; even without looking at the code you have a lot of evidence that it can conveniently solve a wide range of real problems.
Well HN I'd say is a pretty good place to raise social awareness
Update - I hopefully made the language a little better?

https://github.com/pycampers/zproc/blob/master/README.md

I think this is quite a lot better! Nice work.
> Zproc uses a Server, which is responsible for storing and communicating the state.
>
> This isolates our resource (state), eliminating the need for locks.

So you've just invented a new name for a coordinator process and called it a new fashion in computation?

No, he's reinvented multiprocessing... pickling data structures across multiple processes.

Just without the 'niceties'.
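
For comparison, a minimal sketch of what the standard library already offers here: `multiprocessing.Manager` hands out a dict-like proxy whose reads and writes are pickled and forwarded to a server process. It assumes a Unix-like OS (the "fork" start method is requested explicitly); `record_square` and `run_demo` are hypothetical names.

```python
import multiprocessing as mp


def record_square(state, n):
    # Looks like a plain dict write, but the proxy pickles the call
    # and sends it to the manager's server process.
    state[n] = n * n


def run_demo():
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    with ctx.Manager() as manager:
        state = manager.dict()
        workers = [
            ctx.Process(target=record_square, args=(state, n))
            for n in range(4)
        ]
        for worker in workers:
            worker.start()
        for worker in workers:
            worker.join()
        # Copy out of the proxy before the manager shuts down.
        return dict(state)


if __name__ == "__main__":
    print(run_demo())
```

Each worker writes a different key, so no locking is needed in this sketch; read-modify-write on a shared key would still need coordination.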

You're probably right, but see my comment above: not only is MP possibly superior at being a pickling/arbitrating server, but it also supports taking advantage of copy-on-write semantics on Unix-ish systems to transfer memory to children at startup in constant time with no pickling/unpickling necessary.
I agree, multiprocessing will be more performant than ZProc; much more thought has gone into it than into the simple 0mq wrapper that is ZProc.
Great, now just "stating" how things work is equivalent to inventing them!
They did not, which is why this "course" illustrates taking advantage of multiple cores via multiprocessing without mentioning the GIL at all. Which is a little misleading if you think about it.

Also, by having the introductory chapter be about "functional programming" (which incidentally Python does not do well), he completely bypasses the serious issue of shared state.

Which goes to show that parallelism in Python is more like a gimmick than a real-world solution since it doesn't let you do in-process shared-memory processing via threads in parallel which is so important for many applications. In my case, the vast majority of the time I do not want to farm workers out to different operating system processes and deal with serialization and communication, but this is the only way for Python code to take advantage of multiple cores [1].

[1] Another way is to write a module in C and have Python code call into it on a new thread and release the GIL while doing so, but of course this is even worse pain-wise than doing it with multiprocessing and you end up writing/compiling C.
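
The same escape hatch is available without writing any C when an existing C-backed module already releases the GIL. A minimal sketch, assuming CPython's hashlib, whose documentation says the GIL is released while hashing inputs larger than about 2 KB; `digest` and the chunk sizes are illustrative.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def digest(chunk):
    # The hashing loop runs in C with the GIL released for large inputs,
    # so several of these calls can occupy different cores at once.
    return hashlib.sha256(chunk).hexdigest()


# Four 4 MB buffers stand in for real data.
chunks = [bytes([i]) * 4_000_000 for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, chunks))

sequential = [digest(chunk) for chunk in chunks]
print(threaded == sequential)
```

This only helps when the hot loop lives inside such a module; pure-Python work between the calls still serializes on the GIL.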

> deal with serialization and communication

I thought a lot about this problem, for over 2 years, and came up with zproc

https://github.com/pycampers/zproc

Basically,

> It lets you do message passing parallelism without the effort of tedious wiring.

You'll be doing message passing without ever dealing with sockets!

Also, shared memory parallelism is hard to get right regardless of which language you use. I would recommend strongly against it, unless you're writing some really really really niche thing where message passing is a bottleneck (it isn't most of the time).

The mantra that shared memory parallelism is hard to get right, to the point where platitudes such as "unless you're writing some really really really niche thing" get uttered, is entirely erroneous in my experience.

There are idiot-proof thread-safe datastructures and producer/consumer APIs that map extremely well to most problems that come up in practice in the domain, that one should confidently use. Refusing to do shared memory parallelism because of the _abstract potential for havoc_ rather than any practical justifications based on the problem-at-hand is throwing out the baby with the bathwater and is not the mark of competent engineering.
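
As one illustration of that kind of structure, sketched in Python rather than in whatever language the commenter had in mind: the standard library's `queue.Queue` does its own locking, so a producer and a consumer thread can share it without an explicit lock. The function names and the `None` sentinel are illustrative choices.

```python
import queue
import threading

SENTINEL = None  # tells the consumer that no more work is coming


def producer(q, items):
    for item in items:
        q.put(item)  # blocks if the queue is full
    q.put(SENTINEL)


def consumer(q, out):
    while True:
        item = q.get()  # blocks until something arrives; no polling
        if item is SENTINEL:
            break
        out.append(item * item)


def run_demo():
    q = queue.Queue(maxsize=8)
    out = []
    threads = [
        threading.Thread(target=producer, args=(q, range(5))),
        threading.Thread(target=consumer, args=(q, out)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return out


if __name__ == "__main__":
    print(run_demo())
```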

This talk (hopefully) gets my point across

https://www.youtube.com/watch?v=9zinZmE3Ogk

You must be some sort of programming GOD, I guess.

The problem is that it's _hard_ to get right.

For example - It's not trivial to use locks when you're working at an abstraction level higher than operating systems. Most people don't even realise there is a race in their application, because locks are inherently non-enforcing. Code written with locks is also really hard to read and reason about.

Message passing just makes it a little more trivial to avoid the pitfalls associated with parallel programming.

I also found that it lets you avoid busy waiting in certain places, which is always a performance advantage :)

Can you shed some light on those "idiot-proof thread-safe datastructures"?

I do concurrency in Java all the time with CompletableFuture and threadsafe data structures provided by various libraries, e.g. the Guava caches, and I rarely need to use locks or semaphores. It's a good set of abstractions that make concurrency pretty close to idiot-proof.

Futures in particular make it easy to write concurrent code close to the way you would write single-threaded code, because all of the threading is handled behind the scenes.
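
The comment is about Java's CompletableFuture; a rough Python analogue of the same idea, using the standard `concurrent.futures` module, might look like this. `measure` and `total_length` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def measure(text):
    # Stand-in for a slow call such as a network request.
    return len(text)


def total_length(docs):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns immediately with a Future for each call.
        futures = [pool.submit(measure, doc) for doc in docs]
        # result() waits for each value; the rest reads like
        # ordinary single-threaded code.
        return sum(future.result() for future in futures)


if __name__ == "__main__":
    print(total_length(["a", "bb", "ccc"]))
```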

Busy-waiting is a valid technique for some use cases, and in those situations it gives better performance than other techniques.

Please research your topic.

Yes, but isn't it more CPU intensive?

(Speaking purely from experience. Don't have a fancy CS degree)

You claim "To make utterly perfect MT programs (and I mean that literally)".

You've rediscovered message-passing... please take an elementary CS course on parallel systems.

That claim is naive in the extreme.

That's not my claim, man; it's written in the zguide

http://zguide.zeromq.org/page:all#Multithreading-with-ZeroMQ

Maybe I should've just linked it there, sorry!

Okay, I will take that course and get back, thanks for the suggestion.

P.S. You just implied Pieter Hintjens is naive. You have to live with that now :(

I think you took that claim out of context:

"By "perfect MT programs", I mean code that's easy to write and understand, that works with the same design approach in any programming language, and on any operating system, and that scales across any number of CPUs with zero wait states and no point of diminishing returns."

That doesn't mean to say it's "perfect" or "solves" multithreading, just that it's easy to write and understand and portable across architectures. That says nothing of how optimal it is for concurrency or parallelism, ease-of-use-wise or performance-wise, just that it's 'easy'.

> That doesn't mean to say its "perfect" or "solves" multithreading, just that its easy to write and understand

Try saying that out loud?

> Did they ever fix the global interpreter lock? Sort of a show stopper with doing stuff concurrently in python.

It means thread-based parallelism of pure-Python code is unavailable; concurrency is just fine on Python.
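
A small sketch of concurrency without parallelism, using asyncio: everything runs in one thread on one core, yet the four waits overlap. `fake_io` and the 0.2 second delay are illustrative stand-ins for real I/O.

```python
import asyncio
import time


async def fake_io(label, delay):
    # While this task is suspended in sleep(), the event loop is free
    # to run the other tasks in the same thread.
    await asyncio.sleep(delay)
    return label


async def main():
    # gather() runs the coroutines concurrently and keeps input order.
    return await asyncio.gather(*(fake_io(n, 0.2) for n in range(4)))


def run_demo():
    started = time.perf_counter()
    results = asyncio.run(main())
    return results, time.perf_counter() - started


if __name__ == "__main__":
    print(run_demo())
```

The total time is close to a single delay rather than the sum of four, but CPU-bound work inside the coroutines would get no speedup from this.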

I have to work with Python on Windows and believe me, concurrency is not just fine in Python when you can't use fork().
Obligatory "concurrency != parallelism" statement; concurrency is fine on both platforms with Python threading in a single process with a GIL; parallelism is less of a done deal.

While it's a very big hammer, consider experimenting with Celery for your parallelism needs on Windows. I've had good results using per-script Celery "clusters" with either a filesystem (on a ramdisk for extra speed) or an embedded Redis backend to accomplish pretty nice bidirectional RPC-ish parallelism. The initial setup is much more complicated than something like goroutines, but once you get it working you can boilerplate it onto other tasks without much trouble.

It still won't save you from memory constraints imposed by the lack of good fork() emulation, though. Hopefully the WSL stuff will either bring better fork() emulation, or allow support for shared memory objects (e.g. multiprocessing.Value) in order to ease some of that pain.