Practical threaded programming with Python (ibm.com)
43 points by gklein 4697 days ago
4 comments

Please note that any use of Python threads (in CPython, at least) will still use only one core at a time. So things like the datetime stuff, using BeautifulSoup, etc., will not be done in parallel.

To do stuff on more than one core, look at the multiprocessing module.
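As a rough sketch of what that can look like, the following spreads a CPU-bound job over worker processes with `multiprocessing.Pool`. The prime-counting task and the names `count_primes` and `count_primes_parallel` are made up for illustration; they are not from the article.

```python
import multiprocessing


def count_primes(bounds):
    """CPU-bound work: count primes in [lo, hi) by trial division."""
    lo, hi = bounds
    total = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            total += 1
    return total


def count_primes_parallel(limit, workers=4):
    """Split [0, limit) into one chunk per worker process and sum the counts."""
    step = limit // workers
    chunks = [
        (i * step, limit if i == workers - 1 else (i + 1) * step)
        for i in range(workers)
    ]
    with multiprocessing.Pool(workers) as pool:
        return sum(pool.map(count_primes, chunks))


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing
    # this module instead of forking.
    print(count_primes_parallel(20000))
```

Each worker is a separate interpreter with its own GIL, so the chunks can run on different cores. Arguments and results are pickled between processes, so this pays off mainly when each chunk does a fair amount of work.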

That is true. I think the date example was more of a demonstration of how to use threads. But yeah, the BeautifulSoup bit might actually run a bit slower if it is only doing parsing (unless there is a C extension underneath that releases the GIL).

Any discussion of Python and threads is always confusing and it seems to me there aren't many people who understand how it works (I know you do, not talking about your comment, just in general).

In one camp you have people who like to write that Python is no good and threads are just broken: never use them. In the other camp are people who say threads are fine, they work great, I never had any issues with them. A lot of the time, people in this camp are just reacting to the ones in the first camp, but also without understanding the underlying mechanism.

I think the first thing that should be mentioned in any introductory article on Python threads is that Python threads work great for IO concurrency but they won't help with CPU concurrency. Things like downloading files or sending data over a socket will work nicely. Things like computing determinants won't. You can still structure your code in a threaded fashion so that you can move to the multiprocessing module in the future, but you won't get a speedup. So threads are not completely unusable and broken, but they do have a surprising limitation.
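A minimal sketch of the IO-bound case, assuming Python 3 and using `time.sleep` as a stand-in for a blocking network call (like real socket reads, it releases the GIL while waiting). `fake_download` and `download_all` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(name, delay=0.2):
    """Stand-in for a blocking network call; sleeping releases the GIL."""
    time.sleep(delay)
    return name


def download_all(names, workers=8):
    """Run the blocking calls on a pool of threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_download, names))


start = time.perf_counter()
results = download_all(["a", "b", "c", "d"])
elapsed = time.perf_counter() - start
# Four 0.2 s waits overlap, so the total should be well under the 0.8 s
# that running them one after another would take.
print(results, round(elapsed, 2))
```

Swapping the sleep for a pure-Python computation would remove the benefit, since only one thread could then execute bytecode at a time.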

Overall, in my career with Python I have probably dealt more with IO concurrency, and threads helped there quite a bit. Others will have a different experience depending on their area of expertise.

Also, it is worth mentioning that libraries like NumPy, and C extensions in general, have the option of releasing the GIL, and if they do, they can get a speedup. I have done this once by hand, with a handwritten extension, and it did help. I haven't personally tested NumPy's speedup.

ADDITION:

It is also worth mentioning that even though you do not get a speedup for CPU-bound concurrency, you still have to deal with synchronization issues. So you get the worst of both worlds. Just something to keep in mind.
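To make the synchronization point concrete, here is a small sketch with a shared counter guarded by `threading.Lock`. The `Counter` class and `hammer` function are illustrative names. Without the lock, `self.value += 1` is a read-modify-write that threads can interleave, GIL or not.

```python
import threading


class Counter:
    """Shared counter whose increments are serialized by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock makes the read-modify-write below atomic across threads.
        with self._lock:
            self.value += 1


def hammer(counter, n):
    for _ in range(n):
        counter.increment()


counter = Counter()
threads = [
    threading.Thread(target=hammer, args=(counter, 10000)) for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.value)  # 40000
```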

I think the discussion around Python threads revolves too much around performance. Imho Python threads work great when used for what they are suited for: managing control flow, or making the code more readable. For example, if you have a task in your app that needs to be done every 5 minutes, you could just write something like:

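    import threading
    import time

    class ThreadClass(threading.Thread):
        def run(self):
            while True:
                do_stuff()          # the periodic task, defined elsewhere
                time.sleep(5 * 60)  # wait 5 minutes between runs

and launch it in the background.
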
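One possible way to launch such a thread so that it can also be shut down cleanly is sketched below. The `PeriodicTask` class, its `stop` method, and the short interval are assumptions added for illustration. Instead of sleeping, it waits on a `threading.Event`, which lets the loop wake up early when asked to stop.

```python
import threading
import time


class PeriodicTask(threading.Thread):
    """Run task() every `interval` seconds until stop() is called."""

    def __init__(self, interval, task):
        super().__init__(daemon=True)  # daemon: won't block interpreter exit
        self.interval = interval
        self.task = task
        self.stop_requested = threading.Event()

    def run(self):
        while True:
            self.task()
            # wait() returns True as soon as stop() sets the event,
            # or False after `interval` seconds have passed.
            if self.stop_requested.wait(self.interval):
                break

    def stop(self):
        self.stop_requested.set()


runs = []
worker = PeriodicTask(0.05, lambda: runs.append(time.monotonic()))
worker.start()   # runs in the background
time.sleep(0.2)  # the main thread is free to do other things meanwhile
worker.stop()
worker.join()
print(len(runs), "runs")
```

For the every-5-minutes case from the comment, the interval would be `5 * 60`.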
> I think the first thing that should be mentioned in any introductory article on Python threads is that Python threads work great for IO concurrency but they won't help with CPU concurrency. Things like downloading files, sending data over a socket will work nicely.

Python's threads certainly work acceptably for IO-bound purposes, but given the overhead of creating a real OS thread and the potential for GIL thrashing when using Python 2.x on a multicore machine, I'm not sure why you wouldn't favor a greenlet-based solution in most such cases, especially since you don't even really have to drop the threading idiom to do so.

Good point. I use gevent and am now switching back to eventlet. But that is a different post perhaps.

Not only do you get more threads with greenlet, you also don't have to worry about a whole class of synchronization side effects, since a greenlet will only switch contexts on an IO operation. (Now some might argue that is bad, since you could be calling a function without knowing what happens inside it or what it might do in the future, so you should lock anyway.)

Out of curiosity, why the switch? I went with gevent pretty early on because it seemed like eventlet had some weird quirks, but I have to admit I never really gave it a close look.

Because it is supported in older Pythons on some servers I work on. It works with PyPy. It has fewer dependencies (just greenlet) and is easier to build.

gevent has also been thrashing around (is it at 1.0 beta?), switching to libev or libevent. And eventlet has picked up more steam, with more test coverage.

So no one big issue, just a bunch of small ones.

To say "will still use only one core at a time" is a common misconception. It depends. CPython's GIL is not held during many file or network operations and may be released by C code. It's true that multiprocessing is generally a better solution for distributing CPU-bound tasks across cores, but let's not oversimplify.

And as some people may not know, some implementations like Jython do not have a GIL.

This should be at the top. A CPython extension author need do no more than put Py_BEGIN_ALLOW_THREADS before some native work and Py_END_ALLOW_THREADS on completion, literally just 2 macros. In between, the interpreter can run freely on another thread.
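The macros themselves live in C, but the effect can be observed from Python with a standard library module that uses them. As a sketch: CPython's `hashlib` documents that it releases the GIL while hashing data larger than 2047 bytes, so the threads below can hash their buffers at the same time. The buffer sizes and the `digest` helper are arbitrary choices for illustration, and any speedup depends on the machine and the Python build.

```python
import hashlib
import threading


def digest(data, out, index):
    # hashlib drops the GIL while hashing large buffers, so several of
    # these calls can make progress on different cores at once.
    out[index] = hashlib.sha256(data).hexdigest()


# Four 4 MiB buffers, each filled with a different byte value.
blobs = [bytes([i]) * (4 * 1024 * 1024) for i in range(4)]
results = [None] * len(blobs)

threads = [
    threading.Thread(target=digest, args=(blob, results, i))
    for i, blob in enumerate(blobs)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for hexdigest in results:
    print(hexdigest[:16])
```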

Consider a network server running a Twisted main loop, serving static files from cold mechanical disk. In this case, you could have 80 or more disk IO worker threads running without feeling any contention, assuming they're asleep for a minimum of 12ms/request, which would be an ideal seek time assuming the disks weren't under any load.

Assuming one disk, those 80 sleeping threads do something particularly useful that's difficult to accomplish without threads: they let the kernel IO scheduler reorder the requests to minimize seeks, and in a much more general way than the large variety of crappy AIO APIs available on UNIX.

There's a trillion uses like this for threads in Python.

The GIL is not an issue if you aren't interested in doing parallelism, and if you are then there are much better ways of doing it that don't rely on non-determinism (and thus are easier to reason about). You only really need python's threading library if you: 1) Need concurrency (not parallelism), 2) Don't need that many threads, and 3) Are okay with dealing with potential shared mutable state and synchronization of said state (the Queue library is pretty useful in that regard). For some insight on the difference between parallelism and concurrency see: http://ghcmutterings.wordpress.com/2009/10/06/parallelism-co...
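A minimal sketch of the Queue-based approach mentioned above, assuming Python 3, where the module is named `queue` (it was `Queue` in Python 2). The squaring task and the `None` sentinel are illustrative choices. The workers share no mutable state except the two thread-safe queues.

```python
import queue
import threading


def worker(tasks, results):
    """Pull numbers off `tasks` until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            tasks.task_done()
            break
        results.put(item * item)
        tasks.task_done()


tasks = queue.Queue()
results = queue.Queue()
threads = [
    threading.Thread(target=worker, args=(tasks, results)) for _ in range(3)
]
for t in threads:
    t.start()

for n in range(10):
    tasks.put(n)
for _ in threads:
    tasks.put(None)  # one sentinel per worker so each one exits

tasks.join()  # blocks until every queued item has been marked done
for t in threads:
    t.join()

squares = sorted(results.get() for _ in range(10))
print(squares)
```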

The author mentions Twisted, and I wanted to show what it would look like to use Twisted: https://news.ycombinator.com/item?id=6220124

Even though it's old, the examples given are good for those not familiar with Python threading. IBM developerWorks has been a surprisingly decent site for assorted topics/examples over the years.

The article is from 2008, so it applies to Python 2. I don't know enough about threading to say whether the same code will work in Python 3.

It will, if you tweak a few things in the examples, like the urllib2 imports and print-as-a-statement usage. The threading API is the same, so it's still generally relevant.
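As a rough illustration of those tweaks (not the article's actual code), a Python 3 version of a threaded fetch might look like this. `urllib2.urlopen` becomes `urllib.request.urlopen` and `print` becomes a function, while the `threading` calls stay the same. The `data:` URLs are an assumption made here so the sketch runs without network access; real code would use `http://` or `https://` URLs.

```python
import threading
from urllib.request import urlopen  # Python 2: import urllib2


def fetch(url, results, lock):
    """Download one URL and store the body under its URL."""
    with urlopen(url) as response:
        body = response.read()
    with lock:
        results[url] = body
    print("fetched", url, len(body), "bytes")  # Python 2: print statement


# Placeholder URLs that resolve locally; swap in real ones as needed.
urls = ["data:text/plain,hello", "data:text/plain,threads"]
results = {}
lock = threading.Lock()

threads = [
    threading.Thread(target=fetch, args=(url, results, lock)) for url in urls
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```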

it probably predates the multiprocessing package (introduced in 2.6, October 2008, although available as a third-party package before that) and certainly doesn't mention it.

multiprocessing does use multiple cores, but is a heavyweight solution (separate, communicating processes, wrapped in a thread-like api). if you need multiple threads for cpu-related performance, it can be very useful.

python 2.6+, including 3.

http://toastdriven.com/blog/2008/nov/11/brief-introduction-m...

http://docs.python.org/2/library/multiprocessing.html