Hacker News new | ask | show | jobs
Presentation on the new Python GIL (dabeaz.com)
77 points by voxcogitatio 5999 days ago
4 comments

I have a slightly weird and probably completely unworkable "idea" that has nothing to do with this particular GIL improvement but with the broader GIL issue.

The problem with removing the GIL seems to be garbage collection. I wonder why it's not possible to introduce a new type of object reference that exempts referenced objects from being garbage collected.

Then multiple Python interpreters could be started in separate threads, each with its own unmodified GIL, and the only thing they could access would be their own thread local data and those special shared objects.

What this amounts to is basically an implementation of a multi process architecture on top of a multi threaded architecture. The crucial difference is that the memory shared among the interpreters could hold pointers and thus proper in memory data structures and not just a BLOB into which everything has to be serialized as in the case of conventional shared memory.

Of course the shared objects would have to be manually deleted.

One issue I see is that when such a special object is created, all the objects it creates recursively would also have to be allocated in that pool of special objects. But I think it should be possible to use some kind of global flag to special case the allocator.

Well, there is probably a huge number of issues with this kind of trickery and it's definately not a long term solution. But I'd love to use Python much more than I do and the GIL issue is what prevents that for me at the moment.

> The problem with removing the GIL seems to be garbage collection.

I'm not sure this is the real problem; after all, reference counts can be atomically incremented and decremented.

The real problem is that primitive operations in Python (like "foo.bar") cannot safely be performed in C without locking, because you need the hash table to remain consistent while you are doing lookups and/or insertions. This forces you to either wrap all such operations in locks (which has been tried, and slows down the single-threaded case by something like 2x) or reimplement them with lock-free data structures. The latter could be an interesting experiment; you could probably implement a tree-based map using RCU.

As does perl. You need to declare variables as shared, there is apparently some performance hit, but it's mostly localized to the shared variables, so reasonably manageable. I've actually found it easier to work with than implicit shared data across all threads because you actually need to think about the exposed surface area and how you can minimize it.
People write (http://www.julmar.com/blog/mark/PermaLink,guid,3670d081-0276...) that even .NET itself has a similar kind of GC (partially concurrent, only address fixup stops the world). Unfortunately I can't check how it compares to Singularity.
I'm pretty sure someone built a Python like that, but it did not work out. Someone whose name I don't recall from the #python freenode channel. I think perhaps C module were one of the issues, or sharing of global scopes (which then need fine-grained locks to synchronized access).

BTW, there's one multi-processing shared objects module here: http://poshmodule.sourceforge.net/

Knowing little of Python, is it the case that the GIL was not initially deemed a bad scheme because multi-processor x86 machines were a rarity in the mid-90's (when Python was young)? I presume that Sun desired Java to be multiprocessor friendly because it had multi-cpu SPARC machines at the time of Java's design?

And, for those using Python for high-volume production systems, how do you cope? Do you run one process (say a Zope instance) per core and load balance?

Keep in mind the GIL is only a problem when you a) use threads and b) perform CPU-bound operations (unless you need soft real time)

When writing CPU-bound code, there are many libraries like NumPy that can do your heavy lifting in native code (thus avoiding the GIL)

So it really isn't as bad as it seems. The reason this patch is good, is because it prevents performance from falling apart if one of your threads takes up too much CPU.

Yep, a process per core.

The beauty is that since your app has to be written to be multiprocess capable then you can also extend past one computer when the need arises.

You pay a memory tax overhead though.

"""Does it work? Yes it's better: Sequential 23.5 seconds. Threaded 24.0 seconds."""

How is that better? It is an improvement over earlier versions of Python. Better would be if the threaded version would actually be twice as fast.

If you look at the code he used to test this, its completely CPU-bound, meaning that it requires the GIL to run. And since the GIL is, well, a "GIL," that means that only one can run at any single time.

Therefore, the expected behavior of running these two threads would be to have identical runtimes. That would be the "perfect" case with a GIL. So with this "better" GIL its actually near-perfect in this _simple_ case. (Like the presenter said, there needs to be more tests to study the behavior under heavier tasks)

What you're describing, with the threaded being "twice as fast" assumes that the threads will allow two python-instruction-bound tasks to run concurrently, which would require having no GIL.

An improvement over the old implementation is better. "Fixed and Perfect" would be free-threading, where it would be twice as fast with threads.

I'll take incremental, concrete improvements anytime.

If you really want to speed up your program, you have to parallelize and distribute the workload over multiple cores. Python has a module in the standard library for this, it's called 'multiprocessing'. See http://docs.python.org/library/multiprocessing.html
It won't help if you have to do stuff over a single shared data structure.
Well you can of course. It is just a pain to do it.
That's a good point. I rarely had to do it and, when I could, I opted to distribute the data.

Still, Python needs better threading.

From Slide 7: Sequential (Single Core) : 24.6s

Threaded (Dual Core) : 45.5s

Threaded (With a CPU core diabled) : 38.0s

23.5 / 24.0 is an improvement to both tests - and a significant improvement to the Threaded one.

Does anyone know why multiple kernel threads are used at all? If only one thread can run at a time, why not just use a single kernel thread?
Because CPython doesn't want to invent it's own scheduler (and various other things that your operating system already does for you), further multiple threads can be active at the same time if they're doing any operations that release the GIL (such as I/O), Python can't know whether your thread will do any I/O before you create it.
However, the approach explained in this presentation does just that. The solution looks like a crude process operating systems scheduler.
It's not really, it's simply threads asking the main thread to drop every 5 ms, and then relying on the OS to schedule them. If Python had it's own scheduler it would do things like assigning priorities and actually deciding which thread would take over when the GIL was dropped.
Just to expand on the previous reply: a lot of C libraries have a blocking model where you call a function that blocks until it is complete. The simplest example is fread(). If the interpreter doesn't support multiple threads, you have to do a lot of gymnastics to make sure you never block the interpreter in an extension.

It can be done (this is how Ruby pre-1.9 works), but people complain about it a lot. Matz's rationale for why he added native OS threads to Ruby 1.9 was "people seem to like them."

In response to my fread() example you may be tempted to say that you can just put the underlying fd in non-blocking mode with fcntl(2). This is unfortunately unsafe (as I recall), though I cannot find the reference right now. Accessing the fd directly with read() and write() is non-blocking mode is of course safe, but then you have to do your own buffering.

Also, while poll() and select() give you most of what you need to make your I/O non-blocking, I recall a case where a pipe will be write-ready, but if you try to write too much data at the same time the write will block anyway, instead of returning a short count of bytes written. This was on RHEL3 Linux, so things may have changed since then.

I recall a case where a pipe will be write-ready, but if you try to write too much data at the same time the write will block anyway, instead of returning a short count of bytes written

That just sounds like a bug.

No, it's the specified behavior of a Unix pipe or socket[1]. If you want it not to block, you should set the O_NONBLOCK flag using fcntl(). Being "write ready" just means that it can take at least one byte in a buffer; it can't possibly be a promise not to block if you feed it a 2G buffer.

[1] But not disk files. Local disk storage isn't considered "blocking" if it simply needs to do a disk seek before returning the data. Most unix geeks end up finding this out experimentally at some point in their careers and using strong language. No, I don't know what they were thinking either...

In the SuS description of poll(), it says that POLLOUT means "Normal data may be written without blocking." However it does not say how much data. It is unexpected that a write() call would block instead of returning a short count; for example, if you feed it a 2G buffer, it could return 4096 indicating that only 4k of the 2G was actually written.
It could, but it doesn't and never has. The standard was written to admit a broader set of behavior than real systems actually implement. And in this case, there's absolutely nothing "surprising" to my eyes about a blocking (!) socket blocking on overflow.

Again: there is already a mechanism in place to support non-blocking behavior, and it's not this one.

If you want it not to block, you should set the O_NONBLOCK flag using fcntl().

Certainly, I had assumed that the OP was talking about poll over non-blocking pipes.