Some experiments I did during my coursework for my M.Sc. indicated that it was actually worse than useless at the time (2009?). Here's what would happen:

- When you've got one CPU core, the threads basically just act like a multiplexer. One thread runs for a while, releases the GIL, and the next thread runs for a while. Not a big deal.
- When you've got multiple CPU cores, you've got a thundering herd. When the lock is released, the threads waiting on all of the other cores try to acquire it at the same time. Then, after one thread has run on, say, core 3, it has invalidated the cache on the other cores (mark & sweep hurts caches pretty badly). The thundering herd stampedes again and the process continues.
- To make matters even worse, each core runs at low utilization (e.g. on a quad-core machine, each core runs at ~25%). If you've got CPU throttling turned on (which my laptop, where I started the experiments, did), then the system detects that the CPU load is low and scales down the clock speed. Normally, this would result in increased CPU utilization, which would speed the CPUs back up again. Unfortunately, the per-core utilization stays pegged at 25% and things never speed back up. The system looks at it and says "huh! only 25%! I guess we've got the CPU speed set properly!"

Maybe it's gotten better since then? I haven't checked recently.

Edit: I wish I had the results handy. The basic conclusion was that you got something like a 1.5x slowdown per additional CPU core. That's not how it's supposed to work! Using taskset to limit a multi-threaded Python process to a single core resulted in significant speedups in the use cases I tried.
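For anyone who wants to try this on their own machine, here is a minimal sketch of the kind of measurement described above. It is not the original experiment: the `countdown` workload, the function names, and the iteration counts are all assumptions picked for illustration. It times the same CPU-bound work run sequentially and then split across threads.

```python
import threading
import time


def countdown(n):
    # Pure-Python CPU-bound loop; the running thread holds the GIL.
    while n > 0:
        n -= 1


def run_sequential(n, workers):
    # Run the workload `workers` times back to back in the main thread.
    start = time.perf_counter()
    for _ in range(workers):
        countdown(n)
    return time.perf_counter() - start


def run_threaded(n, workers):
    # Run the same total work, one thread per unit of work.
    threads = [
        threading.Thread(target=countdown, args=(n,)) for _ in range(workers)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    N = 2_000_000  # arbitrary size, adjust for your machine
    WORKERS = 4    # assumed to match a quad-core machine
    print(f"sequential: {run_sequential(N, WORKERS):.2f}s")
    print(f"threaded:   {run_threaded(N, WORKERS):.2f}s")
```

Saving this as, say, `gil_bench.py` (a made-up name) and running it once normally and once pinned with `taskset -c 0 python gil_bench.py` on Linux gives a rough comparison between the multi-core and single-core cases. The numbers will depend heavily on the interpreter version, the OS scheduler, and CPU frequency scaling, so treat them as indicative only.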
[0] http://www.dabeaz.com/python/NewGIL.pdf