by tonyarkles 3643 days ago
Some experiments I did during my coursework for my M.Sc. indicated that it was actually worse than useless at the time (2009?). Here's what would happen:

- When you've got one CPU core, the threads basically just act like a multiplexer. One thread runs for a while, releases the GIL, and the next thread runs for a while. Not a big deal.

- When you've got multiple CPU cores, you've got a thundering herd. When the lock is released, the threads waiting on all of the other cores all try to acquire the lock at the same time. Then after one thread has run on, say, core 3, it's gone and invalidated the cache on the other cores (mark & sweep hurts caches pretty badly). The thundering herd stampedes again and the process continues.

- To make matters even worse, each core runs at low utilization (e.g. a quad core machine, each core runs at ~25%). If you've got CPU throttling turned on (which my laptop, where I started the experiments, did), then the system detects that the CPU load is low and scales down the clock speed. Normally, this would result in increased CPU utilization, which would speed the CPUs back up again. Unfortunately, the per-core utilization stays pegged at 25% and things never speed back up again. The system looks at it and says "huh! only 25%! I guess we've got the CPU speed set properly!"
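
The contention described in the bullets above can be probed with a small benchmark. This is a minimal sketch, not the original experiment: the function names and the iteration count `N` are made up for illustration, and the numbers will depend heavily on the interpreter version, the core count, and CPU frequency scaling.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop; it needs the GIL for every step."""
    while n > 0:
        n -= 1


def run_sequential(n, workers):
    """Do the work `workers` times, one after another, in the calling thread."""
    start = time.perf_counter()
    for _ in range(workers):
        count_down(n)
    return time.perf_counter() - start


def run_threaded(n, workers):
    """Do the same total work, but with one thread per unit of work."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    N = 1_000_000  # arbitrary; raise it for more stable timings
    for workers in (1, 2, 4):
        seq = run_sequential(N, workers)
        thr = run_threaded(N, workers)
        print(f"{workers} worker(s): sequential {seq:.2f}s, "
              f"threaded {thr:.2f}s, ratio {thr / seq:.2f}")
```

A ratio well above 1 for the threaded runs would point to the kind of lock contention described here; a ratio near 1 means the threads are simply taking turns.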

Maybe it's gotten better since then? I haven't checked recently.

Edit: I wish I had the results handy. The basic conclusion was that you got something like a 1.5x slowdown per additional CPU core. That's not how it's supposed to work! Using taskset to limit a multi-threaded Python process to a single core resulted in significant speedups in the use cases I tried.
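
For reference, the `taskset` approach looks like `taskset -c 0 python script.py` from the shell (`script.py` being a placeholder). Something similar can be done from inside the process on Linux. The sketch below assumes `os.sched_setaffinity` is available, which is not the case on every platform; `pin_to_one_core` is a hypothetical helper name.

```python
import os


def pin_to_one_core(core=None):
    """Restrict the calling thread to a single CPU core (Linux only).

    Returns the set of cores allowed afterwards, or None when the
    platform does not expose os.sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)
    if core is None:
        core = min(allowed)  # pick a core we are already allowed to use
    os.sched_setaffinity(0, {core})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print("allowed cores after pinning:", pin_to_one_core())
```

On Linux the affinity applies to the calling thread and is inherited by threads started afterwards, so this should be called before any worker threads are created.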


This is a known issue in Python 2. On a multi-core machine, a multi-threaded Python 2 program can run ~1.8x slower (depending on the workload) than it does on a single-core machine. Python 3.2 ships a new GIL[0] that fixes the problem.
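
One visible part of that change: Python 3.2 replaced the tick-based check interval (`sys.setcheckinterval`) with a time-based switch interval that can be read and tuned through `sys.getswitchinterval()` and `sys.setswitchinterval()`. A minimal sketch of inspecting and temporarily changing it follows; `with_switch_interval` is a hypothetical helper, and 0.01 is an arbitrary value, not a recommendation.

```python
import sys


def with_switch_interval(seconds, func, *args):
    """Call func(*args) under a temporary switch interval, then restore it."""
    previous = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        return func(*args)
    finally:
        sys.setswitchinterval(previous)


if __name__ == "__main__":
    print("current switch interval:", sys.getswitchinterval())
    print("inside helper:", with_switch_interval(0.01, sys.getswitchinterval))
```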

[0] http://www.dabeaz.com/python/NewGIL.pdf

Dave Beazley! Yes, it was some of his work that inspired my research. Thanks for the reminder!

Edit: That's a beautiful solution to the problem, too. You're still not going to get a performance boost from multiple cores, but you're not going to have it fall flat on its face either.

Sounds interesting. I'd be glad to see a blog post with the actual use cases and a few graphs for different numbers of cores, and maybe the sources so people can go further.

I'll try to dig it up. I suspect it's sitting in an SVN repo somewhere...

If I recall, I took a stock Python interpreter and instrumented it with RDTSC instructions to do lightweight timestamps on GIL acquisitions and releases.
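
A rough Python-level analogue of that idea, for anyone who wants to experiment without patching the interpreter: wrap an ordinary `threading.Lock` and timestamp each acquisition and release. This is only a sketch under obvious assumptions. It measures a user-level lock, not the GIL itself, it uses `time.perf_counter_ns()` rather than RDTSC, and `TimedLock` and `worker` are made-up names.

```python
import threading
import time


class TimedLock:
    """A threading.Lock wrapper that records wait and hold times in nanoseconds."""

    def __init__(self):
        self._lock = threading.Lock()
        self.waits_ns = []  # time spent waiting to acquire the lock
        self.holds_ns = []  # time spent holding the lock
        self._acquired_at = 0

    def __enter__(self):
        requested = time.perf_counter_ns()
        self._lock.acquire()
        # Everything below runs while holding the lock, so the lists are safe.
        self._acquired_at = time.perf_counter_ns()
        self.waits_ns.append(self._acquired_at - requested)
        return self

    def __exit__(self, exc_type, exc, tb):
        self.holds_ns.append(time.perf_counter_ns() - self._acquired_at)
        self._lock.release()
        return False


def worker(lock, iterations):
    """Repeatedly take the lock and do a tiny amount of work under it."""
    for _ in range(iterations):
        with lock:
            sum(range(100))


if __name__ == "__main__":
    lock = TimedLock()
    threads = [threading.Thread(target=worker, args=(lock, 1000)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("acquisitions:", len(lock.waits_ns))
    print("mean wait (ns):", sum(lock.waits_ns) / len(lock.waits_ns))
    print("mean hold (ns):", sum(lock.holds_ns) / len(lock.holds_ns))
```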