I mean, only the threaded version, which is expected. For tons of cases Python without the GIL is not just slower, but significantly slower; "somewhere from 30-50%" according to one of the people working on this: https://news.ycombinator.com/item?id=40949628
All of this is why the GIL wasn't removed 20 years ago. There are real trade-offs here.
Thanks for the link, that's an interesting read. Actually the referenced PyMutex is a good old pthread_mutex_t, the same you'd use in C or C++. But I shouldn't have written so surely. Although uncontested locks are very fast, if the loop is tight enough, adding locks will be significant.
However, PEP 703 specifically points out that performance-critical container operations (__getitem__/iteration) avoid locking, so I'm still highly skeptical that those locks are the cause of the 30-50%.
One of the things this spends some time on that was already obsolete in 2011 is using a pool of locks. In 1994 locks are a limited OS resource, Python can't afford to sprinkle millions of them in the codebase. But long before 2011 Linux had the futex, so locks only need to be aligned 32-bit integers. In 2012 Windows gets a similar feature but it can do bytes instead of 32-bit integers if you want.
If a Linux process wants a million locks that's fine, that's just 4MB of RAM now.
All of this is why the GIL wasn't removed 20 years ago. There are real trade-offs here.