That's the problem: the kinds of issues the GIL prevents are notoriously hard to reproduce and debug. It would probably pass 90% of the test suites out there, unless you wrote yours with multi-threading in mind. Same for empirical tests by QA teams - they rarely put enough load on the system for the bugs to surface. I might be too pessimistic here, but this really looks to me like a disaster waiting to happen...
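
To illustrate what a test "written with multi-threading in mind" could look like, here is a minimal sketch. The names `SafeCounter` and `hammer` are hypothetical and not from any real project. The idea is to call a shared object from several threads at once and check the final state, which an ordinary single-threaded test never does. Even a test like this can pass by luck when the code under test is broken, so it is an illustration rather than a guarantee.

```python
import threading


class SafeCounter:
    """Hypothetical shared object; the lock makes each increment atomic."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        with self._lock:
            self.value += 1


def hammer(counter, n_threads=8, n_iters=10_000):
    """Call counter.increment() from many threads at once; return the final value."""
    # The barrier releases all workers together to maximize contention.
    barrier = threading.Barrier(n_threads)

    def worker():
        barrier.wait()
        for _ in range(n_iters):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    # With the lock in place, no increment should be lost.
    assert hammer(SafeCounter()) == 8 * 10_000
```

Swapping in an object without the lock and running `hammer` repeatedly under load is one way to go looking for lost updates, though a clean run proves nothing.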