Hacker News new | ask | show | jobs
by Znafon 1038 days ago
Yes this is a thing was trivially protected by the GIL. There is the same thing with mutating the same map concurrently in Go that will panic for example.

PEP 703 goes over this in the "Container Thread-Safety" (I think container here refers to the fact that the object has references to other objects, this is the things that already are special-cased in CPython to be managed specifically by the garbage collector):

> This PEP proposes using per-object locks to provide many of the same protections that the GIL provides. For example, every list, dictionary, and set will have an associated lightweight lock. All operations that modify the object must hold the object’s lock. Most operations that read from the object should acquire the object’s lock as well; the few read operations that can proceed without holding a lock are described below.

More information at https://peps.python.org/pep-0703/#container-thread-safety

2 comments

Note that concurrency of containers not specific to Python any way.

For example, Java implements different versions of containers for single thread and multithread usage, because multithreaded containers have obvious performance penalty

https://docs.oracle.com/javase/8/docs/api/java/util/concurre...

> For example, Java implements different versions of containers for single thread and multithread usage, because multithreaded containers have obvious performance penalty.

Very few codebases in Java are single-threaded; specialty frameworks like Netty are the exception not the rule.

Likewise, there are not different containers for single threaded and multithreaded usage; there are containers that have different strategies for dealing with multi-threaded usage.

Hashtable is the oldest and is notable for it still being present and still being fundamentally flawed. It will lock reads and writes, but does not describe a way to lock for a transaction - e.g. changes based on read data. As such, it fundamentally has race conditions you can't protect against.

Hashmap and the rest of the Java 1.2 collections API set a slightly better pattern - they don't internally try to maintain safety, but provide mechanisms like synchronizedMap() to let you hold the monitor for the length of your transaction.

However, this could only be so good, because the monitors in Java are pretty fundamentally broken as well. A monitor is both part of its public API ( e.g. "synchronized(foo) {...}" ) and part of its implementation (e.g. public synchronized void bar() { ... })". This means that external code can affect your internal operation if you leverage the monitor that you get by default through your "this" instance.

As such, synchronized set involves three monitors:

1. The monitor on the interface-implementing collection type itself, e.g. on the HashMap. This likely is never used.

2. The monitor on the object returned by the 'synchronizedXXX' wrapping method. This is used to protect transactional access, such as iterating through while removing items.

3. The monitor used as a mutex inside the object returned by the 'syncronizedXXX' wrapping method, protect the integrity of the collection data type if used by multithreaded code which does not hold monitor #2. The code may have a race condition, but it won't put the collection itself into an inconsistent structural state.

The 'synchronizedXXX'-returned wrapper objects are pretty expensive, and if you can you should just internalize those collections into business object that does any needed syncronization itself.

ConcurrentHashMap and the like are lockless, and are built with the idea that you can perform the changes needed through atomic operations rather than transactions. This isn't always true, but often is.

For a collection which is always held by a single thread, the atomic operation overhead may still cause a performance impact - after all, the atomic operations are still processor state synchronization points. It is also possible to beat ConcurrentHashMap with regular HashMap on certain usage in multithreaded environments, when you are properly protecting access to the HashMap yourself.

It might be challenging to find scenarios where ConcurrentHashMap doesn't beat the 'synchronizedMap()' wrapper, just because the implementation itself is really expensive.

Thank you! That makes sense, and it also explains why removing the GIL has a negative performance impact as discussed in other comments. Taking a lock every time a container is accessed is significant overhead, which is why languages like C++ don't make basic containers thread-safe.
Effectively the GIL is incurring that overhead on every data structure whether you need it or not.

But with it removed, you'll have to think about it in your designs more than currently. History shows us this is not easy.

> Effectively the GIL is incurring that overhead on every data structure whether you need it or not.

Not really. The GIL is taken and released quite infrequently (only when the Python interpreter decides it's time to do a context switch), whilst the new locks for each data structure are taken/released every time you do a basic operation on those data structures.

Holding a lock that is rarely taken/released incurs very little overhead.

Hm, that’s a fair point, it’s only some of the same overhead because it’s not per operation.