The locks are becoming more fine-grained in memcached [1], so that should be less of a problem now.
It is possible to remove lock contention on the read path [2] if a concurrent hash table is used. This can be done while using an O(1) eviction policy that outperforms LRU [3].
NovaX: thanks for the interesting references. The question is whether it is worth it for memcached to avoid the global lock on the hash table, given the number of cores that currently deployed machines have. I would expect to see very little contention. The concurrent hash table looks like a good idea for memcached; a mutex per key would almost certainly be overkill in terms of memory usage. I'll read the links you provided carefully, thank you.
As you said elsewhere, network I/O is the primary bottleneck. There are a lot of different hash table designs (so per-key locks are not required), but fine-grained locking of the table/LRU is probably enough. Since an in-app cache has a different perf profile, the latter two links summarize my work.
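To make the "fine-grained locking without per-key locks" point concrete, here is a minimal sketch of lock striping in Python. It is not memcached's actual implementation: the class name `StripedCache`, the `num_stripes` parameter, and the use of `crc32` as the hash are all assumptions made for illustration, and the sketch leaves out the LRU side entirely.

```python
import threading
from zlib import crc32


class StripedCache:
    """Hypothetical hash table guarded by a fixed pool of locks (lock striping)."""

    def __init__(self, num_stripes=16):
        # One lock and one dict per stripe, so lock memory grows with
        # num_stripes rather than with the number of keys.
        self._locks = [threading.Lock() for _ in range(num_stripes)]
        self._tables = [{} for _ in range(num_stripes)]

    def _stripe(self, key):
        # crc32 is only a stand-in hash; any well-distributed hash would do.
        return crc32(key.encode("utf-8")) % len(self._locks)

    def get(self, key, default=None):
        i = self._stripe(key)
        with self._locks[i]:
            return self._tables[i].get(key, default)

    def set(self, key, value):
        i = self._stripe(key)
        with self._locks[i]:
            self._tables[i][key] = value

    def incr(self, key, delta=1):
        # Read-modify-write under the stripe lock, so concurrent
        # increments of the same key are not lost.
        i = self._stripe(key)
        with self._locks[i]:
            value = self._tables[i].get(key, 0) + delta
            self._tables[i][key] = value
            return value
```

Two threads only contend when their keys hash to the same stripe, so contention falls roughly in proportion to the stripe count, while the number of locks stays fixed no matter how many keys are stored.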