by khuey
21 days ago
The "CPU mutex" is just the cache coherency mechanism. If you shard your data to avoid triggering it, as suggested, then yes, it's much faster. EDIT: or maybe you're asking whether introducing an explicit userspace mutex is better than a lockless algorithm with false sharing issues. The answer is that it's workload dependent, but it definitely can be.
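To make "shard your data" concrete, here is a rough sketch in Rust. It is not anyone's actual code from the thread: the names `PaddedCounter`, `sharded_count` and `shared_count` are made up for illustration, and the 64-byte alignment assumes the cacheline size quoted in the thread (some CPUs use a different size). Each thread gets its own counter on its own cacheline, so the coherency protocol has nothing to bounce between cores until the final sum.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Assumption: 64-byte cachelines, as quoted in the thread.
// The alignment also rounds the struct size up to 64 bytes,
// so adjacent shards in a Vec never share a cacheline.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

// Hypothetical helper: every thread increments only its own shard.
fn sharded_count(threads: usize, increments_per_thread: u64) -> u64 {
    let shards: Vec<PaddedCounter> = (0..threads)
        .map(|_| PaddedCounter(AtomicU64::new(0)))
        .collect();

    thread::scope(|s| {
        for shard in &shards {
            s.spawn(move || {
                for _ in 0..increments_per_thread {
                    shard.0.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });

    // The shards are only combined once, after all threads have finished.
    shards.iter().map(|c| c.0.load(Ordering::Relaxed)).sum()
}

// For contrast: every thread hammers the same cacheline.
// The result is identical, but each increment contends for the line.
fn shared_count(threads: usize, increments_per_thread: u64) -> u64 {
    let counter = AtomicU64::new(0);

    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..increments_per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });

    counter.load(Ordering::Relaxed)
}
```

Both functions return the same total, e.g. `sharded_count(4, 10_000)` and `shared_count(4, 10_000)` both give 40000. How much faster the sharded version is depends on the hardware and the workload, so time it yourself rather than trusting a number from me.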
|
OP > The issue is this will likely go just as slow if not slower. The mere act of sharing the same 64-byte region of memory (a.k.a. cacheline) between multiple cores, causes the CPU internally to basically use a mutex, and chances are the CPU's internal mutexes aren't as good as the ones you've implemented in userspace.
The OP's claim is that "chances are" userspace mutexes are better than the CPU's internal mutexes. So either the h/w folks are (for once) lagging behind the s/w folks and using outdated approaches to building a mutex in hardware, OR we are somehow forced to use an inferior approach when implementing a mutex in a CPU, OR ...
How is it possible that a hardware implementation of an algorithm could be slower than its software variant, and in "userspace" at that, not even in the kernel?