| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by daniele_dll 1373 days ago

A bit from everything: the hashtable leverages the CPU caches and the cachelines, it also uses SIMD instructions to search very quickly the hash, once a bucket in the hashtable is identified, only the relevant bits in the hash are compared, on top of this there is the custom memory allocator, fibers and io_uring is of great help network side, etc.

The multithreading helps to scale the access to the hashtable vertically almost linearly.

The use of the fibers make everything easier because cachegrand doesn't have multiple threads running on the same core fighting for the same resources.

On top of this (and thanks to not having multiple threads fighting for the same resources on the same core), the hashtable uses user-space spinlocks (that are "fake" spinlocks as they can't prevent the kernel to preemept, although there are ways to do it) and the "contention" is spread all across the hashtable. The spinlocks though are only used to change the hashtable (with UPSERT, DEL and Read-Modify-Write operations), they are not required to read from it.

Let's say that thread A, B and C, want to access key1, key2 and key3.

If you an hashtable of 10000 buckets, cachegrand will basically split it in chunks of 14 elements and each chunk will have its own localized lock, so there will be about 714 different possible locks. In general, unless you are writing always to the same key, the writes will be distribuited across all the hashtable hitting different locks so it will not be necessary to wait.

Of course I considered different approches a few years ago when I was researching the hashtable including sharding it across multiple threads and let a thread own a slice of it but decided not to go down this way because instead of having 714 locks, like the case above, you will need to use queues to pass messages between the threads (a bit like Scala does) and "focus" the contention on limited number of memory areas (e.g. if you have 4 threads you will have 4 queues and focus the contention on these 4 queues which is bad). Of course you can use atomic operations to push and pop elements from these queues but it will not be able to scale up as the contention will make this system tremble under heavy load.

cachegrand, thanks to the localized spinlocks, can hit 60 million GET and 27 million SET per second on hardware used for benchmarking.