
by bob1029 2021 days ago

More threads (i.e. shared state) is a huge mistake if you are trying to maintain a storage subsystem with synchronous access semantics.

I am starting to think you can handle all storage requests for a single logical node on just one core/thread. I have been pushing 5~10 million JSON-serialized entities to disk per second with a single managed thread in .NET Core (using a Samsung 970 Pro for testing). This includes indexing and sequential integer key assignment. This testing will completely saturate the drive (over 1 gigabyte per second steady-state). By comparison, just incrementing a 64-bit integer more than a million times per second across an arbitrary number of threads is a big ask. This is the difference you can see when you double down on single-threaded ideology for this type of problem domain.

The technical trick to my success is to run all of the database operations in micro batches (10~1000 microseconds per batch). I use LMAX Disruptor, so the batches are formed naturally based on throughput conditions. Selecting data structures and algorithms that work well in this type of setup is critical. Append-only is a must with flash and makes orders of magnitude difference in performance. Everything else (B-tree algorithms, etc.) follows from this realization.
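To make the micro-batching idea concrete, here is a minimal Python sketch, not the .NET/Disruptor setup described above. It assumes a plain `queue.Queue` as a stand-in for the ring buffer, and the names `run_writer`, `requests`, `log`, `index`, and `max_batch` are all hypothetical. The single writer blocks for one request, then takes whatever else is already queued, so the batch size follows the load rather than a timer.

```python
import json
import queue


def run_writer(requests, log, index, max_batch=1000):
    """Single writer loop: drain what is queued and append it as one batch.

    requests: queue.Queue of JSON-serializable entities; None means shut down
              (a convention of this sketch only).
    log:      binary file-like object, only ever appended to.
    index:    dict mapping sequential integer key -> byte offset in the log.
    Returns the list of batch sizes, purely so the behaviour can be inspected.
    """
    next_key = 0
    batch_sizes = []
    while True:
        item = requests.get()              # block until at least one request
        if item is None:
            return batch_sizes
        batch = [item]
        stop = False
        # Take whatever else is already waiting: the batch forms "naturally".
        while len(batch) < max_batch:
            try:
                item = requests.get_nowait()
            except queue.Empty:
                break
            if item is None:
                stop = True
                break
            batch.append(item)
        # Key assignment and indexing need no locks: only this thread runs them.
        buf = bytearray()
        base = log.tell()
        for entity in batch:
            index[next_key] = base + len(buf)
            buf += json.dumps(entity).encode() + b"\n"
            next_key += 1
        log.write(buf)                     # one append per batch
        batch_sizes.append(len(batch))
        if stop:
            return batch_sizes
```

Producers on other threads would only ever call `requests.put(entity)`; the key counter, the index, and the log are touched by the writer alone, which is why none of them needs synchronization.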

Put another way, if you find yourself using Task or async/await primitives when trying to talk to something as fast as NVMe flash, you need to rethink your approach. The overhead of multiple threads, task parallel abstractions, et al. is going to cripple any notion of high throughput in a synchronous storage domain.

3 comments

agallego 2021 days ago

Indeed. I think you hit different saturation points as the range of use cases widens. One example w/ a single core (which btw, I agree with wholeheartedly for IO) is checksumming + decoding.

For Kafka, we have multiple indexes: a time index and an offset index, which are simple metadata. The trouble is how you handle decompression + checksumming + compression to support compacted topics. ( https://github.com/vectorizedio/redpanda/blob/dev/src/v/stor... )

So a single core starts to get saturated while doing both foreground and background requests.

.....

Now assume that you handle that with correct priorities for IO and CPU scheduling.... the next bottleneck will be keeping up w/ background tasks.

So then you start to add more threads. But as you mentioned, and as I tried to highlight in that article, implicit or simple synchronization is very expensive (as noted by your intuition).

The thread-per-core buffer management with defer destructors is really handy at doing 3 things explicitly:

1. Your cross-core communication is explicit. That is, you give it shares as part of a quota, so you understand how your priorities are working across the system for any kind of workload. This is helpful for prioritizing foreground and background work.

2. There is effectively a const memory address once you parse it, so you treat it as largely immutable and you can add hooks (say, crash if modified on a remote core).

3. It makes memory accounting fast. I.e., instead of pushing a global barrier for the allocator, you simply send a message back to the originating core for allocator accounting. This becomes hugely important as you start to increase the number of cores.
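Points 2 and 3 can be sketched in a few lines of Python. This is a toy single-process model and not the Redpanda or Seastar API: `Core`, `Buffer`, `inbox`, and `drain_inbox` are hypothetical names, and a `deque` stands in for a cross-core message queue. The idea shown is that a buffer remembers its originating core, and a release from any other core becomes a message to the owner rather than a write to shared allocator state.

```python
from collections import deque


class Core:
    """Models one core that owns its own allocator accounting."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.bytes_in_use = 0    # only ever modified by this core
        self.inbox = deque()     # release messages sent by other cores

    def allocate(self, size):
        self.bytes_in_use += size
        return Buffer(owner=self, size=size)

    def drain_inbox(self):
        # Runs on the owning core, e.g. once per event-loop iteration.
        while self.inbox:
            self.bytes_in_use -= self.inbox.popleft()


class Buffer:
    """Read-only payload that remembers which core allocated it."""

    def __init__(self, owner, size):
        self.owner = owner
        self.size = size
        self.data = bytes(size)  # bytes objects are immutable in Python

    def release(self, current_core):
        if current_core is self.owner:
            # Local release: adjust the counter directly.
            self.owner.bytes_in_use -= self.size
        else:
            # Remote release: no shared counter and no global barrier,
            # just a message back to the originating core.
            self.owner.inbox.append(self.size)
```

For example, if core 0 allocates a 4096-byte buffer and core 1 releases it, core 0's `bytes_in_use` stays at 4096 until core 0 itself calls `drain_inbox()`, at which point it drops to 0.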