Hacker News new | ask | show | jobs
by wskinner 2021 days ago
I learned the same thing while writing a log structured merge tree. Single threaded writes are a must - not only for performance but also simplicity of implementation.

I'm curious what about your use required implementing your own storage subsystem rather than using an embedded key value store like RocksDB.

3 comments

Came to a similar conclusion back in the days when writing a raytracer, and it stopped scaling past 8 or so cores.

Ended up with a system where each thread accumulated results in small buffers, appended pointers to those buffers to a shared "buffer list" which was very fast due to low contention using typical spinlock+mutex combo.

The thread that overflowed the buffer list would then become the single writer by taking on the responsibility to accumulate the results to the shared output image. It would start by swapping in a fresh list, so the other threads could carry on.

The system would self-tune by regulating the size of the shared buffer list so that the other threads could keep working while the one "writer thread" accumulated.

Probably had room for improvement, but after this change it scaled almost linearly to a least 32 cores, which was the largest system available for testing at the time.

The reason for not simply allocating a full output image per thread and accumulate post-render was mainly due to the memory requirements for large output images.

RocksDB has two big limitations that preclude its use for many types of high-performance data infrastructure (which it sounds like the OP's use case was). First, its throughput performance is much worse (integer factor) than what can be achieved with a different design for some applications. Second, it isn't designed to work well for very large storage volumes. Again, easy to remedy if you design your own storage engine or use an alternative one. There are storage engines that will happily drive a petabyte of storage across a large array of NVMe devices at the theoretical limits of the hardware, though not so much in open source.

Another thing to consider is that you lose significant performance in a few different dimensions if your storage I/O scheduler design is not tightly coupled to your execution scheduler design. While it requires writing more code it also eliminates a bunch of rough edges. This alone is the reason many database-y applications write their own storage engines. For people that do it for a living, writing an excellent custom storage engine isn't that onerous.

RocksDB is a fine choice for applications where performance and scale are not paramount or your hardware is limited. On large servers with hefty workloads, you'll probably want to use something else.

agreed w/ andrew. rocksdb is pretty heavy. for streaming logs, something much much simpler yields significant performance improvements specially when tied to the IO+CPU priority scheduling.
What would be some of your choices?
This is a hard question because everything is really dependent on your threading model. One has to start w/ the threading model. At the time I wrote the first line of code in Jan 2019, there wasn't anything that was really amenable to the seastar::future<> / task-based scheduler with truly async IO (enforced by a reactor stall if greater than 500 micros).... so we wrote our own from scratch .... in fact we wrote it many times over, the first version attempted to use flatbuffers atop my old project - https://github.com/smfrpc/smf but the linearization of buffers proved too costly for long running processes which led to the fragmented buffer approach in the blog post mentioned.
Extreme performance (namely, low latency under heavy load) is the principal requirement. I have yet to see anything that can touch my approach, especially under mixed read/write workloads. I am able to write (small, <4k) business entities to disk at a rate higher than the drives themselves are able to write blocks. This is not something that is feasible in any multithreaded storage architecture.

A secondary requirement is extreme simplicity and safety. My entire implementation is written in managed code and can be understood by a junior developer in one weekend. There is not a single line of code in support of a database feature that we aren't actually going to use.

The final requirement is zero external cost to employ this code. If I own my database implementation, Oracle cannot bill me.

The nice-to-have is being able to follow a breakpoint all the way from user tapping a button down into the b-tree rotation condition logic in the database engine. It also makes profiling performance issues a trivial affair. I like being able to see the actual code in my database engine that is causing a hotpath. This visibility is where additional innovation is possible over time.