| HN Mirror


	by tschellenbach 2709 days ago
	Our in-house DB at Stream also runs on top of RocksDB + Raft. It's amazing just how much faster it is than anything else out there (especially compared to Cassandra). Instagram uses RocksDB as storage for Cassandra, and LinkedIn and Pinterest use RocksDB as well. As soon as you have the time to build your own DB using RocksDB, you get really fine-grained control over performance. https://stackshare.io/stream/stream-and-go-news-feeds-for-ov...

1 comments

stingraycharles 2709 days ago

RocksDB is pretty good and we relied heavily on it at QuasarDB as well. Having said that, we are nowadays deploying more and more production setups with Levyx’ Helium, which scales better and directly integrates with the hardware.

m0zg 2709 days ago

Given that Helium appears to be proprietary, what kind of perf benefit are we talking about here?

jandrewrogers 2709 days ago

I haven't used Helium specifically, but 3-5x greater throughput would be completely believable in my experience. It is an open secret that high-performance closed source storage engines can have several times the throughput of their open source equivalents on the same hardware. High-end storage engines often have sufficient throughput to consistently saturate NVMe arrays for diverse workloads, which is not something you commonly see in open source. Consequently, it is common to see closed source storage engines for people doing high-scale sensor analytics work and similar.

The source of this performance gap is architectural. The current design of RocksDB precludes it ever being legitimately high-performance in most contexts, and most other open source storage engines use a similar design. Modern high-performance storage engines also use a common architecture, implemented with minor variations; you just don't see this architecture much in open source. I realize that few software engineers have the skillset and experience required to design a top-notch storage engine, but I am still surprised by the dearth of open source examples given the large value in closing this gap.

I rarely use open source storage engines in the systems I build for this reason. The CapEx/OpEx implications of using them are far too costly at scale. Fortunately, I have the approximately free option of using my own storage engine implementations.

ryanworl 2709 days ago

This technique is just more IO parallelism at the physical layer due to higher concurrency while submitting IO, correct? Since NVMe and new SSDs don't hit peak throughput until very high queue depths this doesn't surprise me.
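
The queue-depth point can be made concrete with Little's Law (concurrency = throughput x latency). The sketch below is a toy model, not a benchmark: the 100 microsecond latency and the 800,000 IOPS ceiling are made-up illustrative numbers, and `modeled_iops` is a hypothetical helper name.

```python
def modeled_iops(queue_depth: int, latency_s: float, device_peak_iops: float) -> float:
    """Toy model: Little's Law gives throughput = concurrency / latency,
    capped at whatever the device can physically sustain."""
    return min(queue_depth / latency_s, device_peak_iops)


# Hypothetical device parameters, chosen only for illustration.
LATENCY_S = 100e-6        # assumed per-request latency
PEAK_IOPS = 800_000.0     # assumed device ceiling

for qd in (1, 8, 32, 128, 256):
    iops = modeled_iops(qd, LATENCY_S, PEAK_IOPS)
    print(f"queue depth {qd:>3}: ~{iops:,.0f} IOPS")
```

Under these assumed numbers the device only reaches its ceiling once roughly 80 requests are in flight, which is why a submitter that keeps one or two requests outstanding leaves most of the bandwidth unused.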

jandrewrogers 2709 days ago

I/O parallelism is necessary but far from sufficient. My own designs arbitrarily allow 64 reads and 64 writes to be in-flight concurrently per core. There is no science behind that limit beyond the fact it has worked brilliantly for many years across every type of storage. But I/O parallelism won't fix terrible scheduling.
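
A minimal sketch of that kind of fixed in-flight cap, assuming an asyncio-style worker standing in for one core. The limits of 64 come from the comment above; `IoLimiter`, `fake_read`, and `main` are hypothetical names, and the sleep is a placeholder for a real device read. This only illustrates the bounding, not the scheduling the comment says matters more.

```python
import asyncio

MAX_READS_IN_FLIGHT = 64    # per-core limits quoted in the comment above
MAX_WRITES_IN_FLIGHT = 64


class IoLimiter:
    """Caps concurrent reads and writes independently for one worker."""

    def __init__(self, max_reads=MAX_READS_IN_FLIGHT, max_writes=MAX_WRITES_IN_FLIGHT):
        self._reads = asyncio.Semaphore(max_reads)
        self._writes = asyncio.Semaphore(max_writes)
        self.in_flight_reads = 0
        self.peak_reads = 0   # bookkeeping only, to observe the cap

    async def read(self, do_read):
        async with self._reads:           # waits here once the cap is reached
            self.in_flight_reads += 1
            self.peak_reads = max(self.peak_reads, self.in_flight_reads)
            try:
                return await do_read()
            finally:
                self.in_flight_reads -= 1

    async def write(self, do_write):
        async with self._writes:
            return await do_write()


async def fake_read():
    await asyncio.sleep(0.001)            # placeholder for a real device read
    return b"page"


async def main():
    limiter = IoLimiter()
    results = await asyncio.gather(*(limiter.read(fake_read) for _ in range(500)))
    return limiter.peak_reads, len(results)


if __name__ == "__main__":
    peak, count = asyncio.run(main())
    print(f"completed {count} reads, peak in flight: {peak}")
```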

A fast storage engine needs to eliminate most of the elements that will stall an execution pipeline. This means doing things like almost completely eliminating shared data structures and context switching. It also means designing your own execution and I/O scheduler to greatly reduce the various forms of stalling on memory ubiquitous in many designs. It is difficult to overstate the extent to which thoughtful schedule design can greatly improve throughput.
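
One common way to avoid shared data structures is to give each worker exclusive ownership of a slice of the keyspace and route requests to the owner. The sketch below is an assumption about what such a design can look like, not the commenter's implementation: `Shard` and `ShardedStore` are hypothetical names, Python threads stand in for pinned per-core workers, and the message queues still incur context switches that a real engine would work to avoid.

```python
import queue
import threading
import zlib


class Shard(threading.Thread):
    """One worker that exclusively owns its slice of the keyspace."""

    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()
        self._data = {}   # touched only by this thread, so it needs no lock

    def run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:               # shutdown signal
                break
            op, key, value, reply = msg
            if op == "put":
                self._data[key] = value
                reply.put(None)
            else:                         # "get"
                reply.put(self._data.get(key))


class ShardedStore:
    """Routes each key to the single shard that owns it."""

    def __init__(self, num_shards=4):
        self._shards = [Shard() for _ in range(num_shards)]
        for shard in self._shards:
            shard.start()

    def _owner(self, key: bytes) -> Shard:
        # Deterministic hash, so a key always lands on the same shard.
        return self._shards[zlib.crc32(key) % len(self._shards)]

    def put(self, key: bytes, value: bytes) -> None:
        reply = queue.Queue(maxsize=1)
        self._owner(key).inbox.put(("put", key, value, reply))
        reply.get()

    def get(self, key: bytes):
        reply = queue.Queue(maxsize=1)
        self._owner(key).inbox.put(("get", key, None, reply))
        return reply.get()

    def close(self) -> None:
        for shard in self._shards:
            shard.inbox.put(None)
        for shard in self._shards:
            shard.join()
```

Because every key has exactly one owner, the per-shard dictionaries never need locks; the only cross-thread traffic is the request and reply messages.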

A state-of-the-art storage engine can drive 2+ GB/s per core, and schedule things to keep the storage hardware performance close to theoretical while smoothing out transients. It is very easy to run out of storage bandwidth in my experience.

ccleve 2709 days ago

I'd love to learn more. I'm in need of a fast engine.

What proprietary engines do you know of that I can look at?

Do you have more details on your own designs? Anything you can share?

hyc_symas 2708 days ago

You obviously haven't used LMDB. Its zero-overhead reads can saturate NVMe and even Optane DIMMs, something that no other DB engine has accomplished.

infinite8s 2709 days ago

What are your thoughts on LMDB?

stingraycharles 2709 days ago

In our testing, it’s multiple times faster, especially at scale. RocksDB’s compaction becomes a bottleneck fairly quickly when put under strain for extended periods of time.

Helium performs much, much better at scale and doesn’t have compaction issues. It’s proprietary, but in my experience it’s money well spent.

For the record, we were able to fully saturate a 4x NVMe setup on a 96-core server using Helium, while RocksDB achieved about 20% of the full NVMe capacity.

As with all benchmarks, YMMV.

ddorian43 2709 days ago

Did you try LMDB?
