Hacker News new | ask | show | jobs
by dhruba_b 4607 days ago
Hi guys, I am Dhruba and I work in the Database Engineering team at Facebook. We just released RocksDB as an open source project. If anybody has any technical questions about RocksDB, please feel free to ask. Thanks.
5 comments

Hi Dhruba, thanks for volunteering to ask questions.

What are the big algorithmic ideas behind RocksDB?

My understanding is that LevelDB is based on log structured merge trees. These can be deamortized using methods from Overmars's "The Design of Dynamic Data Structures" or Bender et al.'s "Cache-Oblivious Streaming B-trees". How did you reduce latency?

What else was slowing down databases larger than RAM? How did you fix that?

RocksDB has an LSM architecture, similar in nature to HBase, leveldb, etc. But the implementation is based on a Theorem that we will be publishing shortly. I am working on the Theorem with a colleague of mine.

Cache Oblivious B-trees is an interesting paper. Similarly fractal trees. Most of them optimize the case when index nodes are not in memory. However, in our use-cases, we typically configure the system in such a way that most index nodes are in memory.

For an LSM database, the key component is "compaction". You can ingest data only as fast as you can compact, otherwise u get a unstable system. .1 RocksDB replaced the Level-style compaction of leveldb with UniversalStyleCompaction that has reduced write amplification. This boosts performance. 2. RocksDB implemented multi-threaded write, which means that parallel compactions on different parts of the database can occur simultaneously. This boosts performance. 3. Bloom filter for range-scans: this boost read performance 4. MergeType records that allows higher level objects (counters, lists) use only-write instead of a read-modify-write. Improves performance. 5. And many more...

Can you share with us the statement of that theorem?

What is "UniversalStyleCompaction", and why is it capitalized and missing spaces?

How does a Bloom filter for range scans work? Standard Bloom filters (as you know) are for existence only.

I'm guessing this may have been an early draft of some of the statements of the theorem:

http://webcache.googleusercontent.com/search?q=cache:fTxlRmb...

How much do you think RocksDB/LevelDB performance is impacted by the use of relatively coarse-grained locking? Another LevelDB fork, HyperLevelDB [1] implemented a fine grained scheme with performance benefits. Disclaimer: I am working on a (unreleased) fork of LevelDB that uses hardware transactional memory for synchronization using the new TSX-NI instructions present on Haswell processors.

[1] - http://hyperdex.org/performance/leveldb/

This is an excellent question. RocksDB has a lock per shard that is used to protect incrementing/decrementing refcounts on critical metaddata structures for writes. It is never held during actual IO to storage. We have been able to saturate a flash device for a pure-write workload when value size = 800 bytes without bottlenecking on this lock.
Can this be used as a drop-in replacement for LevelDB on queue technologies like ApolloMQ that use LevelDB as the default?
This can be used as a drop-in replacement. It disk format is compatible with leveldb. But once you upgrade to RocksDB, you won't be able to switch back to Leveldb (unless you restore ur data from a backup).

One definite use-case for RocksDB is a message queue that has a high number of puts/gets/deletes.

Also, please look at options.h to determine what options to tune. A not-tuned system will not show you much perf difference from leveldb.

Thank you. My company uses ApolloMQ but in a relatively low-data setting. (10-15 queues, with less than 5000 messages in a queue at a time) I don't know that it would make it worth it to move to a different data store but the idea was fascinating to me so I was curious. Thanks for the response.
https://github.com/facebook/rocksdb/issues/new is a 404.

Appreciated when project is on github and open for issues/PR, etc.

are you not able to access the github repo?
doesn't compile on centos 6, gcc 4.6.3 wanted to file a issue. but 404 stopped me.
Any replication support? (Or any sort of distribution?)
The primary enhancements over LevelDB seem to be parallel compactions of disjoint ranges, to take advantage of cheap seeks on flash storage, and the ability to parameterize core algorithms and data structures to suit a particular anticipated workload. All very cool; anything else major?

Also, there aren't JNI bindings... are there?

Thanks for the contribution. Just started using LevelDB on a project, but deployment will involve fast flash storage and rocksdb looks like a worthy successor.

Jon

Thanks for your comments Jon. RocksDB shares some of its genes with LevelDB.. something like a parent-child relationship.

Please check out Universal Comaction Style, multi-threaded-compaction, pipelined memtables.

I used to have JNI bindings that I pulled in from https://github.com/fusesource/leveldbjni but it was difficult for me to update the JNI everytime we added new apis to RocksDB. It would be great if somebody who needs Java support can implement JNI bindings for RocksDB.

Can it be configured as a distributed No SQL database like Cassandra?