Hacker News new | ask | show | jobs
by polskibus 2709 days ago
I noticed that RocksDB is used very often in OLTP scenarios. What's the OLAP equivalent of RocksDB in OLTP world? Apache Parquet? Apache Arrow? What would you use these days to create a high performance OLAP/OLHybridP engine ?
4 comments

There is a practical engineering reason why the OLAP equivalent doesn't seem to exist. General purpose storage engines, and this applies to RocksDB, are like the C++ STL in that they provide good average performance across a wide range of common cases but are nowhere close to optimal if you have a well-defined type of data model and workload as your use case. You can always gain an integer factor increase in throughput by designing a less generalist implementation with a similar interface.

As with the C++ STL, the limiting factor is the number of tunable parameters available i.e. the amount of internal architectural flexibility built into the implementation. OLTP storage engines are pretty simple, so a manageable number of behavioral parameters can usually get you within 3x of the throughput of a more targeted design, which is acceptable performance for most workloads that are not ingest-intensive.

OLAP-ish storage engines, on the other hand, are at least an order of magnitude more complex to implement and have many more degrees of freedom depending on the expected data model and workload. There is a lot more data model and workload diversity in OLAP than OLTP, which makes implementing the effective internal architectural flexibility and set of tunable parameters that need to be maintained very unwieldy. If you limited yourself to the number of user-definable tuning and configuration parameters as an OLTP-oriented storage engine like RocksDB, the performance gap between a generalist implementation and a more targeted implementation will be more like 10-100x, which needless to say is huge. This makes the practical applicability of any "general purpose" OLAP storage engine that someone would want to use quite narrow, which diminishes the value of implementing a general purpose engine.

This leads to the current reality that there is a zoo of specialist storage engines for OLAP-ish workloads -- graph, time-series, event processing, geospatial, classic DW, etc. Much more generalist OLAP storage engines that do several of these models could exist in theory but the bar for technical sophistication and complexity is much higher than for OLTP.

Open source projects in particular tend to have a natural ceiling on the number of man-years invested to get an initial implementation of an architecture, which inherently limits the expressiveness of that architecture for software with this complexity.

For analytics workloads, your best bet is using compression techniques that let you do operations on the data without decompressing it. A good example is dictionary encoding a set of sorted string keys so you can preform prefix queries by doing a greater than and less than comparison on the integers instead of examining every string entirely.

Once you’ve encoded the data into large enough blocks, you could use any storage engine and write the encoded blocks into it along with metadata for managing which blocks are a part of what tables and partitions of tables.

You can also just use something like Parquet or ORC, but that’s not going to get you the best performance possible.

(author of the blog post here) I'd second ryanworl's comment that the rabbit hole goes much deeper than just storing things in a column oriented disk or in-memory format like Parquet or Arrow. That's just the first step. To get the best performance you have to have your data in an in-memory format that allows you to compress it efficiently, and then perform many relational operations on the compressed form itself. Another example is Run-length and delta encoding a sorted column of integers, and then building relational operators (e.g. a join) that operates directly on the compressed data.

The best explanation for all the various techniques the go into the data structures and operator designs for OLAP workloads is the survey 'The Design and Implementation of Modern Column-Oriented Database Systems' by Abadi, Boncz, Harizopoulos, Idreos, and Madden: http://db.csail.mit.edu/pubs/abadi-column-stores.pdf

I know there are many techniques that used together give good performance (optimal memory layout, compression, vectorization, etc. etc.), however I'd like to use a package that does a lot of it, same what RocksDB (or SQLite) does for OLTP cases. Is there something like that? If not, what's out there that gives the best foundation for building OLAP functionalities on top of it?
Check out Druid [1], an open-source analytical database with tightly-coupled storage and processing engines designed for OLAP. In particular it implements a memory-mappable storage format, indexes, compression, late tuple materialization, and query engines that can operate directly on compressed data. There is a patch out to add vectorized processing as well, so you should expect to see that show up in a future release.

Its storage format and processing engine aren't designed to be embedded in the same way as RocksDB and SQLite are, but you certainly could if you wanted to, since the code is fairly modular. Or you could use it as a standalone service as it was designed to be used.

[1] http://druid.io/

There's also Clickhouse [1] which seems to scale much better than Druid, and has similar architectural decisions to make it somewhat general as a columnar store for OLAP uses. Cloudflare wrote an article in the past where they compared Clickhouse and Druid and they chose Clickhouse because they could get similar performance on the same workload with 9 nodes in Clickhouse which would require hundreds for Druid. They built all of the DNS analytics at CloudFlare on Clickhouse [2].

Disclosure: I work at Percona, and we've seen a lot of our customers make use of Clickhouse and have begun some of our own services work around it in Consulting. It's now a primary database talked about at our conferences, and we post about it regularly. [3]

[1]: https://clickhouse.yandex/ [2]: https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-q... [3]: https://www.percona.com/blog/2018/10/01/clickhouse-two-years...

There is a very good article[1] by one of the Druid committers about Clickhouse/Drui/Pinot that goes into some details on why the Cloudflare tests turned out the way they did.

[1]:https://medium.com/@leventov/comparison-of-the-open-source-o...

That article is better than expected, and it matches my own experience with CH (it was a great match for our use-case, and some of those reasons are in the article; and also, we could have used an inverted index, would one have been available; surprisingly survived w/o it).
Does ClickHouse support fine grained data security (for example role A gives access only to tuples with column X==123)?
No [1]. ClickHouse is a fairly low-level tool. If you need that kind of thing, you build an ACL-aware app on top of it.

[1] https://clickhouse.yandex/docs/en/operations/access_rights/

That's what Apache Arrow is, you had the right choice. That solves the processing component and you can use any number of on-disk formats like Parquet and ORC.

And the hybrid of OLAP + OLTP is usually called HTAP.

Is it possible to insert new tuples to arrow model without rebuilding it from scratch?
No, I am not aware of any storage engine that provides that out of the box. The techniques are very tied into what your query processing engine can do and expects the data to look like.

For example, do you materialize tuples immediately, or do you fully run it through your processing pipeline and not materialize until the end?

Your storage engine and format needs to be at least somewhat involved in answer that question, because you need to know what data to read and when.

Unfortunately most of the systems that build what you're describing are closed source (e.g. Snowflake, Microsoft SQL Server, Vertica, Teradata). There isn't an open-source project that does all of those things.
What about Presto?
Presto is more of a distributed SQL solution: you run it on a cluster of nodes, point them at your storage later, it’s more optimised at querying very large datasets and it’s not built or tuned for high performance (in terms of latency or execution time).
Apache Arrow is your best bet, but it's still very much a new project without a lot of the things you're looking for.
> I noticed that RocksDB is used very often in OLTP scenarios.

My experience with it has been most stream processing in kafka streams ect as local state store.

S3 + ORC|Parquet + PrestoDB works very well