by polskibus
2709 days ago
I noticed that RocksDB is used very often in OLTP scenarios. What is the OLAP counterpart to what RocksDB is in the OLTP world? Apache Parquet? Apache Arrow? What would you use these days to build a high-performance OLAP/OLHybridP engine?
As with the C++ STL, the limiting factor is the number of tunable parameters available, i.e. the amount of internal architectural flexibility built into the implementation. OLTP storage engines are pretty simple, so a manageable number of behavioral parameters can usually get you within 3x of the throughput of a more targeted design, which is acceptable performance for most workloads that are not ingest-intensive.
OLAP-ish storage engines, on the other hand, are at least an order of magnitude more complex to implement and have many more degrees of freedom, depending on the expected data model and workload. There is a lot more data model and workload diversity in OLAP than in OLTP, so the internal architectural flexibility and the set of tunable parameters that a generalist engine would have to implement and maintain become very unwieldy. If you limited yourself to the same number of user-definable tuning and configuration parameters as an OLTP-oriented storage engine like RocksDB, the performance gap between a generalist implementation and a more targeted implementation would be more like 10-100x, which, needless to say, is huge. This makes the practical applicability of any "general purpose" OLAP storage engine quite narrow, which in turn diminishes the value of implementing one.
This leads to the current reality that there is a zoo of specialist storage engines for OLAP-ish workloads: graph, time-series, event processing, geospatial, classic DW, etc. Much more generalist OLAP storage engines that handle several of these models could exist in theory, but the bar for technical sophistication and complexity is much higher than for OLTP.
Open source projects in particular tend to have a natural ceiling on the number of man-years invested to get an initial implementation of an architecture, which inherently limits the expressiveness of that architecture for software with this complexity.