Hacker News new | ask | show | jobs
by manimino 1385 days ago
TFA appears to be about adapting SQLite for OLAP workloads. I do not understand the rationale. Why try to adapt a row-based storage system for OLAP? Why not just use a column store?
3 comments

SQLite is significantly better at OLTP and being a blob strorage than DuckDB, and it doesn't want to sacrifice those advantages and compatibility if OLAP performance can be improved independently. In my experience for diverse workloads it is more practical to start with a row-based structure and incrementally transform it into a column-based one. Indeed in the paper there is a suggested approach that trades space for improved OLAP performance.
It is certainly possible to have a single system that can effectively process high volumes of OLTP traffic while at the same time performing OLAP operations. While there are systems that are designed to do one or the other type of operation well, very few are able to do both. https://www.youtube.com/watch?v=F6-O9v4mrCc
It seems like one idea in there is to store it both ways automatically (the HE variant)! That might be better then manually continually copying between your row store and your column store.
Great discussion here. As one of the co-authors of the paper, here is some additional information.

If you need both transactions and OLAP in the same system, the prevalent way to deliver high performance on this (HTAP) workload is to make two copies of the data. This is what we did in the SQLite3/HE work (paper: https://www.cidrdb.org/cidr2022/papers/p56-prammer.pdf; talk: https://www.youtube.com/watch?v=c9bQyzm6JRU). That was quite clunky. This two copy approach not only wasted storage but makes the code complicated, and it would be very hard to maintain over time (we did not want to fork the SQLite code -- that is not nice).

So, we approached it in a different way and started to look for how we could get higher performance on OLAP queries working as closely with SQLite's native query processing and storage framework.

We went through a large number of options (many of them taken from the mechanisms we developed in an earlier Quickstep project (https://pages.cs.wisc.edu/~jignesh/publ/Quickstep.pdf) and concluded that the Bloom filter method (inspired by a more general technique called Look-ahead Information Passing https://www.vldb.org/pvldb/vol10/p889-zhu.pdf) gave us the biggest bang for the buck.

There is a lot of room for improvement here, and getting high OLAP and transaction performance in a single-copy database system is IMO a holy grail that many in the community are working on.

BTW - the SQLite team, namely Dr. Hipp (that is a cool name), Lawrence and Dan are amazing to work with. As an academic, I very much enjoyed how deeply academic they are in their thinking. No surprise that they have built an amazing data platform (I call it a data platform as it is much more than a database system, as it has many hooks for extensibility).