Hacker News new | ask | show | jobs
by cheez 2223 days ago
I read the post and had a couple of points:

1. Is DuckDB similar to having indexes on each column? Because generally when something is slow, the solution is indexes. I have a 100 GB database which records real time data and is lightning fast because of some minor tuning.

2. The example of STDDEV not be available shows the author's unfamiliarity with SQLite which worries me.

https://docs.python.org/2.7/library/sqlite3.html#sqlite3.Con...

Could very easily have made a similar interface if necessary.

2 comments

Hi, one of the authors of DuckDB here.

ad 1) An index is an additional data structure that takes space and needs maintenance when the data changes. This is especially problematic for realtime data. DuckDBs main data structure and implementation is designed for efficient scanning without needing an additional index ("vectorised columnar").

ad 2) We support STDDEV_SAMP and STDDEV_POP. There is also an interface to define your own functions. We are very aware of how SQLite is doing things and even support their client API. But we also might eventually add their way of adding UDFs from Python.

Indexes don't solve the same problem as columnar because with an index you still have to dereference a pointer, read the page (performing I/O if on-disk), and locate each tuple you want within the page--the tuples you need to aggregate from may be scattered across many pages, and you have to read the whole page into memory even if you only want one column.

With a column store, the values of the same column are stored contiguously so if you're summing a column for example, you can read one or more pages that could consist entirely of values the column you need to sum, sequentially. It's much more efficient for aggregations due to cache locality--much fewer cache misses and I/O operations needed to summarize a column.

The major trade-off is that updates and deletes are more expensive in a column store. But this is usually fine for OLAP workloads.

Unsure if DuckDB implements this in its column store, but another common feature is compression.

A page of field values is a much better structure to compress than a page of tuple values. This is especially true of the sort of denormalized data that is typical in analytical workloads. E.g. dates may only be considered in a period of a few years - this will take much less space to encode than the full allocation for a date data type. Similarly we may have categorical fields with only a few values. Imagine a product category with only 8 values. This can take a set of numbers which may be otherwise stored as an integer, and encode it to 3 bits, plus some space for a dictionary.

Another easy bit of compression to apply to columns is run length encoding. Dates, again, serve as a good example. If the data is sorted by date, you can choose to store a data structure that captures (start index, date value, end index). If you had 1,000 records on the same date, you could encode this with only three likely-integer-sized values, rather than 1,000 repetitions of a single likely-integer-sized value.

Such compression offers several benefits. First, it can reduce data volumes, and allow more query workload to come from RAM, either through DB or filesystem caching, or by an in-memory architecture. Additionally, the smaller the data size is, the faster a scan can be completed. Analytic workloads tend toward queries that require a full table scan (though with a columnstore, it's more a full column scan). If the data size is smaller, then you can stream the entire thing through whatever your query is faster.

Yep, compressibility is another big win made possible by columnar layout, and it looks like DuckDB implements at least some form of column compression.
Ah yes, embarrassing not to remember basic DB architecture :-)

Do you have an example of a workload that benefits from this structure?

Yes, it's geared toward reporting / analytics / data science workloads where the most common read operation is calculating aggregate metrics like count/min/mean/max and histograms. Statistics stuff.
Thank you very much for your patience :-)
Taking stats on a collection of time-ordered set of sensor observations.