| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by greggyb 2223 days ago

Unsure if DuckDB implements this in its column store, but another common feature is compression.

A page of field values is a much better structure to compress than a page of tuple values. This is especially true of the sort of denormalized data that is typical in analytical workloads. E.g. dates may only be considered in a period of a few years - this will take much less space to encode than the full allocation for a date data type. Similarly we may have categorical fields with only a few values. Imagine a product category with only 8 values. This can take a set of numbers which may be otherwise stored as an integer, and encode it to 3 bits, plus some space for a dictionary.

Another easy bit of compression to apply to columns is run length encoding. Dates, again, serve as a good example. If the data is sorted by date, you can choose to store a data structure that captures (start index, date value, end index). If you had 1,000 records on the same date, you could encode this with only three likely-integer-sized values, rather than 1,000 repetitions of a single likely-integer-sized value.

Such compression offers several benefits. First, it can reduce data volumes, and allow more query workload to come from RAM, either through DB or filesystem caching, or by an in-memory architecture. Additionally, the smaller the data size is, the faster a scan can be completed. Analytic workloads tend toward queries that require a full table scan (though with a columnstore, it's more a full column scan). If the data size is smaller, then you can stream the entire thing through whatever your query is faster.

1 comments

kyllo 2223 days ago

Yep, compressibility is another big win made possible by columnar layout, and it looks like DuckDB implements at least some form of column compression.

link