Hacker News new | ask | show | jobs
by smbrian 2223 days ago
The pitch is faster, and more space efficient since column stores are far better for analytics than row stores. Some benchmarks that found ~5-10x speedup: https://uwekorn.com/2019/10/19/taking-duckdb-for-a-spin.html

Consider someone who analyzes medium-sized volumes of (slowly-changing) data -- OLAP, not OLTP. People who need to do this primarily have 2 alternatives:

* columnar database (redshift, snowflake, bigquery)

* a data lake architecture (spark, presto, hive)

The latter can be slow and wasteful, because the data is stored in a form that allows very limited indexing. So imagine you want query speeds that require the former.

Traditional databases can be hugely wasteful for this usecase -- space overhead due to no compression, slow inserts due to transactions. The best analytics databases are closed-source and come with vendor lockin (there are very few good open-source column stores -- clickhouse is one, duckdb is another). Most solutions are multi-node, so they come with operational complexity. So DuckDB could fill a niche here -- data that's big enough to be unwieldy, but not big enough to need something like redshift. It's analogous to the niche SQLite fills in the transactional database world.

2 comments

I read the post and had a couple of points:

1. Is DuckDB similar to having indexes on each column? Because generally when something is slow, the solution is indexes. I have a 100 GB database which records real time data and is lightning fast because of some minor tuning.

2. The example of STDDEV not be available shows the author's unfamiliarity with SQLite which worries me.

https://docs.python.org/2.7/library/sqlite3.html#sqlite3.Con...

Could very easily have made a similar interface if necessary.

Hi, one of the authors of DuckDB here.

ad 1) An index is an additional data structure that takes space and needs maintenance when the data changes. This is especially problematic for realtime data. DuckDBs main data structure and implementation is designed for efficient scanning without needing an additional index ("vectorised columnar").

ad 2) We support STDDEV_SAMP and STDDEV_POP. There is also an interface to define your own functions. We are very aware of how SQLite is doing things and even support their client API. But we also might eventually add their way of adding UDFs from Python.

Indexes don't solve the same problem as columnar because with an index you still have to dereference a pointer, read the page (performing I/O if on-disk), and locate each tuple you want within the page--the tuples you need to aggregate from may be scattered across many pages, and you have to read the whole page into memory even if you only want one column.

With a column store, the values of the same column are stored contiguously so if you're summing a column for example, you can read one or more pages that could consist entirely of values the column you need to sum, sequentially. It's much more efficient for aggregations due to cache locality--much fewer cache misses and I/O operations needed to summarize a column.

The major trade-off is that updates and deletes are more expensive in a column store. But this is usually fine for OLAP workloads.

Unsure if DuckDB implements this in its column store, but another common feature is compression.

A page of field values is a much better structure to compress than a page of tuple values. This is especially true of the sort of denormalized data that is typical in analytical workloads. E.g. dates may only be considered in a period of a few years - this will take much less space to encode than the full allocation for a date data type. Similarly we may have categorical fields with only a few values. Imagine a product category with only 8 values. This can take a set of numbers which may be otherwise stored as an integer, and encode it to 3 bits, plus some space for a dictionary.

Another easy bit of compression to apply to columns is run length encoding. Dates, again, serve as a good example. If the data is sorted by date, you can choose to store a data structure that captures (start index, date value, end index). If you had 1,000 records on the same date, you could encode this with only three likely-integer-sized values, rather than 1,000 repetitions of a single likely-integer-sized value.

Such compression offers several benefits. First, it can reduce data volumes, and allow more query workload to come from RAM, either through DB or filesystem caching, or by an in-memory architecture. Additionally, the smaller the data size is, the faster a scan can be completed. Analytic workloads tend toward queries that require a full table scan (though with a columnstore, it's more a full column scan). If the data size is smaller, then you can stream the entire thing through whatever your query is faster.

Yep, compressibility is another big win made possible by columnar layout, and it looks like DuckDB implements at least some form of column compression.
Ah yes, embarrassing not to remember basic DB architecture :-)

Do you have an example of a workload that benefits from this structure?

Yes, it's geared toward reporting / analytics / data science workloads where the most common read operation is calculating aggregate metrics like count/min/mean/max and histograms. Statistics stuff.
Thank you very much for your patience :-)
Taking stats on a collection of time-ordered set of sensor observations.
I'd love to see how it compares to MonetDB. I've used MonetDB successfully to look at smallish data set (few hundred millions rows) that neither postgres nor sqlite could complete queries at a useful speed on my laptop (at the time an x220 with a spinning disk).

https://www.monetdb.org/Home

The primary difference for analytic workloads is the processing model. MonetDB uses a column-at-a-time processing model which causes large intermediates to be materialized in memory. This significantly increases memory usage, and causes poor performance when data exceeds memory. DuckDB uses a vectorized execution engine, which materializes only small subsets of the columns. This increases cache-locality, decreases memory usage and also improves parallel execution capabilities (although parallel execution is still WIP currently).
DuckDB is developed by the same database research group at CWI that developed MonetDB, so they have applied a lot of the lessons they learned from MonetDB.
Learnings from both Monet AND Vectorwise.