Hacker News
by mytherin 1700 days ago
DuckDB developer here. DuckDB is a regular RDBMS that has persistent ACID storage, but is tuned towards analytical workloads, i.e. read-heavy workloads with aggregates that require full scans of the data. Any data you write to tables is stored persistently on disk, and not all your data needs to fit in memory either.

Our tagline is “SQLite for analytics”, as DuckDB is an in-process database system similar to SQLite that is geared towards these types of workloads.

DuckDB has a flexible query engine, and also has support for directly running SQL queries (in parallel!) on top of Pandas [1] and Parquet [2] without requiring the data to be imported into the system.

[1] https://duckdb.org/2021/05/14/sql-on-pandas.html

[2] https://duckdb.org/2021/06/25/querying-parquet.html


Maybe this is a silly question: Why is the A/B choice between a row-major database and a column-major database, instead of between row-major tables and column-major tables within a flexible database?

What's stopping the other leading brands from implementing columnar storage, queries, and such with a COLUMN MAJOR table attribute?

Some databases do offer both, but it is much more involved than just changing the storage model. The entire query execution model needs to adapt to columnar execution. You can simulate a column-store model in a row-store database by splitting a table into a series of single-column tables, but the performance benefits you will capture are much smaller than those of a system that is designed and optimized for column-store execution.
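
To make the "series of single-column tables" idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module as the row-store. The table and column names (`sales`, `sales_region`, `sales_amount`) are made up for illustration, and this only shows the layout, not any performance characteristics:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Row-store layout: one wide table (hypothetical example schema).
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "EU", 10.0), (2, "US", 20.0), (3, "EU", 5.5)],
)

# Simulated column-store layout: one single-column table per attribute,
# all sharing the same row id so rows can be stitched back together.
conn.execute("CREATE TABLE sales_region (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE sales_amount (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO sales_region SELECT id, region FROM sales")
conn.execute("INSERT INTO sales_amount SELECT id, amount FROM sales")

# An aggregate over one attribute only has to read that attribute's table.
total_split = conn.execute("SELECT SUM(amount) FROM sales_amount").fetchone()[0]
total_wide = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Anything that touches several attributes now needs a join on the row id.
per_region = conn.execute(
    """
    SELECT r.region, SUM(a.amount)
    FROM sales_region AS r
    JOIN sales_amount AS a ON a.id = r.id
    GROUP BY r.region
    ORDER BY r.region
    """
).fetchall()

print(total_split, total_wide, per_region)
```

The split layout reads less data for single-column aggregates, but the engine underneath still processes one row at a time and pays for joins to reassemble rows, which is why the gains stay limited.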

SQL execution over columnar data is quite different from execution in a row-based database, so it's effectively a different database engine. A columnar data store offers several advantages, because it usually employs a form of vocabulary (dictionary) compression. For instance, obtaining the distinct values of a field in a columnar DB is much faster because the result is typically just the vocabulary of the field, so it doesn't even require a full table scan. Many other columnar computations, such as filtering or aggregation, can be done directly on the compressed data without decompressing it.
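
A toy sketch of the vocabulary (dictionary) compression idea in plain Python, to show why distinct values, filters, and aggregates can work on the encoded form. The helper name `dictionary_encode` and the sample column are assumptions for illustration; real column stores use far more elaborate encodings:

```python
def dictionary_encode(values):
    """Return (dictionary, codes): each distinct value is stored once,
    and the column itself becomes a list of small integer codes."""
    dictionary = []   # code -> value
    index = {}        # value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


column = ["EU", "US", "EU", "ASIA", "US", "EU"]
dictionary, codes = dictionary_encode(column)

# DISTINCT: the dictionary already is the answer, no scan of the codes needed.
distinct_values = list(dictionary)

# Filter on compressed data: translate the predicate to a code once,
# then compare integers instead of decoding every value.
target_code = dictionary.index("EU")
matching_rows = [row for row, code in enumerate(codes) if code == target_code]

# Aggregate on compressed data: count per code, decode only the final result.
counts = [0] * len(dictionary)
for code in codes:
    counts[code] += 1
count_by_value = dict(zip(dictionary, counts))

print(distinct_values, matching_rows, count_by_value)
```

Only the last step maps codes back to values, so the bulk of the work runs over compact integers.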