Hacker News new | ask | show | jobs
by 0x008 1226 days ago
> DuckDB is an in-process SQL OLAP database management system

I don’t understand what it means. Can someone explain? I don’t get why they put such a complicated claim with unexplained acronyms on their homepage.

When I shop for a db, when should I consider duck DB compared to for example Postgres or MySQL? Or do they compete with arrow or parquet? To me it’s unclear because they don’t say what they compete against.

5 comments

> in-process

think of sqlite.

> OLAP

think data warehouse. Columnar for analytical workloads.

If you want something "in-process" then you're probably going to have to decide between sqlite and duckdb.

If your workload is

  1) individual fast and frequent read-write operations (OLTP), then you should probably pick sqlite.

  2) massive amounts of read-heavy analytical operations (OLAP), then you should probably pick duckdb.
That's the decision process stated as simply as possible, but obviously there might be other options out there to consider.

Postgres and MySQL are really better suited for out-of-process (shared server) workloads where multiple clients are interacting with the data and resources/compute. They are both row-based OLTP databases, although I do believe Postgres has an option for both table types (HTAP).

Parquet is a file format, just as Avro, JSON, CSV,... are.

Arrow (still grasping this one) is a way that data can be exchanged between systems and processes in such a way that the data is optimized in such a way that doesn't have to go through the extra steps of being shuffled around in memory. For example the data that is returned from a SQL query can be used directly in a Python/Scala/etc dataframe if using Arrow.

I empathize with you about how confusing it all seems, but your curiosity will serve you well. I remember when I was asking these very same questions and it was being led down this road that opened my mind to the world of databases and data engineering.

Google "olap vs oltp" and when you get that, then google "olap vs oltp vs htap".

Or maybe read "The Log: What every software engineer should know about real-time data's unifying abstraction". I know how lame it sounds, but this article really did change the way I think about data.

https://engineering.linkedin.com/distributed-systems/log-wha...

> If you want something "in-process" then you're probably going to have to decide between sqlite and duckdb.

Now THAT is easy to understand. Thank you.

How do you use DuckDB in production for a company? Store SQLite file in something like S3, sync it once per day and run DuckDB with it?
While DuckDB is an exciting and amazing project, I think the world that will open up around it is just as exciting, and these are exactly the kinds of questions that get me excited.

DuckDB is to Snowflake/BigQuery/DataBricks/etc...

what

sqlite is to MySQL/Postgres/Oracle/etc... (let's ignore for the moment that Postgres and Oracle have HTAP modes)

In other words, I don't think DuckDB aims to replace or compete against the big OLAP products/services such as Snowflake, BigQuery, DataBricks. Instead it's a natural and complementary component in the analytical stack.

Of course you'll see in the numerous blogs about how amazing it is for data exploration, wrangling, jupyter, pandas, etc... but personally I think the questions about how it could be used in production use-cases a lot more fascinating.

Data warehouses can become quite expensive to run and operate when you either have to allow

1) front-end analytical applications to connect to them directly to do analytics on the fly, or

2) if you pre-calculate ALL the analytics (whether they're used or not) that are offloaded to a cheaper and "faster" OLTP system.

I'm excited about how DuckDB can sort of bridge these two solutions.

1) Prepare semi-pre-calculated data on your traditional data warehouse. (store in internal table or external table like iceberg, delta, etc)

2) Ingest the subsets of this data needed for different production workloads in to DuckDB for last-mile analytics and slicing/dicing.

DuckDb could either interact with your

1) push-down queries to internal tables via their database scanners (arrow across the wire. postgres_scanner, hopefully more to come), or

2) prune external tables (iceberg, delta, etc) to get the subsets (interact with catalogs) of semi-pre-calculated analytical data on demand. Think intelligently partitioned parquet files on S3.

Last-mile analytics, pagination, etc can all be done within DuckDb either directly on your browser (WASM) or on the edge with something like AWS Lambda. This could and hopefully will result in reducing the cost of keeping data warehouses around to serve up fully pre-calculated analytics to consumers as well as reducing the complexity of your analytics stack/arch.

Do you work on my team? This is exactly how we're using Duckdb with Databricks as the massive data bulldozer and Duckdb as the scalpel.
Definitely not since we use Snowflake, not Databricks. I'd love to hear more about your solution though!
More like -- you have a bunch of parquet/csv files in s3 (data lake/house/shore/party/whatever), and duckdb can query them using sql, from python bindings or via a cli.
It's an interesting question... DuckDB is a library. Just like SQLite is a library.

It's not designed for concurrent queries, clients connecting to a database, etc. I know there will be companies built around that problem space.

If "serverless" database is a thing, is there a category of software that is "production-less"? :)

(The above is a joke, lol.)

https://en.wikipedia.org/wiki/Online_analytical_processing as opposed to https://en.wikipedia.org/wiki/Online_transaction_processing

DuckDB is when you need to do OLAP analysis, and the data fits in a single node (your laptop), but it's too large for plain excel.

technically you can use PG/MySQL/Python+Numpy+Pandas to process those data for that use case as well, but DuckDB does it easier/faster most of the time.

What do frontends for this type of stuff look like these days? I remember one of my first jobs out of college they had wired up an OLAP cube into Excel so that you could import it into a pivot table and arbitrarily slice and dice data by dragging and dropping columns. I thought that was the coolest thing. Is there a modern day non excel equivalent of this? Cube.js?
I'd imagine Looker, PowerBI, Tableau are probably the front-ends you'd most commonly hear about, although I'm not sure DuckDb is supported or acknowledged by the parent companies (yet).
ok That sounds pretty interesting.. i am curious to find out when I would use this over parquet or pandas.
parquet is just a data file format. Both pandas and DuckDB can query parquet files.

It might be confusing that DuckDB does have its own columnar format, but it's more helpful to think it like "just a query engine (or library) for wrangling tables (or data frames)", i.e. same as pandas.

If you are a web-developer, building application that store username, email, password and all that stuff, where you'd have to do multiple row level fetch of the data with a lot of parallel connections then you'd use postgres or the like.

If you are a data analyst, analysing a lot of data, which isn't updated in real time, where you'd have to do joins, aggregates which are usually column wise functions, you'd use something like a OLAP db, where duckdb is great for that!

I wouldn't say compete. DuckDB feels like it fits right between postgres and parquet. I think of it as like SQLite for columnar data. Sql semantics but optimized for operations against columns instead of rows.
> I don’t understand what it means.

It's like Sqlite(OLTP) but for OLAP.

This is still confusing, what do I use this for exactly?
Very, very roughly: OLTP is for dealing with one row at a time (TP = transaction processing; think “handling a sale”). OLAP is for combining many rows and extracting useful information from them (AP = analytics processing; think “figure out how many sales we had of each type of unit last month”). So for OLAP, you get more emphasis on features like joins, grouping and other analysis.
OLAP databases are column oriented and are optimized for querying large amounts of high dimensional data (E.g. many columns). They're usually used for analytics. They don't support some features that OLTP databases have, like transactions.

OLTP databases are your standard database like MySQL, Postgres, etc.

You use an OLAP database if you want to query billions of rows over many different columns. Obviously they can be used for smaller workloads, but I'm exaggerating to show their strengths.