Hacker News new | ask | show | jobs
by mbostock 853 days ago
If you generate Apache Parquet files you can use DuckDB to make range requests and not download everything to the client. This is pretty magical and allows you to have surprisingly large datasets still queryable at interactive speeds.

But the general idea is to not send everything the client — to be more deliberate and restrictive in what you send, and also what you show. So you probably shouldn’t use this for a general-purpose pivot table that’s trying to show “everything” in your data warehouse and enable ad hoc exploration. You’d instead design more specific, opinionated views, and then craft corresponding data loaders that generate specific pre-aggregated datasets.

2 comments

I’ve been searching for alternatives to crossfilter.js for slicing large multidimensional datasets at sub 10ms speeds. Would DuckDB-Wasm be a suitable alternative?
It's not always clear which pushdowns are available in DuckDB. For instance, while x = y has been available for a while, x in (y, z, ...) hasn't. The DuckDB team seems very eager and motivated to get all the pushdown functionality working though, so hopefully becomes a non-issue soon (perhaps already in 0.10.0).

Another way to use DuckDB if you're warehouse supports it would be e2e Arrow (no col->row->col overhead).

    warehouse --> ADBC --> arrow --> DuckDB
Of course this differs in that you would be reading from the warehouse directly, but in my experience fully pre-aggregating data and then staging it (keeping parquet files on S3 up to date) might solve one issue, but results in unimaginable issues. Perhaps the sweet spot might be something like Iceberg in the middle?

    warehouse --> iceberg table (parquet) --> DuckDB

> craft corresponding data loaders

Do you have an example of this using DuckDB? I'm very interested in seeing an actual implementation of Observable's data loaders combined with DuckDB (or any other SQL DB)

edit: nm, found it. https://observablehq.com/framework/lib/duckdb

edit2: eh, I didn't even realize who i was responding to, lol. The more I read into this, the more I see how this is all heavily based on static files. So the static parquet files thing makes more sense, the solution I added makes little. Although I guess you could add a static iceberg table and interact with its manifest with the duckdb iceberg extension.