| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by amluto 37 days ago

Not a specific link, but a DuckDB table is a bag of rows (logically — it’s physically arranged as a column store), and those rows are not ordered in a way that is expressed in the schema. If you do a big analytic query, DuckDB will (extremely efficiently) scan the whole table and will blow many other tools out of the water while doing so. But if you want to see the sensor value of a specific sensor at a specific time, you want an index of some sort, not a full table scan. And if you want to do a rollup of some but not all sensors, you end up modifying some stuff in the middle of a table, which is not amazingly efficient. DuckDB has an optional index, but I don’t think it’s meant for this.

You could certainly create a directory with a Parquet file for each (entity id, time range), and you could probably convince the DuckDB query engine to understand that (using Ducklake? raw Hive can only barely do this), but I don’t think that DuckDB will binary search for you. (And binary search is actually pretty lousy for this use case.)

Clickhouse has explicitly ordered tables:

https://clickhouse.com/docs/engines/table-engines/mergetree-...

1 comments

pjot 37 days ago

DuckDB has indexes. One specifically for highly selective point lookups. The structure is called an adaptive radix tree. The workloads you’re describing are sort of duckdb’s bread and butter (even the parquet stuff it’ll handle out of the box)

https://duckdb.org/2022/07/27/art-storage

link