

by wenc 684 days ago

Not real time, just historical. (I don’t see why it can’t be used for real time, though I haven’t thought through the caveats.)

Also, not sure what you mean by Parquet not being good at appending. On the contrary, Parquet is designed for an append-only paradigm (like Hadoop back in the day). You can just drop in a new Parquet file and it’s appended.

If you have 1.parquet, all you have to do is drop 2.parquet in the same folder or Hive hierarchy. Then query:

  SELECT * FROM '*.parquet';

DuckDB automatically scans all the Parquet files matching that glob when it runs the query. If there’s a predicate, it uses the min/max statistics in Parquet’s footer metadata to skip row groups, and so whole files, that can’t contain the requested data, so it’s very fast.
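To make the skipping idea concrete, here is a minimal Python sketch of min/max pruning. It is not DuckDB's implementation: the `FileStats` class, the file names, and the date ranges are all made up, standing in for the per-column statistics that Parquet keeps in its metadata.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for one column's min/max statistics in one file."""
    path: str
    min_day: date
    max_day: date


def files_to_scan(stats, lo, hi):
    """Keep only files whose [min_day, max_day] range overlaps the predicate [lo, hi]."""
    return [s.path for s in stats if s.max_day >= lo and s.min_day <= hi]


# Made-up catalog: each file happens to cover one month of data.
catalog = [
    FileStats("1.parquet", date(2023, 1, 1), date(2023, 1, 31)),
    FileStats("2.parquet", date(2023, 2, 1), date(2023, 2, 28)),
    FileStats("3.parquet", date(2023, 3, 1), date(2023, 3, 31)),
]

# Predicate in spirit: WHERE day BETWEEN '2023-02-10' AND '2023-02-20'
selected = files_to_scan(catalog, date(2023, 2, 10), date(2023, 2, 20))
print(selected)
```

Only the file whose range overlaps the predicate is read; the other two are ruled out from the statistics alone, without touching their data.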

In practice we use a directory structure called Hive partitioning, which helps DuckDB do partition elimination to skip over irrelevant partitions, making it even faster.

https://duckdb.org/docs/data/partitioning/hive_partitioning
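Partition elimination can be sketched in a few lines of Python. This is only an illustration of the idea, not how DuckDB does it internally: the `events/year=.../month=...` layout and the helper names `partition_values` and `prune` are hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Parse Hive-style `key=value` directory names from a file path."""
    values = {}
    for part in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        key, sep, value = part.partition("=")
        if sep:  # only directories of the form key=value carry partition info
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only paths whose partition values match every requested key."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == str(v) for k, v in wanted.items())
    ]


# Hypothetical layout: events/year=YYYY/month=M/<file>.parquet
paths = [
    "events/year=2023/month=1/1.parquet",
    "events/year=2023/month=2/1.parquet",
    "events/year=2023/month=2/2.parquet",
    "events/year=2024/month=1/1.parquet",
]

# Equivalent in spirit to: WHERE year = 2023 AND month = 2
selected = prune(paths, year=2023, month=2)
print(selected)
```

The point is that the decision is made from the directory names alone, so files in non-matching partitions never need to be opened.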

Parquet is great for appending!

Now, it's not so good at updating, because it's a write-once format (not read-write). Updating a single record in a Parquet file entails regenerating the entire file. So if you have late-arriving updates, you need to do extra work to identify the affected partition and overwrite it. Either that, or use bitemporal modeling (add a data arrival timestamp [1]) and select the latest-arriving version of each record in your query (which costs more compute). If your existing data changes a lot, Parquet is not a good format for you; you should look into Timescale (a time-series database built on Postgres).
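Here is a rough sketch of the arrival-timestamp approach. It uses Python's built-in sqlite3 purely as a stand-in so it runs anywhere; the table, the column names (`sensor_id`, `event_time`, `arrived_at`), and the data are invented for illustration. The same query shape should carry over to DuckDB reading Parquet, but treat that as an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (sensor_id TEXT, event_time TEXT, value REAL, arrived_at TEXT)"
)

# Append-only: a correction is a new row with a later arrived_at, never an UPDATE.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        ("a", "2023-02-01T00:00", 10.0, "2023-02-01T00:05"),
        ("b", "2023-02-01T00:00", 20.0, "2023-02-01T00:05"),
        ("a", "2023-02-01T00:00", 11.5, "2023-02-03T09:00"),  # late-arriving correction
    ],
)

# "Latest date clause": for each (sensor_id, event_time), keep the newest arrival.
LATEST = """
SELECT r.sensor_id, r.event_time, r.value
FROM readings AS r
JOIN (
    SELECT sensor_id, event_time, MAX(arrived_at) AS arrived_at
    FROM readings
    GROUP BY sensor_id, event_time
) AS latest
  ON  r.sensor_id  = latest.sensor_id
  AND r.event_time = latest.event_time
  AND r.arrived_at = latest.arrived_at
ORDER BY r.sensor_id
"""
rows = conn.execute(LATEST).fetchall()
print(rows)
```

All three rows stay on disk; the query just hides the superseded one. That is the extra compute mentioned above: every read pays for the grouping.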

[1] https://en.wikipedia.org/wiki/Bitemporal_modeling