Hacker News
by Demiurge 220 days ago
I completely agree. Real-world queries involve complicated joins, aggregations, staged intermediate datasets, and further manipulations. Even if you start with a single coherent 650 GB dataset, any downstream product built on it will spawn multiple copies and iterations, which also have to be reproducible, tracked in source control, and visualized in other tools in real time. Honestly, yes, Parquet and DuckDB make all this easier than awk. But they still need to be integrated into a larger system.