HN Mirror

by MrPowers 1201 days ago

This is cool, but I'd like to give some higher-level context about querying JSON files.

JSON is a row-based file format. It doesn't give query engines any way to skip rows or columns when running a query, so every record has to be read and parsed in full, even if the query only touches one field. That's really inefficient.
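To make that concrete, here is a minimal sketch using only the Python standard library. The data and the field names (`id`, `country`, `amount`) are made up for illustration. Summing a single field still means parsing every field of every record:

```python
import io
import json

# Hypothetical newline-delimited JSON; the field names are invented for this example.
RAW = "\n".join(
    json.dumps({"id": i, "country": "US" if i % 2 else "DE", "amount": i * 10})
    for i in range(5)
)


def sum_amount(stream):
    """Sum the 'amount' field and count how many records had to be parsed."""
    total = 0
    records_parsed = 0
    for line in stream:
        # json.loads decodes the whole record, including 'id' and 'country',
        # even though only 'amount' is used below.
        record = json.loads(line)
        records_parsed += 1
        total += record["amount"]
    return total, records_parsed


total, records_parsed = sum_amount(io.StringIO(RAW))
print(total, records_parsed)
```

There is no way to tell the parser "only give me `amount`" or "skip the rows that can't match" without reading through the bytes of every record first.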

Column-based file formats such as Parquet allow query engines to skip entire columns of data. Parquet also stores metadata for each row group (min/max statistics, for example), which lets query engines skip whole groups of rows when reading data. Depending on how much data gets skipped, these optimizations can speed up queries anywhere from not at all to 100x or more.
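Here is a toy model of both kinds of skipping, in plain Python. It is not Parquet's actual on-disk layout or API; `RowGroup`, `make_row_group`, and `scan` are hypothetical names, and the sketch only assumes that each row group keeps per-column min/max values:

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a row group: columnar data plus min/max stats."""
    columns: dict  # column name -> list of values
    stats: dict    # column name -> (min, max)


def make_row_group(rows):
    """Pivot a list of row dicts into columns and record min/max per column."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return RowGroup(columns, stats)


def scan(row_groups, select, where_col, lower_bound):
    """Return values of `select` for rows where `where_col` >= lower_bound."""
    out, groups_read = [], 0
    for group in row_groups:
        _, col_max = group.stats[where_col]
        if col_max < lower_bound:
            continue  # row skipping: the stats prove no row here can match
        groups_read += 1
        # Column skipping: only the two referenced columns are touched.
        for value, key in zip(group.columns[select], group.columns[where_col]):
            if key >= lower_bound:
                out.append(value)
    return out, groups_read


rows = [{"id": i, "amount": i * 10, "note": f"row {i}"} for i in range(9)]
groups = [make_row_group(rows[i:i + 3]) for i in range(0, 9, 3)]
matches, groups_read = scan(groups, select="id", where_col="amount", lower_bound=60)
print(matches, groups_read)
```

With three row groups of three rows each, the filter `amount >= 60` only needs to open the last group, and the `note` column is never read at all.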

Data Lakehouse storage systems abstract the file metadata to a separate layer, which is even better than storing it in the file footer like Parquet does.

This DuckDB functionality is cool, but I think it's best to use it to convert JSON files to Parquet / a Lakehouse storage system, and then query them. JSON is a really inefficient file format for running queries.
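A rough sketch of that conversion from Python, assuming the third-party `duckdb` package is installed. The file names `events.json` and `events.parquet` and the helper names are hypothetical; the SQL relies on DuckDB's `read_json_auto` table function and its `COPY ... TO ... (FORMAT PARQUET)` statement:

```python
def build_copy_sql(json_path, parquet_path):
    """Build a DuckDB COPY statement that rewrites a JSON file as Parquet."""
    def quote(path):
        # Escape single quotes so the path is a valid SQL string literal.
        return "'" + path.replace("'", "''") + "'"

    return (
        f"COPY (SELECT * FROM read_json_auto({quote(json_path)})) "
        f"TO {quote(parquet_path)} (FORMAT PARQUET)"
    )


def json_to_parquet(json_path, parquet_path):
    """Run the conversion; needs the duckdb package and an existing input file."""
    import duckdb  # imported here so the SQL helper works without duckdb installed

    duckdb.execute(build_copy_sql(json_path, parquet_path))


sql = build_copy_sql("events.json", "events.parquet")
print(sql)
```

Calling `json_to_parquet("events.json", "events.parquet")` once and then pointing later queries at the Parquet file means the JSON parsing cost is paid a single time instead of on every query.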