It often surprises me that Parquet is not widely known outside of data engineering circles. Most software developers are still mucking around with CSV for large tabular data, which is absolutely the wrong format. Better developers use SQLite, which is less wrong but still wrong. Postgres is closer to the right answer, but for very large, typed tabular data, Parquet is the way to go.

Parquet is a columnar format that is compressed, typed, and efficient for column-oriented queries. It is append-friendly at the dataset level (you add new files rather than rewriting existing ones, since a written file is immutable), and it is natively supported by Apache Spark. Parquet libraries are now available for most languages, which didn't use to be the case. I query Parquet files with DuckDB in Python and it blazes through GBs of data in seconds.

At work, the canonical format is TSV (tab-separated values), which, despite being human-readable, is huge, inefficient to query, and does not support data types.

When I have to work with large TSV files (10GB or larger), I first convert them into 500MB Parquet files. The latter are faster, smaller and less prone to type errors. Because columnar formats like Parquet store each column separately and keep min/max statistics per row group, a query engine can read only the columns it needs and skip row groups that cannot match a filter. That lets me do complex operations like joins, window functions and aggregations in a performant way, while any similar operation on TSV files has to scan and re-parse the whole file each time.

I recently ran a Spark job on a very large TSV file; it ran for over 8 hours and then timed out. A Spark job on the same data stored as Parquet completed in 5 minutes.