Hacker News
by Lyngbakr 1262 days ago
This has been a game changer for us. When our analysts run queries on Parquet files using Arrow, they are orders of magnitude faster than the equivalent SQL queries on our databases.
2 comments

Author here! I've actually just written a separate blog post on a similar topic! https://www.robinlinacre.com/parquet_api/

Parquet seems to be on a path to become the de facto standard for storing and sharing bulk data - for good reason! (discussion: https://news.ycombinator.com/item?id=34310695)

very cool post. thanks
Were you working off proper data warehouses, or just the transactional db?

I ask because something a lot of people miss here is how much performance you can get from the T part of ETL. Denormalizing everything into big simple inflated tables makes things orders of magnitude faster. It matters quite a bit what your comparison is against.
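To make the denormalization point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customers, orders, orders_wide) are made up for illustration, and the toy data is far too small to show any speedup; it only shows the shape of the transform: pay for the join once, then aggregate against the wide table with no join at query time.

```python
import sqlite3

# In-memory database with a tiny, hypothetical normalized schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);

    -- The "T" step: do the join once and store the wide result.
    CREATE TABLE orders_wide AS
        SELECT o.id AS order_id, o.amount, c.region
        FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

# Analyst query against the normalized tables: joins on every run.
normalized = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()

# Same question against the denormalized table: a single-table scan.
denormalized = conn.execute("""
    SELECT region, SUM(amount)
    FROM orders_wide
    GROUP BY region ORDER BY region
""").fetchall()

print(normalized)
print(denormalized)
```

Both queries return the same totals per region. Any fair benchmark of "Parquet vs. the database" would need to say which of these two shapes the database side was using.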

We saw major improvements when we simply wrote full tables from a transactional database to Parquet, but, as you say, modelling the data appropriately produced significant improvements too.
A column-oriented database is probably the bigger performance increase. Parquet and a good data warehouse (something like ClickHouse, Druid or Snowflake) will both use metadata and efficient scans to power through aggregation queries.
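A rough illustration of the "metadata and efficient scans" part, as a toy model in plain Python rather than any real engine's implementation. The Chunk class and both helper functions are hypothetical. The idea is that a column is stored in chunks, each chunk carries min/max statistics, and a filtered aggregation can skip any chunk whose statistics rule it out without reading its values.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A slice of one column plus min/max statistics (toy stand-in for chunk metadata)."""
    values: list
    lo: int
    hi: int


def build_chunks(column, size):
    """Split a column into fixed-size chunks and record min/max for each."""
    chunks = []
    for i in range(0, len(column), size):
        part = column[i:i + size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def sum_where_at_least(chunks, threshold):
    """Sum values >= threshold, skipping chunks the statistics exclude.

    Returns (total, number_of_chunks_actually_scanned).
    """
    total, scanned = 0, 0
    for chunk in chunks:
        if chunk.hi < threshold:
            continue  # metadata alone proves no value here can match
        scanned += 1
        total += sum(v for v in chunk.values if v >= threshold)
    return total, scanned


# Sorted data makes the statistics maximally selective; this is an assumption
# of the example, not a property of real datasets.
column = list(range(1000))
chunks = build_chunks(column, 100)
total, scanned = sum_where_at_least(chunks, 950)
print(total, scanned, len(chunks))
```

With sorted data, only one of the ten chunks is scanned. On unsorted data the min/max ranges overlap and far fewer chunks can be skipped, which is one reason data layout matters as much as file format.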