HN Mirror


	by Lyngbakr 1262 days ago
	This has been a game changer for us. When our analysts run queries on Parquet files using Arrow, they are orders of magnitude faster than the equivalent SQL queries on databases.

2 comments

RobinL 1262 days ago

Author here! I've actually just written a separate blog post on a similar topic! https://www.robinlinacre.com/parquet_api/

Parquet seems to be on a path to become the de facto standard for storing and sharing bulk data - for good reason! (discussion: https://news.ycombinator.com/item?id=34310695)

the_black_hand 1262 days ago

Very cool post, thanks.

BeefWellington 1262 days ago

Were you working off proper data warehouses, or just the transactional db?

I ask because something a lot of people miss here is how much performance you can get from the T (transform) part of ETL. Denormalizing everything into big, simple, inflated tables can make queries orders of magnitude faster. It matters quite a bit what your comparison is against.
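To make the denormalization point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customers, orders, orders_wide) are made up for illustration and are not from anyone's actual schema. The join is done once as the transform step, so later aggregations read a single wide table.

```python
import sqlite3

# In-memory database with a tiny, hypothetical normalized schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);

    -- The "T" step: pay for the join once and store the wide result.
    CREATE TABLE orders_wide AS
        SELECT o.id AS order_id, o.amount, c.region
        FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

# Analyst query against the normalized tables: joins on every run.
normalized = conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()

# Same question against the denormalized table: a plain scan and group.
denormalized = conn.execute("""
    SELECT region, SUM(amount)
    FROM orders_wide
    GROUP BY region ORDER BY region
""").fetchall()

print(normalized == denormalized)
```

On three rows there is nothing to measure, of course. The sketch only shows the shape of the trade: more storage and a heavier load step in exchange for simpler, join-free reads.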

Lyngbakr 1262 days ago

We saw major improvements when we simply wrote full tables from a transactional database to Parquet, but also, as you say, modelling the data appropriately produced significant improvements, too.

benjaminwootton 1262 days ago

A column-oriented database is probably the bigger source of the performance increase. Parquet and a good data warehouse (something like ClickHouse, Druid or Snowflake) will both use metadata and efficient scans to power through aggregation queries.
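As a rough illustration of "metadata and efficient scans", here is a toy model in plain Python of how per-chunk min/max statistics let a reader skip data without touching it. Parquet keeps statistics like these per row group, but everything below (the RowGroup class, the function names, the chunk size) is a simplified assumption for illustration, not the real file format or any library's API.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """A chunk of one column plus the min/max metadata stored alongside it."""
    min_value: int
    max_value: int
    values: list


def make_row_groups(values, size):
    """Split a column into fixed-size chunks and record min/max for each."""
    groups = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def sum_where_at_least(groups, threshold):
    """Sum values >= threshold, skipping chunks the metadata rules out."""
    total = 0
    scanned = 0
    for group in groups:
        if group.max_value < threshold:
            continue  # no value in this chunk can match, so never read it
        scanned += 1
        total += sum(v for v in group.values if v >= threshold)
    return total, scanned


groups = make_row_groups(list(range(1000)), 100)
total, scanned = sum_where_at_least(groups, 950)
print(f"sum={total}, scanned {scanned} of {len(groups)} chunks")
```

Skipping works best when the data is sorted or clustered on the filtered column. With randomly ordered values, most chunks would span the whole range and few could be ruled out.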
