by vallode 739 days ago
The slides from the Spark AI 2020 summit [1] helped me understand this a bit. If I get it correctly, the premise is that a specific format is used to organise information into efficient blocks where related column data "lives" closer together, enabling faster read speeds but worse write speeds.
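
To make the layout idea concrete, here is a minimal sketch in plain Python. It is not the actual on-disk format from the slides; the toy table, the field names, and the `append_columnar` helper are all made up for illustration. It flattens the same three records row by row and column by column, then shows what adding one record does to the column-oriented version.

```python
# Hypothetical toy table; field names and values are illustrative only.
rows = [
    {"id": 1, "price": 10.0, "tax": 1.0},
    {"id": 2, "price": 20.0, "tax": 2.0},
    {"id": 3, "price": 30.0, "tax": 3.0},
]
fields = ["id", "price", "tax"]

# Row-oriented: all fields of one record sit next to each other.
row_layout = [r[f] for r in rows for f in fields]

# Column-oriented: all values of one field sit next to each other.
column_layout = [r[f] for f in fields for r in rows]


def append_columnar(layout, record, fields, n_rows):
    """Return a new column-major list with one extra record.

    Every column block needs a value added at its end, so the write
    touches one position per column instead of a single spot.
    """
    out = []
    for i, f in enumerate(fields):
        out.extend(layout[i * n_rows:(i + 1) * n_rows])  # existing column block
        out.append(record[f])                            # new value for this column
    return out


new_record = {"id": 4, "price": 40.0, "tax": 4.0}
row_after = row_layout + [new_record[f] for f in fields]  # one append at the end
column_after = append_columnar(column_layout, new_record, fields, len(rows))
```

In the row layout the new record is a single append at the end, while the column layout gets a new value in each column block. As far as I understand, real columnar formats soften this by buffering many rows and writing them in batches, which is also where the loss of data freshness comes from.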

If someone has more resources on the topic, I'd be very interested. There are many applications where sacrificing data freshness for a considerable uptick in performance is alluring.

[1]: https://www.slideshare.net/slideshow/the-apache-spark-file-f...

2 comments

> enabling faster read speeds but worse write speeds

Write speeds will probably decrease, because the main data layout is heavily optimized for reads.

But the goal is to speed up queries like "select sum(price) - sum(tax) from orders" at the cost of queries like "select * from orders where id = 1".
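
A rough sketch of why those two query shapes pull in opposite directions, using plain Python lists and dicts as a stand-in for storage. The `orders` data and all four function names are hypothetical, and this ignores compression, indexes, and everything else a real engine does.

```python
# Hypothetical "orders" table; the data is made up for illustration.
orders_rows = [
    {"id": 1, "price": 10.0, "tax": 1.0, "customer": "a"},
    {"id": 2, "price": 20.0, "tax": 2.0, "customer": "b"},
    {"id": 3, "price": 30.0, "tax": 3.0, "customer": "c"},
]

# Same data stored as one list per column.
orders_cols = {name: [r[name] for r in orders_rows] for name in orders_rows[0]}


def aggregate_columnar(cols):
    # select sum(price) - sum(tax) from orders
    # Only the two needed columns are touched; "id" and "customer" are never read.
    return sum(cols["price"]) - sum(cols["tax"])


def aggregate_rows(rows):
    # Same query on row storage: every record has to be visited,
    # even though only two of its fields are used.
    return sum(r["price"] for r in rows) - sum(r["tax"] for r in rows)


def lookup_rows(rows, order_id):
    # select * from orders where id = 1
    # The matching record is already stored as one unit.
    return next(r for r in rows if r["id"] == order_id)


def lookup_columnar(cols, order_id):
    # Same query on column storage: find the position once,
    # then gather one value from every column to rebuild the record.
    i = cols["id"].index(order_id)
    return {name: values[i] for name, values in cols.items()}
```

Both layouts give the same answers; the difference is how much unrelated data has to be read along the way. The aggregate skips whole columns in the columnar version, while the `select *` lookup has to stitch a record back together from every column.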

This lecture series on columnar storage formats and querying them is great. https://youtu.be/1hdynBJo3ew?si=5KfT_2qpUFQmy_uL
Thank you!