| HN Mirror


by vallode 739 days ago

The slides from the Spark AI 2020 summit [1] helped me understand this a bit. If I get it correctly, the premise is that a specific format is used to organise information into efficient blocks where related column data "lives" closer together, enabling faster read speeds but worse write speeds.

If someone has more resources on the topic, I'd be very interested. There are many applications where sacrificing data freshness for a considerable uptick in performance is alluring.

[1]: https://www.slideshare.net/slideshow/the-apache-spark-file-f...

2 comments

marcosdumay 739 days ago

> enabling faster read speeds but worse write speeds

Write speeds will probably decrease, because every write has to maintain a heavily optimized layout.

But the goal is to speed up queries like "select sum(price) - sum(tax) from orders" at the cost of queries like "select * from orders where id = 1".
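To make the trade-off concrete, here is a rough sketch in plain Python. It is not how Spark or any real columnar format is implemented; the `orders` data, the column names beyond `id`, `price` and `tax`, and all function names are made up for illustration. It only shows why an aggregate touches two columns in a columnar layout, while fetching one full row has to pull a value out of every column.

```python
# Toy "orders" table. All values and the "customer" column are hypothetical.
rows = [
    {"id": 1, "price": 10.0, "tax": 1.0, "customer": "a"},
    {"id": 2, "price": 20.0, "tax": 2.0, "customer": "b"},
    {"id": 3, "price": 30.0, "tax": 3.0, "customer": "c"},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def net_revenue_columnar(cols):
    # select sum(price) - sum(tax) from orders
    # Only the "price" and "tax" columns are read; the others are never touched.
    return sum(cols["price"]) - sum(cols["tax"])


def net_revenue_rows(rs):
    # Same query on the row layout: every row object is visited
    # even though only two of its fields are needed.
    return sum(r["price"] for r in rs) - sum(r["tax"] for r in rs)


def fetch_row_columnar(cols, order_id):
    # select * from orders where id = 1
    # Find the position in the "id" column, then read that position
    # from every column to reassemble the row.
    i = cols["id"].index(order_id)
    return {name: values[i] for name, values in cols.items()}


def fetch_row_rows(rs, order_id):
    # Same query on the row layout: the whole row is already stored together.
    return next(r for r in rs if r["id"] == order_id)
```

In a real format the columns would live in separate compressed blocks on disk, so skipping a column means skipping I/O, which is where most of the read speedup comes from.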


wanderinglight 739 days ago

This lecture series on columnar storage formats and querying them is great. https://youtu.be/1hdynBJo3ew?si=5KfT_2qpUFQmy_uL


vallode 739 days ago

Thank you!
