by jetblackio 3154 days ago
Not sure if this is a good place to ask, but how do Apache Arrow and Parquet compare to Apache Kudu (https://kudu.apache.org/)? Seems like all three are columnar data solutions, but it's not clear when you'd use one over the other.

Kind of surprised the article didn't mention Kudu for that matter.


I have been working full-time on Kudu since its early development. As others have mentioned, Arrow and Kudu are quite different. Despite the controversial-sounding title of Daniel Abadi's article, his content was actually reasonable and his conclusion in the final paragraph of the article is worth reading. In summary, he acknowledges that in-memory and on-disk columnar formats have different goals and both have their place (Arrow being an in-memory format).

Apache Kudu is much more than a file format: it is a columnar distributed storage engine. One way to think of Kudu is as mutable Parquet, but really it's a database backend that integrates with Impala and Spark for SQL, among other systems. It's fault tolerant and secure, manages partitioning for you, and much more. For a quick introduction to Kudu you can check out this short slide deck I put together over a year ago. It's a bit dated but still a good overview: https://www.slideshare.net/MichaelPercy3/intro-to-apache-kud...

For more up-to-date information, follow the Apache Kudu Blog at http://kudu.apache.org/blog/ or follow the official Apache Kudu twitter account @ApacheKudu.

This covers the distinction a bit better. https://www.slideshare.net/HadoopSummit/the-columnar-era-lev...
The agenda slide says Kudu is mutable on disk while Parquet is immutable on disk.

Right on, this is perfect. Thanks!

One quick note on this: Kudu is a storage implementation (similar to Parquet in some ways). Arrow isn't about persistence and is actually built to be complementary to both Kudu and Parquet.

Also note: Kudu is a distributed process. Arrow and Parquet are libraries that can be embedded into your existing applications.
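
To make the mutable vs. immutable distinction from upthread a bit more concrete, here is a toy sketch using only the Python standard library. It does not use the real Arrow, Parquet, or Kudu APIs. The JSON file only stands in for a write-once columnar file, and `write_immutable`, `update_by_rewrite`, and `MutableColumnStore` are hypothetical names made up for illustration.

```python
import json
import os
import tempfile

# In-memory columnar layout: one list per column. This is only the rough idea
# behind an in-memory columnar format, not Arrow's actual memory layout.
columns = {"id": [1, 2, 3], "score": [10, 20, 30]}


def write_immutable(path, cols):
    """Write every column in one pass; the file is never edited afterwards."""
    with open(path, "w") as f:
        json.dump(cols, f)


def update_by_rewrite(path, row_id, new_score):
    """Changing one value in a write-once file means producing a new file."""
    with open(path) as f:
        cols = json.load(f)
    cols["score"][cols["id"].index(row_id)] = new_score
    new_path = path + ".v2"
    write_immutable(new_path, cols)
    return new_path


class MutableColumnStore:
    """Hypothetical stand-in for a mutable storage engine: update by key, in place."""

    def __init__(self, cols):
        # Copy the columns so the caller's data is left untouched.
        self.cols = {name: list(values) for name, values in cols.items()}
        # Map each key to its row position so an update is a direct lookup.
        self.index = {key: pos for pos, key in enumerate(self.cols["id"])}

    def update(self, row_id, new_score):
        self.cols["score"][self.index[row_id]] = new_score

    def scan(self, name):
        return list(self.cols[name])


# Immutable path: the original file stays as written, the update lands in a new file.
path = os.path.join(tempfile.mkdtemp(), "data.json")
write_immutable(path, columns)
new_path = update_by_rewrite(path, 2, 99)

# Mutable path: the same update is applied in place.
store = MutableColumnStore(columns)
store.update(2, 99)
```

The real systems are far more involved than this, but the trade-off has the same shape: with an immutable file an update means writing a new file, while a mutable store has to maintain a key lookup and handle in-place changes for you.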