by xyzzy_plugh 2126 days ago
Parquet is kind of a royal pain in the ass compared to CSV/JSON/plaintext mostly because it uses a ton of Thrift encodings, resulting in mostly terrible/broken implementations anywhere outside of the Java/JVM ecosystem. If you're running Apache <Whatever> then sure, it'll probably be fine, but I'd recommend avoiding it if you start having to go down the rabbit hole of implementing support for things in your language du jour.

The Rust and Python implementations are fine. But I get it, Parquet may not be perfect or optimal or whatever. It works as a simple, typed, columnar format.

We had to pick a single file format recommendation that covers sending 100GB+ tables over FTP servers or Dropbox, scanning terabytes of useless stuff only to grab a key-value pair, and properly reading integer and UTF-8 columns. Turns out Parquet is practical enough for users to start using it instead of CSV. It could have been Avro, but it's just not as easy.
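
To make the "properly reading integer and UTF-8 columns" point concrete, here is a minimal stdlib-only sketch of what a CSV round trip does to types. The table contents are made up for illustration; nothing here comes from a real dataset.

```python
import csv
import io

# Hypothetical two-column table: an integer id and a UTF-8 name.
rows = [(1, "Zoë"), (20, "李雷")]

# Write the table as CSV, then read it straight back.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
buf.seek(0)
round_tripped = [tuple(r) for r in csv.reader(buf)]

# CSV carries no schema: every value comes back as str, so the reader
# has to guess which columns were integers.
types_after = [type(v).__name__ for v in round_tripped[0]]
```

A Parquet file stores its schema in the file footer, so a reader gets the column types back without guessing.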

> But I get it, Parquet may not be perfect or optimal or whatever.

I actually think Parquet is pretty great in practice; I just have some issues with the sheer volume of abstractions necessary to implement it. I wish it used anything other than Thrift.

I would probably choose Parquet over anything else, though.
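
For a sense of what the Thrift dependency means for an implementer, here is a rough sketch of the very first steps of reading an unencrypted file: find the footer length at the end of the file, then start decoding Thrift compact protocol integers (zigzag varints) from the footer bytes. The function names are mine, not from any library, and this is nowhere near a full reader.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # trailing magic bytes of an unencrypted Parquet file


def footer_length(tail: bytes) -> int:
    """Return the size of the Thrift-encoded footer, given the end of a file."""
    if len(tail) < 8 or tail[-4:] != MAGIC:
        raise ValueError("missing trailing PAR1 magic")
    # A 4-byte little-endian length sits right before the trailing magic.
    (length,) = struct.unpack("<I", tail[-8:-4])
    return length


def read_zigzag_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one zigzag varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry payload
        if not byte & 0x80:               # high bit clear: last byte
            break
        shift += 7
    # Undo zigzag: even values map to non-negative, odd to negative.
    return (result >> 1) ^ -(result & 1), pos
```

Everything after this point (field headers, nested structs, lists, the schema tree) is more of the same Thrift decoding, which is where the volume of abstractions comes from.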