by Plough_Jogger 3482 days ago
We are implementing a very similar architecture, and have decided to use Avro for schema validation / serialization, rather than Parquet.

Does anyone have experience with both who can speak to their strengths / weaknesses?

3 comments

Avro is row-oriented, as said before; you should see it in the same category as Thrift and Protobuf, albeit a lot more flexible. But the gist of it is that it's more of a serialization format than a storage format, which Parquet is. Usually, when using Kafka or the Confluent Platform, I'd use Avro; for long-term storage and analytics, Avro isn't really suited. Instead use Parquet, or ORC if you're using Hive. With things like Spark, Impala or Presto, aggregation queries for ad hoc analytics are an order of magnitude more efficient and faster with Parquet than with Avro.
Parquet is a columnar storage format, whereas Avro is a row-oriented serialization framework. If you have lots of columns and want to perform ad hoc analysis, Parquet will be better than Avro, because a columnar layout lets a query read only the columns it needs instead of every full record.
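To make the row vs. column point concrete, here is a toy sketch in plain Python. It does not use the real Avro or Parquet libraries; the records, field names and helper functions are made up for illustration, and real files are binary formats with schemas and compression.

```python
# Toy illustration only: hypothetical records, not real Avro/Parquet I/O.
records = [
    {"user": "a", "country": "US", "amount": 10},
    {"user": "b", "country": "DE", "amount": 25},
    {"user": "c", "country": "US", "amount": 7},
]

# Row-oriented layout: the fields of each record are stored together.
row_layout = [(r["user"], r["country"], r["amount"]) for r in records]

# Column-oriented layout: the values of each field are stored together.
column_layout = {
    name: [r[name] for r in records] for name in ("user", "country", "amount")
}


def sum_amount_rows(rows):
    # Walks every full record just to reach one field.
    return sum(row[2] for row in rows)


def sum_amount_columns(columns):
    # Touches only the "amount" column; the others are never read.
    return sum(columns["amount"])


total_from_rows = sum_amount_rows(row_layout)
total_from_columns = sum_amount_columns(column_layout)
```

Both functions return the same total; the difference is how much data each has to scan, which is what matters once the table has many columns and many rows.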
Parquet may consume less space because it uses encodings such as delta encoding, run-length encoding and dictionary encoding. Also, a large number of tools support Parquet as a format, whereas Avro is Java- and Hadoop-centric.
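For anyone unfamiliar with the three encodings named above, a minimal sketch of the ideas in plain Python follows. These are simplified textbook versions, not Parquet's actual on-disk encodings (which are more involved), and the function names are hypothetical.

```python
def run_length_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def delta_encode(values):
    """Store the first value, then the difference to each previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


rle = run_length_encode(["US", "US", "US", "DE", "DE"])
dictionary, codes = dictionary_encode(["US", "DE", "US"])
deltas = delta_encode([1000, 1001, 1003, 1006])
```

All three work best when similar values sit next to each other, which is the case when a single column is stored contiguously.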
The other way around: Avro is supported by pretty much any language out there, while you can't even write a Parquet file in Python, and even reading one is pretty hard.