| HN Mirror



	by Plough_Jogger 3482 days ago
	We are implementing a very similar architecture, and have decided to use Avro for schema validation / serialization, rather than Parquet. Does anyone with experience of both want to speak to their strengths / weaknesses?

3 comments

samkone 3482 days ago

Avro is row-oriented, as said before; you should think of it as being in the same category as Thrift and Protobuf, albeit a lot more flexible. But the gist of it is that it's a serialization format more than a storage format, which is what Parquet is. Usually, when using Kafka or the Confluent Platform, I'd use Avro, but for long-term storage and analytics Avro isn't really suited. Instead use Parquet, or ORC if you're using Hive. With things like Spark, Impala or Presto, aggregation queries for ad hoc analytics are an order of magnitude more efficient and faster with Parquet than with Avro.


buremba 3482 days ago

Parquet is a columnar storage format, whereas Avro is a row-oriented serialization framework. If you have lots of columns and want to perform ad hoc analysis, Parquet will be better than Avro, because a columnar layout lets a query read only the columns it actually touches.
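
To make the row-versus-column point concrete, here is a rough sketch in plain Python. It does not use Avro or Parquet at all: JSON bytes stand in for both on-disk layouts, and the records, field names and helper functions are made up for illustration. It only shows why summing one field touches far fewer bytes when values are grouped by column.

```python
import json

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"user_id": 1, "country": "DE", "amount": 10.0},
    {"user_id": 2, "country": "US", "amount": 25.5},
    {"user_id": 3, "country": "DE", "amount": 4.5},
]


def to_row_layout(rows):
    """Row-oriented stand-in: one blob per record, all fields together."""
    return [json.dumps(r).encode() for r in rows]


def to_column_layout(rows):
    """Columnar stand-in: one blob per field, all values of that field together."""
    names = list(rows[0].keys())
    return {n: json.dumps([r[n] for r in rows]).encode() for n in names}


def sum_amount_rows(row_blobs):
    """Every full record has to be read and decoded to reach one field."""
    bytes_read = sum(len(b) for b in row_blobs)
    total = sum(json.loads(b)["amount"] for b in row_blobs)
    return total, bytes_read


def sum_amount_columns(column_blobs):
    """Only the 'amount' blob is read; the other columns are never touched."""
    blob = column_blobs["amount"]
    return sum(json.loads(blob)), len(blob)


row_total, row_bytes = sum_amount_rows(to_row_layout(records))
col_total, col_bytes = sum_amount_columns(to_column_layout(records))
print(row_total, row_bytes)
print(col_total, col_bytes)
```

Both calls return the same total, but the columnar version reads only the bytes of the one column it needs. The gap grows with the number of columns, which is the "lots of columns" case described above.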


maxnevermind 3482 days ago

Parquet may consume less space because it uses encodings such as delta encoding, run-length encoding and dictionary encoding. Also, a large number of tools support Parquet as a format, whereas Avro is Java- and Hadoop-centric.
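
For anyone unfamiliar with the three encodings named above, a minimal pure-Python sketch of the general ideas follows. These are textbook versions, not Parquet's actual on-disk encodings (which are bit-packed and more involved), and the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer code into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index = {}        # value -> position in dictionary
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def delta_encode(values):
    """Store the first value, then the difference between each value and the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    running = 0
    for d in deltas:
        running += d
        out.append(running)
    return out


print(run_length_encode(["DE", "DE", "DE", "US", "US"]))
print(dictionary_encode(["DE", "US", "DE", "DE"]))
print(delta_encode([1000, 1001, 1003, 1006]))
```

All three work best when similar values sit next to each other, which is what a columnar layout provides: a column of country codes or sorted timestamps has long runs and small deltas, while the same values interleaved with other fields in a row layout do not.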


andrioni 3482 days ago

The other way around: Avro is supported by pretty much any language out there, while you can't even write a Parquet file in Python, and even reading one is pretty hard.
