by buremba 3477 days ago
We actually have a pretty similar architecture at https://rakam.io and use Presto for ad-hoc analysis. Avro is used for hot data and ORC for columnar storage. Similar to Slack, we have an append-only schema (stored in MySQL instead of Hive). Since Avro fields are ordered, the parser uses the latest schema, and if it hits EOF in the middle of the buffer, it fills the unread columns with null. We modified the Presto engine and built a real-time data warehouse: Avro is used when pushing data to Kafka, and the consumers fetch the data in micro-batches, process it, convert it to ORC format, and save it to both local SSD and AWS S3.
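
To make the "latest schema plus null-fill on EOF" idea concrete, here is a minimal sketch. It assumes a toy wire format (fixed-width longs and length-prefixed strings), not Avro's actual binary encoding, and the names `encode`, `decode`, `SCHEMA_V1` and `SCHEMA_V2` are hypothetical, not taken from Rakam. It only illustrates why an append-only, ordered schema lets old records be read with a newer schema.

```python
import io
import struct


def encode(record, schema):
    """Write values in schema order (toy format, NOT real Avro encoding)."""
    out = io.BytesIO()
    for name, ftype in schema:
        value = record[name]
        if ftype == "long":
            out.write(struct.pack(">q", value))
        elif ftype == "string":
            data = value.encode("utf-8")
            out.write(struct.pack(">I", len(data)))
            out.write(data)
        else:
            raise ValueError(f"unsupported type: {ftype}")
    return out.getvalue()


def decode(buf, latest_schema):
    """Read fields in the order of the latest schema.

    Once the buffer is exhausted, every remaining (newer) field is set
    to None, which mimics filling the unread columns with null.
    """
    stream = io.BytesIO(buf)
    record = {}
    for name, ftype in latest_schema:
        if stream.tell() >= len(buf):
            record[name] = None  # written before this field existed
        elif ftype == "long":
            (record[name],) = struct.unpack(">q", stream.read(8))
        elif ftype == "string":
            (size,) = struct.unpack(">I", stream.read(4))
            record[name] = stream.read(size).decode("utf-8")
        else:
            raise ValueError(f"unsupported type: {ftype}")
    return record


# The schema only ever grows at the end, so old bytes stay a valid prefix.
SCHEMA_V1 = [("user_id", "long"), ("event", "string")]
SCHEMA_V2 = SCHEMA_V1 + [("country", "string")]

old_bytes = encode({"user_id": 42, "event": "click"}, SCHEMA_V1)
decoded = decode(old_bytes, SCHEMA_V2)
```

Here `decoded` holds `user_id` and `event` from the old record and `None` for `country`. The trick only works because fields are never removed or reordered, which is what the append-only constraint guarantees.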

Are you using Avro by your own choice or because of Confluent's toolset (which uses Avro on Kafka)?
We tried Avro, Thrift and Protobuf, and Avro was our choice. The schema of collections in Rakam is dynamic, and with both Thrift and Protobuf, schema evolution is not that easy at runtime. Avro is easier to use in Java and doesn't require code generation, and its dynamic classes are optimized for performance, so it's a better option for us.
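
For readers wondering what a dynamic schema evolved at runtime might look like, here is a rough sketch under stated assumptions. The `AppendOnlySchema` class and its `evolve` method are hypothetical illustrations, not Rakam's or Avro's API, and the schema lives in memory here rather than in a database. The point is that new fields are discovered from incoming events and appended, with no generated classes involved.

```python
class AppendOnlySchema:
    """Ordered field list for one collection; fields are only ever appended."""

    def __init__(self):
        self._fields = []  # ordered (name, type) pairs

    def fields(self):
        # Return a copy so callers cannot reorder or drop fields.
        return list(self._fields)

    def evolve(self, incoming):
        """Append unseen fields from an incoming event's {name: type} map.

        Existing fields keep their position; a type conflict is rejected
        instead of rewriting history.
        """
        known = dict(self._fields)
        for name, ftype in incoming.items():
            if name not in known:
                self._fields.append((name, ftype))
                known[name] = ftype
            elif known[name] != ftype:
                raise TypeError(
                    f"field {name!r} is {known[name]}, cannot change to {ftype}"
                )
        return self.fields()


pageview = AppendOnlySchema()
pageview.evolve({"user_id": "long", "event": "string"})
# A later event carries a new field; it is appended at the end.
evolved_fields = pageview.evolve({"event": "string", "country": "string"})
```

After the second call, `evolved_fields` lists `user_id`, `event` and `country` in that order, so records written under the earlier field list remain a prefix of the current one.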