|
|
|
|
|
by buremba
3477 days ago
|
|
We actually have pretty similar architecture and use Presto for ad-hoc analysis, Avro is used for hot data and ORC is used as columnar storage at https://rakam.io. Similar to Slack, we have append-only schema (stored on Mysql instead of Hive), since Avro has field ordering the parser uses the latest schema and if it gets EOF in the middle of the buffer, fills the unread columns as null. We modified the Presto engine and built a real-time data warehouse, Avro is used when pushing data to Kafka, the consumers fetch the data in micro-batches, process and convert it to ORC format and save it to the both local SSD + AWS S3. |
|