by mattbillenstein 3374 days ago
I've been building some bits of this, although with (I think) much simpler components -- our event pipeline is nginx+lua -> nsq -> python consumer (transform, etc) -> bigquery. I also have nsq_to_file hanging off of this to archive both event formats (raw and bigquery). It is surprisingly little code, and I'm happy with its performance, although we are at a smaller scale. I've never been happy with the hadoop ecosystem or the java tooling around it -- it's all designed to be scalable and fault tolerant, but it seems like it's always broken if you actually run one of these things.
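
For illustration, the consumer step (transform, then hand rows to the warehouse) might look roughly like the sketch below. This is not the actual code from the pipeline above: the event fields (name, ts, user_id, props), the Batcher class, and the sink callable are all hypothetical. In a real setup, add() would presumably be called from the NSQ message handler and the sink would be whatever streams rows into BigQuery.

```python
import json
from datetime import datetime, timezone


def transform(raw_body: bytes) -> dict:
    """Turn one raw JSON event into a flat, warehouse-friendly row.

    The input field names here are assumptions, not a real schema.
    """
    event = json.loads(raw_body)
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return {
        "event_name": event["name"],
        "user_id": str(event.get("user_id", "")),
        "timestamp": ts.isoformat(),
        # Keep free-form properties as a JSON string column.
        "properties": json.dumps(event.get("props", {}), sort_keys=True),
    }


class Batcher:
    """Collect transformed rows and pass them to a sink in batches."""

    def __init__(self, sink, batch_size=500):
        self.sink = sink              # callable that takes a list of rows
        self.batch_size = batch_size
        self.rows = []

    def add(self, raw_body: bytes) -> None:
        self.rows.append(transform(raw_body))
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.rows:
            self.sink(self.rows)
            self.rows = []


if __name__ == "__main__":
    # Stand-in sink that just prints; swap in a real warehouse writer.
    batcher = Batcher(sink=print, batch_size=2)
    batcher.add(b'{"name": "click", "ts": 0, "user_id": 7}')
    batcher.add(b'{"name": "view", "ts": 60, "props": {"page": "/"}}')
    batcher.flush()
```

Batching is there because streaming inserts are generally cheaper per row when grouped; the batch size of 500 is an arbitrary placeholder.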

Regarding the ETL, the thing I haven't figured out yet is what to do with the data that's not events. We do daily/hourly bulk exports of those tables, so they're not real-time in the data warehouse. This is mostly ok, but I'd love a magic bullet that let me stream these updates into BigQuery as well.

In any case -- nice blog post -- nice to see how others are doing it.