by mattbillenstein
3374 days ago
I've been building some bits of this, although with what I think are much simpler components. Our event pipeline is nginx+lua -> nsq -> python consumer (transform, etc.) -> bigquery. I also have nsq_to_file hanging off of this to archive both event formats, raw and bigquery.

It is surprisingly little code, and I'm happy with its performance, although we are at a smaller scale. I've never been happy with the hadoop ecosystem or the java tooling around it. It's all designed to be scalable and fault tolerant, but it seems like it's always broken if you actually run one of these things.

Regarding the ETL, the thing I haven't figured out yet is what to do with the data that's not events. We do daily/hourly bulk exports of those tables, so that data isn't real-time in the data warehouse. This is mostly ok, but I'd love a magic bullet that let me stream these updates into BigQuery as well.

In any case, nice blog post. It's nice to see how others are doing it.
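To give a feel for how little code the "python consumer (transform, etc.)" stage needs, here is a rough sketch, not the actual consumer. The event fields (`ts`, `type`, `user_id`, `props`), the output column names, and the `BatchingSink` helper are all made up for illustration. It uses only the standard library; the NSQ reader and the BigQuery insert call are left out and would be wired in where `handle_message` and `flush_fn` are.

```python
import json
from datetime import datetime, timezone


def transform(raw_body: bytes) -> dict:
    """Turn one raw event (JSON bytes) into a flat, warehouse-style row.

    The field names (ts, type, user_id, props) are hypothetical.
    """
    event = json.loads(raw_body)
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return {
        "event_time": ts.isoformat(),
        "event_type": event["type"],
        "user_id": event.get("user_id"),
        # Keep free-form properties as a single JSON string column.
        "props": json.dumps(event.get("props", {}), sort_keys=True),
    }


class BatchingSink:
    """Buffers rows and hands them to `flush_fn` in batches.

    `flush_fn` is an assumption: in a real consumer it would be whatever
    function streams a list of rows into the warehouse.
    """

    def __init__(self, flush_fn, batch_size=500):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.rows:
            self.flush_fn(self.rows)
            self.rows = []


def handle_message(raw_body, sink):
    """Per-message handler: transform, buffer, and report success."""
    try:
        row = transform(raw_body)
    except (ValueError, KeyError, TypeError):
        # Malformed event; the caller decides whether to requeue or drop it.
        return False
    sink.add(row)
    return True
```

Batching matters here mostly because per-row inserts get slow and expensive; a real consumer would probably also want a timer-based flush so a quiet topic doesn't leave rows sitting in the buffer.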