| HN Mirror

(disclaimer: I work on the Data Pipeline project @ Yelp - but these opinions are my own and not necessarily representative of Yelp)

@jack9 - If I am understanding your point correctly, it is that you could just use S3 directly as the 'unified buffer' and the Kafka part is unnecessary. I'll try to shed some light on why we made Kafka part of this infrastructure.

- Regarding "near real-time" this is our cheeky way of saying we internally haven't set an SLA or a formal definition of the maximum latency in the system for us to declare it 'real time'. Hard numbers wise from the time a MySQL[0] event happens to a transformed version of it being in redshift is roughly in the realm of 10-30 seconds (unofficial and not a guarantee of course, just what I am anecdotally seeing at this point, and we are certain we can bring that down).

- As ora600 points out, we not only have multiple data sources, but also multiple data targets. So it's important to us to reduce the IN*OUT down to IN+OUT. It's important to note that nearly all of these connection points more naturally 'speak' in streaming than batch processing.

- Kafka has been fantastic tech for this use case for us: aside from this s3/redshift OUT connector we generally are dealing with connectors which want to be streaming. In-fact if it were performant to do so we would love to stream message-by-message into redshift as well (it's not).

- We already have a lot of infrastructure for batch processing - specifically a service for bulk loading from s3 to redshift with Mycroft[1] (it's open source[2]), and many systems (especially older systems which wrote to s3 in order to do Map-Reduce processing with MRJob[3], also open source[4]) have been able to use it for this purpose - In many ways we previously had used s3 as our 'unified buffer', but it's not a great solution for stream processing.

- Finally regarding the ETLs:

- - Yep we have them, and we hate them. The Data Pipeline project, in a sense, almost originated as a way for us to get rid of them. That's certainly why my team has been building the Data Pipeline components (we are 'the data warehouse team' in Yelp). They are costly: Each ETL is 'handwritten' (adding a new table is writing a new ETL, granted these are very small classes as we have a decent framework, but it still takes manual effort), this requires a 'Yelp-Main' push (Yelp has talked openly about our efforts to break apart our monolith[5], but it's still a thing and the push process isn't as painless as with our micro services), and inputs from other data sources was challenging as our ETL framework was designed around the 'Yelp-Main' codebase (it's old, cut us slack ;)).

- - A component of the Data Pipeline is what we call the 'Application Specific Transformers'[6] (we call em ASTs for short, much to the confusion of any compiler writers out there) which allow us to apply the types of transformations we typically did in ETLs (outputting both an 'raw id' and the 'encoded id' which is served to the front end, splitting bit-flag ints into separate booleans, and stringifying int enumerations - as well as extra stuff ETLs didn't do like extracting documentation for Watson[6])

It's worth noting as well that the Data Pipeline lives in PaaSTA[7] which is a big improvement from how the legacy components it is replacing were deployed.

All this being said: we thought carefully about meeting all the needs of our systems, but our needs are of course our own. This certainly would be the wrong tool for many other use-cases.

Sorry for slow response - I've been out on PTO and just got back to work today. :)

[0] - https://engineeringblog.yelp.com/2016/08/streaming-mysql-tab...

[1] - https://engineeringblog.yelp.com/2015/04/mycroft-load-data-i...

[2] - https://github.com/Yelp/mycroft

[3] - https://engineeringblog.yelp.com/2010/10/mrjob-distributed-c...

[4] - https://github.com/Yelp/mrjob

[5] - https://engineeringblog.yelp.com/2015/03/using-services-to-b...

[6] - https://engineeringblog.yelp.com/2016/08/more-than-just-a-sc... (Includes some info about the ASTs and Watson)

[7] - https://engineeringblog.yelp.com/2015/11/introducing-paasta-...