by ora600
3538 days ago
According to the article, Yelp had 7 different data sources and a similar number of targets. If they wrote a loader for each source/target pair, they'd end up with 7 x 7 = 49 loaders, not to mention 7 more loaders to write every time they add an app. With Kafka, they just need to connect each thing to Kafka: 7 + 7 = 14 connectors instead of 49. This is pretty much the scenario Kafka was invented for, and you get stream processing for free:
https://engineering.linkedin.com/distributed-systems/log-wha...
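To make the arithmetic concrete, here is a rough sketch of the two counts. The function names are made up for illustration, and it assumes every source feeds every target, which the article may not strictly require:

```python
def point_to_point_loaders(sources: int, targets: int) -> int:
    """One dedicated loader per (source, target) pair."""
    return sources * targets


def hub_connectors(sources: int, targets: int) -> int:
    """One connector per system when everything goes through a central log."""
    return sources + targets


def loaders_for_new_source(targets: int) -> int:
    """Extra point-to-point loaders needed when one more source is added."""
    return targets


if __name__ == "__main__":
    n = 7  # sources and targets, per the figures quoted above
    print("point-to-point loaders:", point_to_point_loaders(n, n))
    print("hub connectors:", hub_connectors(n, n))
    print("loaders per new source (point-to-point):", loaders_for_new_source(n))
```

The gap widens as systems are added: the pairwise count grows with the product, the hub count with the sum.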
|
However you get the data into Redshift, Redshift will create columns from JSON (within some restrictions around nested arrays, which generally have to be avoided). Given the current state of AWS services, Kafka is a step that wastes processing and time in almost every Redshift loading scenario. Test for yourself over a few billion messages at various message sizes from 1k to 1M, if you get the chance.
Kafka is great as a message queue if you can't write to S3 directly, or as a buffer to deal gracefully with S3 hiccups, for high-frequency throughput to Redshift.