by mashraf
4377 days ago
I have never used Samza but have built similar pipelines using Kafka, Storm, Hadoop, etc. In my experience you almost always have to write your transformation logic twice, once for batch and once for real time, and with that setup Jay's setup looks exactly like the Lambda Architecture, with your stream processing framework doing both the real-time and the batch computation. Using a stream processing framework like Storm may be fine when you are running exactly the same code for both real time and batch, but it breaks down in more complex cases where the code is not exactly the same. Let's say we need to calculate the Top K trending items over the last 30 minutes, one day, and one week. We also know that a simple count will always make socks and underwear trend for an ecommerce shop, and Justin Bieber and Lady Gaga for Twitter (http://goo.gl/1SColQ). So we use a count-min sketch for real time and a slightly more complex ML algorithm for batch using Hadoop, and merge the results at the end. IMO, training and running complex ML is not currently feasible on the streaming frameworks we have today, so we can't use them for both real time and batch. Edited for typos.
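For readers unfamiliar with the real-time half of that setup, here is a minimal count-min sketch in Python. It is only an illustration of the data structure, not the code from the pipeline described above: the class name, the `width`/`depth` defaults, the choice of BLAKE2b as the hash, and the `top_k` helper are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        # width and depth are illustrative defaults, not tuned values.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent hash per row, derived by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._cells(item))


def top_k(sketch, candidates, k):
    """Rank a known candidate set by estimated count (hypothetical helper)."""
    return sorted(candidates, key=sketch.estimate, reverse=True)[:k]


sketch = CountMinSketch()
events = ["socks", "socks", "socks", "hat", "scarf", "hat"]
for item in events:
    sketch.add(item)

trending = top_k(sketch, set(events), k=2)
```

Note that the sketch alone does not fix the "socks always trend" problem; it only makes raw counting cheap in bounded memory, which is why the comment pairs it with a heavier batch model.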
I think in cases where you are running totally different computations in different systems, the Lambda Architecture may make a lot of sense.
However, one assumption you may be making is that the stream processing system must be limited to non-blocking, in-memory computations like sketches. A common pattern for people using Samza is actually to accumulate a large window of data and then rank it using a complex brute-force algorithm that may take 5 minutes or so to produce results.
One of the points I was hoping to make is that many of the limitations people think stream processing systems must have (e.g. can never block, can't process large windows of data, can't manage lots of state) have nothing to do with the stream processing model and are just weaknesses of the frameworks they have used.
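The accumulate-then-rank pattern mentioned above can be sketched in plain Python. This is not Samza code and uses no Samza API; `WindowedRanker`, its method names, and the in-memory list standing in for the job's managed state are all hypothetical, and the plain count is a placeholder for whatever expensive ranking the job would really run.

```python
from collections import Counter


class WindowedRanker:
    """Buffers events, then ranks everything inside a time window on demand."""

    def __init__(self, window_seconds, k):
        self.window_seconds = window_seconds
        self.k = k
        self.events = []  # (timestamp, item) pairs; stands in for durable local state

    def process(self, timestamp, item):
        # Called once per incoming message: just accumulate, no ranking yet.
        self.events.append((timestamp, item))

    def rank(self, now):
        # Called periodically: drop expired events, then rank the whole window.
        cutoff = now - self.window_seconds
        self.events = [(t, i) for t, i in self.events if t > cutoff]
        counts = Counter(item for _, item in self.events)
        # Placeholder for a slower, brute-force ranking over the window.
        return counts.most_common(self.k)


ranker = WindowedRanker(window_seconds=1800, k=2)
for ts, item in [(0, "a"), (1000, "b"), (1500, "b"), (1900, "c")]:
    ranker.process(ts, item)

result = ranker.rank(now=2000)  # [('b', 2), ('c', 1)]
```

The point of the split between `process` and `rank` is that the per-message path stays cheap while the periodic step is free to block for as long as the ranking takes.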