Drivetribe’s Modern Take on CQRS with Apache Flink (data-artisans.com)
83 points by wints 3389 days ago
7 comments

I like this architecture a lot but I don't know how "modern" it is. Event sourcing around a message bus is relatively common.

In fact you don't even need that many moving pieces. Akka persistence does most of this out of the box: http://doc.akka.io/docs/akka/current/scala/persistence.html

Blue green deployments are a good addition here that many systems don't seem to be able to handle but the article doesn't really get into the details here.

It's nice to see documentation on a real-world implementation of this architecture, and if anyone is interested in more detail, I'd suggest this book[1] on my to-read list (it details a very similar approach).

I've done SOA and n-tier DDD, and I'd love a chance to build something like this, but as an enterprise developer, I have to operate quite a bit behind the curve (13 years later, and 90% of my co-workers have never heard of Evans' book). Still, selling new approaches to management gets easier if you can point to existing production systems like this, so thanks for posting this, wints.

[1] https://www.manning.com/books/functional-and-reactive-domain...

Honestly this sounds like overcomplicating a system just for the sake of it. Drivetribe can be implemented reliably in Django or Rails without any of these fancier "architectures". But then considering that the product itself is not innovative, perhaps they needed something to sell to their naive recruits.
Hi. I am the author of the article. Thank you for spending time to read this. The combined reach of the co-founders is very large, so being able to provably handle scale was an essential prerequisite. Additionally, the requirements of the platform extend way beyond a simple content server. Content performance is tracked in real time and fed to multiple ranking and recommendation models. Those change frequently, so we need a way to retroactively process our data. Flexibility is key when trying to build an intelligent platform. We therefore decided early on to invest time in the ability to quickly iterate and experiment on algorithms, in real time over live data. You are right that the API fleet could be implemented using the aforementioned technologies; we use Scala and thus decided to use Akka HTTP instead. The challenging part is how you manage state behind that.
Don't get me wrong, I am not saying that a CQRS/stream-processing style approach is useless for every application. Rather, it is unsuitable for this particular problem.

In my experience all these features sound nice on paper. But you quickly run into practical issues that are far easier when you know approximate information about the state.

E.g. developing a model? You might just want a subset/batch of the data. Doing BI/analytics? Are you going to continuously tax your server to recompute? The argument about recommender systems is also honestly flimsy, having built and applied such systems to live traffic at very large scale (more than hundreds of millions of users). There is only a small advantage from being able to quickly reconfigure flows. In most cases you have a single baseline model which you compare against for a small fraction of the traffic. The real complexity/gains in recommender systems lie in the choice of algorithm/hyper-parameters/features, not in continuous multi-armed bandits with 1000 different models applied simultaneously while waiting an infinite amount of time to produce any statistically meaningful answer. In fact, for a website like this one, recommender systems can only provide so much advantage.

There are actually several really good specialized use cases, e.g. Google secmon-tools uses a system like this one.

[1] https://web.stanford.edu/class/cs259d/lectures/Session11.pdf

You mention the word "batch" when talking about models. Also "BI/Analytics". Since Django/Rails applications support neither of the two, another sort of system would be needed. This is the point where, having built everything on Django, with no foresight whatsoever about future requirements, we would have ended up creating DataFrames from SQL tables in Spark. Our BI guys have no experience with Spark, so we would need to load data into a DW-like solution, like BigQuery/Redshift/Impala/Presto/you-name-it. Instead of another sink in Flink, we would need to implement and schedule ETL jobs. Even at our current load, computing counters (e.g. likes) at read time would be slow and inefficient, which means we would need a way to pre-aggregate them. Maybe another service, possibly behind a queue? You can see where I am going. As requirements evolve, systems evolve, and with no planning beforehand, people end up with spaghetti architectures. We knew we were funded enough to run for a couple of years. We knew the site would have traffic. We were tasked with delivering an algorithmically driven product, and this is the solution we came up with.
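To make the read-time versus pre-aggregated counter point concrete, here is a minimal sketch. It is not Drivetribe's code: the event shape, the `LikeCounter` class, and the sample data are all hypothetical, and a real system would keep this state in a stream processor rather than in a Python object.

```python
from collections import Counter

# Hypothetical stream of (user, post) "like" events.
like_events = [("ann", "post-1"), ("bob", "post-1"), ("ann", "post-2")]


def likes_at_read_time(events, post):
    """Count likes by scanning the full history on every read."""
    return sum(1 for _, liked_post in events if liked_post == post)


class LikeCounter:
    """Pre-aggregated view: updated once per event, read with a single lookup."""

    def __init__(self):
        self._counts = Counter()

    def on_like(self, user, post):
        # Work is done at write time, as each event arrives.
        self._counts[post] += 1

    def likes(self, post):
        # Reads never touch the event history.
        return self._counts[post]


counter = LikeCounter()
for user, post in like_events:
    counter.on_like(user, post)
```

Both paths give the same answer; the difference is that the scan grows with the history on every request, while the pre-aggregated view pays a small constant cost per event.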

I really do not understand how such a strong set of conclusions can be drawn out of so little information.

Can you share some details around numbers and volume? "Very large" does not really convey why going down this route makes sense.
Unfortunately, I am not allowed to. The problem is that you cannot predict the volumes beforehand. 1K requests per second? 10K per second? Maybe 50K per second on special occasions? It is difficult to tell, especially when high-profile personalities are involved.

PS: we do have lots of load

The main point is that this new architecture has many interesting advantages over the standard Rails & database approach.

By logging all the interaction events and computing application state via a stream processor, you get all the benefits of a log-centric architecture. Building different services as "views" over the history of interactions, forking the stream to test new applications, replaying data to test different recommender models, ramping up models on historic data and falling in with real-time data for continuous updates: that all becomes the same thing.

And for many use cases, you get better performance on less hardware because you have a simpler concurrency model.
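A rough sketch of the "views over the history" idea, in plain Python rather than Flink. The `Event` type, the event names, and the two fold functions are invented for illustration; the only point is that every view is the same replay with a different step function.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical event names: "liked", "followed"
    user: str
    target: str


def replay(events: Iterable[Event], step: Callable[[S, Event], S], initial: S) -> S:
    """Fold a step function over the log to derive a view."""
    state = initial
    for event in events:
        state = step(state, event)
    return state


def like_counts(state: dict, e: Event) -> dict:
    # View 1: number of likes per post.
    if e.kind == "liked":
        return {**state, e.target: state.get(e.target, 0) + 1}
    return state


def followers(state: dict, e: Event) -> dict:
    # View 2: set of followers per user.
    if e.kind == "followed":
        return {**state, e.target: state.get(e.target, frozenset()) | {e.user}}
    return state


log = [
    Event("liked", "ann", "post-1"),
    Event("followed", "bob", "ann"),
    Event("liked", "bob", "post-1"),
    Event("liked", "ann", "post-2"),
]

likes_view = replay(log, like_counts, {})
followers_view = replay(log, followers, {})

# "Ramp up on historic data, then fall in with live data": same fold, resumed.
historic = replay(log[:2], like_counts, {})
caught_up = replay(log[2:], like_counts, historic)
```

Adding a new service or testing a new model then amounts to writing another step function and replaying the same log through it.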

My thought exactly. The article doesn't give any numbers on how many customers, messages, "tribes", queries per sec, or anything. This market just doesn't sound very big, and it's probable that a plain old CRUD app would handle it just fine.
Honestly, so what? Something got built that wasn't the x-millionth Rails app/clone and it made it into production. That's a good thing. Your comment seems to advocate never trying anything new.
That's possible. But it's also possible that the product could be huge and complicated later on, in which case they have laid a very strong and capable groundwork, and we should be applauding their foresight in avoiding the creation of technical debt.
Event Sourcing + CQRS is an extremely scalable and flexible system, but at the same time it can overcomplicate things. I'm not sure it was required here.

Interesting to see it used outside the .Net world.

"This is fairly straightforward; professional software developers of all levels usually have some experience building and scaling stateless APIs."

I would question this, in fact suggest the complete opposite. Given that they say they're a team of Scala developers, I think there is definitely some selection bias going on -- in some subset of the wider population of developers that might be the case, but certainly not globally.

I'm not clear on why you would use redis to get transactions. Wouldn't sending a single message to kafka that represents the whole transaction with acks=all have the same effect? Then instead of pulling from redis like it is a queue, you can just do that from kafka.
This is exactly what we did initially. Really early on (before any rate limiting was in place), a few spam accounts followed 100K people, created lots of spam content, etc. Encapsulating those deletions started yielding messages bigger than the default max Kafka message size (1MB). Additionally, this method had a few side effects on the downstream processors. We could of course increase the limit, but we decided to deal with the problem at its core.
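For a sense of the arithmetic, here is a back-of-the-envelope sketch of how a single "whole transaction" message grows with the number of follows being deleted. Everything about the encoding is an assumption (JSON payload, UUID-style ids, made-up field names, a limit of roughly 1 MB); the actual message format is not described in the thread.

```python
import json
import uuid

# Assumption: roughly the ~1 MB default mentioned above; the real limit is configurable.
MAX_MESSAGE_BYTES = 1024 * 1024


def transaction_size(n_follows: int) -> int:
    """Encoded size in bytes of one message that deletes n_follows follow edges."""
    # Hypothetical ids: deterministic UUIDs, 36 characters each.
    ids = [str(uuid.UUID(int=i)) for i in range(n_follows)]
    payload = {"type": "delete_follows", "ids": ids}  # hypothetical field names
    return len(json.dumps(payload).encode("utf-8"))


small = transaction_size(1_000)      # an ordinary account
spammer = transaction_size(100_000)  # the 100K-follow spam account

fits_small = small <= MAX_MESSAGE_BYTES
fits_spammer = spammer <= MAX_MESSAGE_BYTES
```

Under these assumptions each id costs about 40 bytes, so 100K follows lands at several megabytes, well past a 1 MB cap, while a typical account stays far below it.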
Seems over complicated to me. Why do they need so many components?