| Following the BottledWater-Pg link through to the Confluent blog [0], there's a hilarious illustration of what we're doing to ourselves. The first flow chart [1] is simple. It shows a user, an app, and three separate data stores that each serve a separate use case (db, cache, and index [assuming for OLAP workloads]). The chart is headlined with the adamant imperative "Stop doing this". Instead, Confluent suggests that we start doing this [2]: user -> app -> DB -> extraction -> Kafka -> (Index <-> Cache <-> HDFS) -> monitoring -> Samza. Ehhh, no thanks. I like the other option. We need to understand that good engineering is not about making more work for ourselves. It's about simplicity and elegance, and being able to accomplish complex tasks WITHOUT wrapping ourselves up into some intractable mega-contortion. More moving parts means more fragility and more waste. Simplicity means beauty, power, and flexibility. Now, I'm not suggesting that such architectures are never justified. I just want to highlight that the complexity should be eschewed, not celebrated. If you find yourself writing a blog post that converts a simple 3-step process into a complex 5-step, 9-destination process, alarm bells should be ringing, and you should be talking about why your organization (see Conway's Law) and/or the state of computer science sucks so bad that the 3-step process isn't good enough. [0] https://www.confluent.io/blog/bottled-water-real-time-integr... [1] https://cdn2.hubspot.net/hub/540072/file-3062873213-png/blog... [2] https://cdn2.hubspot.net/hub/540072/file-3062873223-png/blog... |
The problem of "I have to get a large portion of the DB into service X" is one I've worked on, and in my experience the initial solution is the more fragile of the two. It doesn't deal with back pressure. If a service goes down, it "loses" writes and must be resynced from a known-good state. And if, for whatever reason, data science sets up an HDFS cluster, I have to push writes there from my app as well.
With the second method, I don't have to use all of those services, and while I'm not given the same latency guarantees, I can be more sure that a given user's change will eventually end up in every service that cares about it.
Sure, if you only need to write to one DB, the Confluent method is overkill. However, if the simpler solution works for you, I'd imagine you haven't hit the volume and latency requirements that would make you seek out a solution like Confluent's anyway.
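To make the difference concrete, here is a toy sketch of the two approaches. It is purely illustrative: `ChangeLog`, `Consumer`, `dual_write`, and the two demo functions are hypothetical names, the log is an in-memory list standing in for something like a Kafka topic, and nothing here is Confluent's or Bottled Water's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Append-only list of changes; an in-memory stand-in for a Kafka topic."""
    entries: list = field(default_factory=list)

    def append(self, change):
        self.entries.append(change)


@dataclass
class Consumer:
    """A downstream service (index, cache, HDFS, ...) that tracks its own offset."""
    name: str
    offset: int = 0
    up: bool = True
    applied: list = field(default_factory=list)

    def poll(self, log):
        # A consumer that is down does not advance its offset, so nothing is
        # lost: it resumes from where it stopped once it is back up.
        if not self.up:
            return
        while self.offset < len(log.entries):
            self.applied.append(log.entries[self.offset])
            self.offset += 1


def dual_write(change, services):
    """First approach: the app pushes each change straight to every service."""
    for svc in services:
        if svc.up:
            svc.applied.append(change)
        # If the service is down, the write is simply dropped and the service
        # has to be resynced from a good state later.


def demo_dual_write():
    index, cache = Consumer("index"), Consumer("cache")
    cache.up = False                      # cache is down for the first two writes
    for change in ["a", "b"]:
        dual_write(change, [index, cache])
    cache.up = True
    dual_write("c", [index, cache])
    return index.applied, cache.applied


def demo_log():
    log = ChangeLog()
    index, cache = Consumer("index"), Consumer("cache")
    cache.up = False                      # same outage as above
    for change in ["a", "b"]:
        log.append(change)
        index.poll(log)
        cache.poll(log)
    cache.up = True
    log.append("c")
    index.poll(log)
    cache.poll(log)                       # cache catches up on "a" and "b" too
    return index.applied, cache.applied
```

In the dual-write demo the cache ends up missing the changes made while it was down; in the log demo it is merely late. That is the trade I'm describing: weaker latency guarantees in exchange for every change eventually arriving.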