Hacker News
by arwhatever 748 days ago
I often hear the argument in favor of event-driven architecture that you can work on one part of a system in isolation without having to consider the other parts, and then I get assigned some task which requires me to consider the entire system operation, now with events that are harder to trace than function calls would have been.

Now when people argue “because decoupling,” I hear, “You don’t get as much notification that you just broke a downstream system.”

9 comments

You need to improve your telemetry to feel the benefits. I can trace all the way through multiple services easily on a simple detailed flame graph in our systems.

https://www.datadoghq.com/knowledge-center/distributed-traci...
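
For anyone wondering what makes that kind of end-to-end trace possible, here is a rough sketch. It assumes a made-up in-process `publish`/`subscribe` bus (not Datadog's SDK or any real broker), and the `traceId` and `parentId` field names are my own. The idea is only that each event copies the trace ID of the event that caused it, so one filter recovers the whole chain.

```javascript
// Hypothetical in-memory bus; every name here is illustrative.
const handlers = new Map(); // event type -> array of handler functions
const eventLog = [];        // every published event, in publish order
let nextId = 1;

function subscribe(type, handler) {
  if (!handlers.has(type)) handlers.set(type, []);
  handlers.get(type).push(handler);
}

// `cause` is the event being handled when this one is published, if any.
function publish(type, payload, cause = null) {
  const id = nextId++;
  const event = {
    id,
    type,
    payload,
    traceId: cause ? cause.traceId : `trace-${id}`, // inherit or start a trace
    parentId: cause ? cause.id : null,
  };
  eventLog.push(event);
  for (const handler of handlers.get(type) ?? []) handler(event);
  return event;
}

subscribe('OrderPlaced', (e) => publish('PaymentRequested', e.payload, e));
subscribe('PaymentRequested', (e) => publish('PaymentCaptured', e.payload, e));

const root = publish('OrderPlaced', { orderId: 42 });
const trace = eventLog
  .filter((e) => e.traceId === root.traceId)
  .map((e) => e.type);
console.log(trace); // [ 'OrderPlaced', 'PaymentRequested', 'PaymentCaptured' ]
```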

Unless you have a single monolith, you’re going to face issues with versioning whether it’s event based or API based. In each case you can usually add new properties to a message, but you can’t remove properties or change their types. If you need that, create a new version.
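
A small sketch of that rule as a check you could run before publishing a new message version. The schema format (property name mapped to a type name) and the function name `breakingChanges` are assumptions for illustration, not any schema registry's API.

```javascript
// Reports removed or retyped properties; added properties are allowed.
function breakingChanges(oldSchema, newSchema) {
  const problems = [];
  for (const [name, type] of Object.entries(oldSchema)) {
    if (!(name in newSchema)) problems.push(`removed: ${name}`);
    else if (newSchema[name] !== type) problems.push(`retyped: ${name}`);
  }
  return problems;
}

const v1 = { orderId: 'string', total: 'number' };
const additive = { orderId: 'string', total: 'number', currency: 'string' };
const breaking = { orderId: 'number' };

console.log(breakingChanges(v1, additive)); // []
console.log(breakingChanges(v1, breaking)); // [ 'retyped: orderId', 'removed: total' ]
```

If the list is non-empty, that is the point where you would cut a new version instead.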

The author does a lot of videos on the event sourcing topic. Event driven I get. It works well in several applications I’ve helped to build over the last 15 years. But event sourcing? I truly don’t get it. Yeah, I get it’s nice in terms of auditing to see every change to an entity and who made it, or to replay up to change x on y date, but that really is a niche requirement.

> I can trace all the way through multiple services easily on a simple detailed flame graph in our systems

I'm not sure what point is being made here. It's good that you can do that - but are you implying that that's not possible in an API-driven system?

There are many examples of event sourcing but perhaps the most classical one is the bank account.

It's not just about auditing, it's also about transactionality and atomicity.

If you want to withdraw $5 from your account, the traditional approach of locking, updating everything, unlocking (or in other words wrapping everything in a transaction) doesn't scale as well as the notion that you just record the transaction (event). Implementation-wise this withdrawal can involve, updating two accounts and updating the audit/account transaction logs. We also want this to scale since our bank has millions of customers all operating more or less concurrently. A distributed log (like Kafka) is easy to scale and easy to reason about. You just insert the transaction record and you have a distributed system that will scale and is easy to reason about.
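
To make "just record the transaction" concrete, here is a toy sketch. It is single-process and in-memory, so it says nothing about Kafka or scaling, and the event names `Deposited` and `Withdrawn` are made up. It only shows the shape: state is never updated in place, and a balance is derived by folding over the log.

```javascript
// Append-only log of account events.
const log = [];

function record(type, account, amount, at) {
  log.push({ type, account, amount, at });
}

// The current (or historical) balance is a fold over the recorded events.
function balance(account, upTo = Infinity) {
  return log
    .filter((e) => e.account === account && e.at <= upTo)
    .reduce((sum, e) => sum + (e.type === 'Deposited' ? e.amount : -e.amount), 0);
}

record('Deposited', 'alice', 100, 1);
record('Withdrawn', 'alice', 5, 2);
record('Deposited', 'alice', 20, 3);

console.log(balance('alice'));    // 115
console.log(balance('alice', 2)); // 95
```

Note that nothing here stops an overdraft; checking that before appending is where the consistency questions start.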

Another driver/flavour for something like event sourcing is what some might call state-based or state-oriented programming. That is instead of modifying state directly you are synchronizing state via events. This lets you e.g. code state machines around those that can lead, again, to easier to reason about (and test) code.
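
A minimal sketch of that state-machine style, assuming a hypothetical order lifecycle (the states and event names are invented). State only ever changes by applying an event, which makes the transition table easy to test on its own.

```javascript
// Allowed transitions: state -> event type -> next state.
const transitions = {
  pending: { Paid: 'paid', Cancelled: 'cancelled' },
  paid: { Shipped: 'shipped', Refunded: 'cancelled' },
  shipped: {},
  cancelled: {},
};

// Pure function: returns the next state or rejects an invalid event.
function apply(state, event) {
  const next = transitions[state][event.type];
  if (next === undefined) {
    throw new Error(`${event.type} is not valid in state ${state}`);
  }
  return next;
}

const finalState = [{ type: 'Paid' }, { type: 'Shipped' }].reduce(apply, 'pending');
console.log(finalState); // shipped
```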

As far as I know banks in my country use transactions and IBM mainframes, not distributed systems.

I once spoke with some engineers at Monzo and at the time (2018) financial transactions were handled via a combination of Kafka and Cassandra.

Do you know for a fact that banks use event sourcing for transactions? I thought they were extremely eventually consistent.

I'm pretty sure different banks use vastly different architectures. Some run nightly batch jobs on mainframes that are written in COBOL. But alas, I've not worked for any bank, this is just a commonly used example. I am willing to bet the transaction log, or ledger, is indeed a very common approach since that's also common in accounting.

EDIT: Also event sourcing would typically be eventually consistent. I imagine for some banking applications a stronger consistency guarantee might be required, e.g. to prevent you from withdrawing the $100 in your account multiple times.

I’m sceptical, because I can’t find any examples of banks using it in production, just lots of blog posts by consultants and companies selling event sourcing solutions.

I did work somewhere that used a Kafka stack in production. It wasn’t a compelling use case and they spent almost an entire year on infra and productionizing it. It left me extremely sceptical about anything “big data streams” related :)

Fair enough. But I've used event sourcing in production in two companies. One project was a large scale distributed object store and the other was network equipment (like switches) and network management. The banking example is a classical one but I can't tell you if it's actually used in production. What I do know is that accounting software follows the ledger approach which has a similar spirit, recording transactions, and my guess would be regardless of technology banks also are transaction/event based as their source of truth (even in a COBOL mainframe batch processing scenario).
> Implementation-wise this withdrawal can involve, updating two accounts

Can't wait to read the incident report when one of the consumers successfully receives and reads the message but the other doesn't.

Depends what domain you work in. Auditability is a key/mandatory requirement in a lot of regulated industries.

There are of course other ways to do auditability.

Event Sourcing + Projections provide a nice way to build multiple models/views from the same dataset. This can provide a lot of simplification for client code.
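
A rough sketch of what "multiple views from the same dataset" means in practice. The cart events and both projection functions are invented for illustration; the point is that each view is just a different fold over the same list.

```javascript
const events = [
  { type: 'ItemAdded', sku: 'A', qty: 2, price: 10 },
  { type: 'ItemAdded', sku: 'B', qty: 1, price: 5 },
  { type: 'ItemRemoved', sku: 'A', qty: 1, price: 10 },
];

// Projection 1: quantity per SKU.
function quantityBySku(evts) {
  const view = {};
  for (const e of evts) {
    const sign = e.type === 'ItemAdded' ? 1 : -1;
    view[e.sku] = (view[e.sku] ?? 0) + sign * e.qty;
  }
  return view;
}

// Projection 2: cart total, built from the same events.
function cartTotal(evts) {
  return evts.reduce(
    (total, e) => total + (e.type === 'ItemAdded' ? 1 : -1) * e.qty * e.price,
    0,
  );
}

console.log(quantityBySku(events)); // { A: 1, B: 1 }
console.log(cartTotal(events));     // 15
```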

A niche requirement? There are big accounting firms who organize payrolls, where the changes that you mentioned are an important part of their business.

There are also companies whose services, on startup, load a snapshot and roll forward to the current time, because they need the data without having access to the database.
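
The snapshot-and-roll-forward idea can be sketched like this. The `seq` field, the snapshot shape and the function names are assumptions; a real setup would load both from storage.

```javascript
// Hypothetical event stream; `seq` is each event's position in the stream.
const stream = [
  { seq: 1, type: 'Deposited', amount: 100 },
  { seq: 2, type: 'Withdrawn', amount: 5 },
  { seq: 3, type: 'Deposited', amount: 20 },
  { seq: 4, type: 'Withdrawn', amount: 15 },
];

function applyEvent(state, e) {
  const delta = e.type === 'Deposited' ? e.amount : -e.amount;
  return { balance: state.balance + delta };
}

// Start from the snapshot and replay only the events recorded after it.
function restore(snapshot, events) {
  return events
    .filter((e) => e.seq > snapshot.seq)
    .reduce(applyEvent, snapshot.state);
}

const snapshot = { seq: 2, state: { balance: 95 } }; // state after events 1 and 2
const current = restore(snapshot, stream);
console.log(current.balance); // 100
```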

You don't need event sourcing to organize payrolls even "at scale".
> I can trace all the way through multiple services easily on a simple detailed flame graph in our systems.

That's not exactly an obscure feature exclusive to Datadog. Off the top of my head, both AWS and Azure support distributed tracing, with dedicated visualization support in their X-Ray and Application Insights services.

I doubt GP was suggesting it is unique to DD.
I think generally a lot of these types of problems were actually had by folks who grew out of single node systems and had a lot of interesting ideas to solve problems that were new in those domains, GIVEN they've already solved the stateful domain problems as well.

When you've never grown out of a single node domain but you do event driven "because scaling" or whatever, you've shot yourself in the foot amazingly hard.

Yes, events, async, eventual consistency, decoupling represent a difficult/complex solution for some hard problems encountered when scaling high.

But people often forget there are trade-offs to everything and if you don't have these hard problems, you're giving yourself only headaches.

My pet peeve is "decoupling": it's treated as holy, with only benefits and no downsides. But it's actually another level of complexity. Unless you need it, tightly coupled code will be easier to write, read, debug etc.

Like anything it can be abused and sometimes folks go overboard with turning everything into an event. However, when done right, it is really amazing to work with.

As an event producer as long as you follow reasonable backwards-compatibility best practices then you should be pretty safe from breaking things downstream. As a consumer, follow defensive programming and allow for idempotency in case you need to reprocess an event. Pretty straightforward once you get the hang of things.
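
A minimal sketch of the idempotency part, assuming events carry a unique `id` (my assumption) and using an in-memory `Set` where a real consumer would use durable storage.

```javascript
const processed = new Set(); // ids of events already handled
let emailsSent = 0;          // stand-in for the real side effect

function handleOrderPlaced(event) {
  if (processed.has(event.id)) return false; // duplicate delivery: do nothing
  emailsSent += 1;
  processed.add(event.id);
  return true;
}

const event = { id: 'evt-1', type: 'OrderPlaced' };
handleOrderPlaced(event);
handleOrderPlaced(event); // redelivered or replayed
console.log(emailsSent);  // 1
```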

> As an event producer as long as you follow reasonable backwards-compatibility best practices then you should be pretty safe from breaking things downstream.

That can protect you from "downstream can't even read the message anymore" but it doesn't help you with the much more common "downstream isn't doing the right thing with the message anymore" problem. Schema evolution is kinda like schema'd RPC calls vs plain JSON: it will protect you from "oops, we sent eventId instead of event_id" type of errors, but won't prevent you from making logical errors. In a larger org, this can turn into delayed-discovery nightmares.

A synchronous API call could give you back an error response and alert you immediately to something being wrong. The system notifies you directly.

A downstream event consumer may fail in ways entirely off of your team's radar. The downstream team starts getting alerts. Whether or not those alerts make it immediately obvious to them that it's your fault... that depends on a bunch of factors.
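
A toy illustration of the difference, with an in-memory array standing in for a queue and invented function names. The same bad order fails loudly for the synchronous caller, while the publisher only ever sees "accepted".

```javascript
// Shared business rule used by both paths.
function reserveStock(order) {
  if (order.qty > 10) throw new Error('insufficient stock');
  return 'reserved';
}

const queue = [];          // stand-in for a message broker
const consumerAlerts = []; // where the consuming team sees failures

function publishOrderPlaced(order) {
  queue.push(order);
  return 'accepted'; // the producer learns nothing about downstream outcome
}

function runConsumer() {
  while (queue.length > 0) {
    const order = queue.shift();
    try {
      reserveStock(order);
    } catch (err) {
      consumerAlerts.push(err.message);
    }
  }
}

const bad = { qty: 50 };

let callerSawError = false;
try { reserveStock(bad); } catch { callerSawError = true; }

const publishResult = publishOrderPlaced(bad);
runConsumer();

console.log(callerSawError); // true
console.log(publishResult);  // accepted
console.log(consumerAlerts); // [ 'insufficient stock' ]
```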

Data consumers in unknown teams are a nightmare regardless of the architecture.

Events sent for readership you can’t control are ideally of the type «x changed», and the consumer must then fetch data on the relevant endpoint.

That or the company must have serious versioning policies.

What's the benefit of using an asynchronous event driven system if you can't process any of those events without performing a synchronous query back on the same provider for all of the necessary data?

You get relevant notifications without polling or needing to sub, and you don’t have to be strict about message versioning.

Do you have pointers to such best practices? Gratefully received etc.

Design for things to be easily failable. It should be trivial to have failed messages go to a DLQ and then reprocess them later on, say after a bug fix.
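
Roughly what that looks like, with an in-memory array as the DLQ and invented handler names. A real broker would provide the dead-letter mechanism; this only shows the flow of fail, park, fix, reprocess.

```javascript
const deadLetters = [];
const handled = [];

function consume(event, handler) {
  try {
    handler(event);
  } catch (err) {
    deadLetters.push({ event, reason: err.message }); // park it, keep going
  }
}

// Drain the DLQ through a handler; anything that still fails is parked again.
function reprocess(handler) {
  const pending = deadLetters.splice(0, deadLetters.length);
  for (const { event } of pending) consume(event, handler);
}

// Buggy version assumes every event has a customer.
const buggyHandler = (e) => handled.push(e.customer.email.toLowerCase());
// Fixed version tolerates a missing customer.
const fixedHandler = (e) => handled.push(e.customer?.email.toLowerCase() ?? 'unknown');

consume({ id: 1, customer: { email: 'A@x.io' } }, buggyHandler);
consume({ id: 2, customer: null }, buggyHandler);
console.log(deadLetters.length); // 1

reprocess(fixedHandler);
console.log(deadLetters.length); // 0
console.log(handled);            // [ 'a@x.io', 'unknown' ]
```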

Only make additive changes, don't change existing fields. For enums it's up to the consumer to ensure they don't fail when a new case is added.
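
On the consumer side, the enum point boils down to always having a default branch. A small sketch with made-up status values:

```javascript
function describeStatus(event) {
  switch (event.status) {
    case 'CREATED':
      return 'Order created';
    case 'SHIPPED':
      return 'Order shipped';
    default:
      // The producer may add cases later; degrade gracefully instead of throwing.
      return `Order update (${event.status})`;
  }
}

// Extra fields (carrier) and unknown enum cases (RETURNED) are both tolerated.
console.log(describeStatus({ status: 'SHIPPED', carrier: 'acme' })); // Order shipped
console.log(describeStatus({ status: 'RETURNED' }));                 // Order update (RETURNED)
```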

Be very careful with including data (especially time/expiry stuff) in the message too. If you need to reprocess the event several hours later then it may no longer work or be stale. Rather than include the data in the message itself, we would include the database ID and then have the consumer query for that entry.
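
A sketch of the ID-only approach, with a `Map` standing in for the database and a hypothetical coupon event. Because the consumer reads the row at processing time, a late replay sees current data rather than whatever was true when the event was published.

```javascript
// Stand-in for the real database.
const db = new Map([[7, { id: 7, expiresAt: 1000 }]]);

// The event carries only the ID; `now` is passed in to keep the sketch deterministic.
function handleCouponIssued(event, now) {
  const coupon = db.get(event.couponId); // fetch current state
  if (!coupon) return 'missing';
  return now < coupon.expiresAt ? 'applied' : 'expired';
}

const issued = { type: 'CouponIssued', couponId: 7 };

console.log(handleCouponIssued(issued, 500));  // applied
console.log(handleCouponIssued(issued, 2000)); // expired (reprocessed much later)
```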

Rich Hickey's talk about "growth" (as opposed to change) of software systems is a good one for this.

Tldr: ok to add things. Not ok to remove things or change things

> now with events that are harder to trace than function calls would have been

I don't know how this could be true. Events are things - nouns which can be backed-up, replicated, stored, queried, rendered, indexed and searched over.

How is it not true? Instead of tracking data and function calls over a unified stacktrace, you track Things and Messages over databases, queues, and logs, none of which you can trivially attach a debugger to.

I generally like event-driven architecture, but I need to admit that debuggability is sacrificed where it matters most.

There's no "find usages" for events, and it becomes harder to find out why something didn't happen. A function call can't simply "not return" - in the worst case you get an exception, or a stuck thread in the caller that will show up in a stack dump. But downstream event processing can very easily just not happen, for one of many different reasons, and out-of-the-box it's often difficult to investigate.
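
One cheap mitigation, sketched with an invented in-memory bus: record every event that was emitted with nobody listening, so "just not happening" at least leaves a trail. This is an assumption about how you might instrument your own dispatcher, not a feature of any particular broker.

```javascript
const listeners = new Map(); // event type -> array of callbacks
const unhandled = [];        // events emitted with zero listeners

function on(type, fn) {
  if (!listeners.has(type)) listeners.set(type, []);
  listeners.get(type).push(fn);
}

function emit(type, payload) {
  const fns = listeners.get(type) ?? [];
  if (fns.length === 0) unhandled.push({ type, payload });
  for (const fn of fns) fn(payload);
  return fns.length; // number of listeners that received the event
}

on('user.created', () => {});

console.log(emit('user.created', { id: 1 })); // 1
console.log(emit('user.craeted', { id: 2 })); // 0 (typo in the name)
console.log(unhandled.map((e) => e.type));    // [ 'user.craeted' ]
```
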
> A function call can't simply "not return"

Remember "callback hell"? Assumption that a function call returns after running to completion requires rather specific synchronous cascading architecture, which WILL break in multithreaded code. Most of the multithreaded function calls will set a flag in shared memory and return early, expecting caller to poll.

If your API is based on single entry-point `invokeMethod(callee, method)` it is equally untraceable to event entry point `fireEvent(producer, event)`.

> Most of the multithreaded function calls will set a flag in shared memory and return early, expecting caller to poll.

Which is exactly switching from function calls to event-driven architecture, and the problems with that are exactly the problems we're talking about.

The problems you describe are inherent to indirect invocations and are related to event-driven architectures only because typical event dispatching architecture is built on non-blocking calls.

You do not even need return-early (non-blocking) semantics for these problems to manifest. You can implement a giant string-keyed vtable for all methods in your program (or use a language with advanced reflection capabilities) and will have exactly the same problems. Namely there probably won't be tooling to match caller-callee pairs, which is the core issue here.

In JavaScript,

  const
    myEvent = 'myEvent',
    target = new EventTarget()

  // EventTarget registers listeners with addEventListener(); it has no .on() method
  target.addEventListener( myEvent, () => {
    console.log( "It's easy to introspect well-organized code." )
  })

  target.dispatchEvent( new Event( myEvent ))

Yeah, good luck remembering to do that up-front for every event handler. You missed one? Whoops, enjoy the information you wanted silently not being there when you need it.

Remembering to do what? Properly maintain a list of constants and enums to use throughout my application?

That's not something I have to remember or forget, it's a simple habit that is as natural as importing and referencing a function.

As a general rule, numbers and string literals should never be hardcoded. Internalizing this should be a base expectation of any high-performing team member.

What I like about event driven is that you don't even need to know if anyone is listening to or cares about your event.

And as a consumer, many independent tasks can be triggered by the same event.

I'm working on a system right now and because of events, it's very easy for me to write a handler for when a certain type of record is created in the database. My feature depends on knowing that new record was made so we can send some emails and do other things.

The people that wrote the code that creates the record, didn't have to do anything to support the feature.
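
In code the shape is roughly this, using the standard `EventTarget`. The `RecordCreated` class and the `record-created` event name are invented for the sketch.

```javascript
// Event subclass carrying the new record's id.
class RecordCreated extends Event {
  constructor(recordId) {
    super('record-created');
    this.recordId = recordId;
  }
}

const bus = new EventTarget();
const actions = [];

// Two features subscribe independently; the producer knows about neither.
bus.addEventListener('record-created', (e) => actions.push(`email for ${e.recordId}`));
bus.addEventListener('record-created', (e) => actions.push(`audit for ${e.recordId}`));

// Producer code: creates the record and announces it, nothing more.
function createRecord(id) {
  bus.dispatchEvent(new RecordCreated(id));
}

createRecord(1);
console.log(actions); // [ 'email for 1', 'audit for 1' ]
```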

But I agree that it's not the right solution for every problem. But there are certain problems it solves really well.

> you don't even need to know if anyone is listening to or cares about your event.

Right up until you need to change something about the event because the business logic it represents has changed. Then you suddenly need to track down all the systems that have been relying on it, including that one that nobody knows anything about and always forgets exists because some guy decided to implement the service in erlang and nobody who ever touched it even works at the company anymore.

How is that any different for an API-driven architecture? You'd need to track down all consumers of your API you're wanting to make a breaking change to.
I really dislike this argument, because it puts the duty of managing dependencies and requirements directly on code. This is an organizational issue!

First, if your event (or whatever) changes enough that there are inter-component breakages, it means engineering requirements must have changed, and tracing dependencies of requirements is an organizational thing.

Second, you either do trunk-based development and constantly break downstream, or do leaf-based development and constantly have out-of-date core dependencies. In any case, that's release version management, which is again an organizational thing.

And that's why a SAGA describes that flow.
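
For readers who haven't met the term, a very reduced sketch of the saga idea: each step comes with a compensating action, and a failure undoes the completed steps in reverse. This is an in-process toy with invented names; real sagas span services and persist their progress.

```javascript
function runSaga(steps) {
  const done = [];
  for (const step of steps) {
    try {
      step.run();
      done.push(step);
    } catch (err) {
      // Undo completed steps, most recent first.
      for (const prev of done.reverse()) prev.compensate();
      return { ok: false, failedAt: step.name };
    }
  }
  return { ok: true };
}

const accounts = { a: 100, b: 0 };

const steps = [
  {
    name: 'debit',
    run: () => { accounts.a -= 5; },
    compensate: () => { accounts.a += 5; },
  },
  {
    name: 'credit',
    run: () => { throw new Error('destination unavailable'); },
    compensate: () => {},
  },
];

const result = runSaga(steps);
console.log(result);   // { ok: false, failedAt: 'credit' }
console.log(accounts); // { a: 100, b: 0 }
```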

Don't take it into consideration and you're fucked.

Source: previous "seniors" didn't take it into consideration, they left

> now with events that are harder to trace than function calls

Same issue as microservices: there are people who want to use the paradigm but not do the investment in monitoring/tooling.

Amen. Event-driven architecture makes it easier to bury your head in the sand, and harder to implement an actually-working feature.
Integration tests?
...happen too late