Hacker News new | ask | show | jobs
Ask HN: What are best patterns for events in distributed transactions?
42 points by olivermrbl 1275 days ago
I am amid researching events/messaging patterns to use in a microservice architecture with distributed transactions. The goal is to reliably process events published by multiple services - potentially with different data stores.

We've already decided to use Sagas to orchestrate the distributed transactions, and now I want to find an event framework that complements Sagas well.

For context, we are building an open-source commerce engine striving to be highly extensible. We provide developers with the primitives to create sophisticated commerce experiences and allow them to replace the different domains (PIM, Inventory, etc.) with their own microservices. Hence, maintaining a great developer experience is paramount to us in technical decision-making.

It seems the industry standard is to use a Transactional Outbox Pattern (unless you go all in on event sourcing), but I believe this increases the complexity for external developers building microservices as they would need to understand the outbox pattern and expose one themselves to ensure the events system of our engine is fully functional.

I've been working on a solution that uses a shared cache across microservices to store events in an ongoing distributed transaction - effectively a shared outbox. The transaction orchestrator of the Saga would be responsible for processing the cached events upon successfully committing - or invalidate in case an error is thrown. This solution reduces complexity and boilerplate code for developers. Has anyone heard of a similar pattern? I haven't been able to find anything.

I am curious to learn how others have solved the problem of events in distributed transactions.

8 comments

I'm not sure what your application is, but Temporal [1] could be a good fit. The nice thing is that it can cover all three aspects in one conceptual model: Sagas, distributed transactions, and events.

With Temporal, you write your various complex, potentially long-running actions as normal, synchronous code, which Temporal "magically" turns into distributed, asynchronous, automatically retried, idempotent jobs. The design is really elegant, and removes all of the hard work from writing code that must keep things in sync and handle failures through queueing and retrying. In one sense, Temporal is a way to develop sagas — but also anything else.

Part of the beauty is that you can use it to build the transactional outbox pattern in a much simpler way. You simply emit the event at the end of your workflow. There's no need to make sure the "outbox" is maintained in a database transaction, because Temporal itself is transactional and ensures that the outbox event happens, no matter what failures occur.

Whether that gels with your needs to cater to external developers, I don't know. Are those developers actually working in your codebase, or are you just providing an API?

Either way, you may want to look at Temporal for inspiration, if nothing else. It solves some complex problems in an extremely elegant way. I consider it a must-have in any modern application stack.

[1] https://temporal.io/

We've built something similar at https://www.inngest.com — we allow you to build durable serverless functions that react to events (or run on a schedule).

The good thing about us is that you don't need to learn about events, subscribers, backoffs, idempotency etc. Instead, you write a single line and can deploy reliable event-driven functions to any provider, even at the edge. Most people are afraid of tackling event-driven architectures because it's too hard to set up, debug, build, make reliable, deploy - all the standard stuff that's frustrating.

When I say durable, I mean:

- Functions get their own state, retries, idempotency, throttling, etc.

- You can create steps within functions which have their own retries, similar to Temporal

- You can eg `sleep`, `sleepUntil` within a serverless function to handle queues or delayed jobs

- You can `waitForEvent` within functions to wait for other events which match arbitrary expressions. This lets you do magic things like "when a user adds something to their cart, wait for the checkout event from this user for 1 day. If the event is null/not received, send a reminder email". This is hard to do out of the box, but with us it's built in and takes a single line of code.

- We also store each event so you can do things like local replay, event versioning, schema management & governance, etc.

When you mention "distributed transactions", there's many ways to go. For us, we're working on CDC so that you can atomically react to actual DB transactions. For everything else, durability and retries becomes important.

There's a *lot* to talk about here. I'd love to discuss more in depth - shoot me an email if it's of interest! :)

Thanks for chiming in!

Although I appreciate your detailed walkthrough of Inngest (sounds amazing), I wonder if it is the right fit for us.

One of our guiding product principles is to always build framework agnostic solutions, allowing our users to choose the best tools and providers for their needs. We do this by creating abstractions for each functional area of our domain and element of the stack.

In the context of events infrastructure, developers should be able to use Redis, RabbitMQ, Kafka, and even Inngest interchangeably.

I would love to hear more about your experience with CDC. As far as I understand, this is an alternative to simply polling the outboxes. Will shoot you an email :)

Yep, Inngest does something similar — it can layer over any event stream (NATS, Kafka, GCP Pub/Sub, SQS/SNS, etc.) in order to subscribe to events & trigger functions. We provision that for so you don't need to manage anything when you sign up to us, though it's definitely flexible :)

That said you still need a state store and queue capable of future publishing internally to make things work; if you want to run anything durably, you're going to need to manage state. That's still flexible but a little less so — you need to choose both a queue and a backing state store to make that work.

Have you looked into CRDTs (Conflict Free Replicated Data Types)? Here is a paper on them: https://arxiv.org/pdf/1805.06358.pdf

With CRDTs, it becomes much easier to perform concurrent updates in a distributed system and decouples each replica from each other, since state can always merge to reach the same state.

This Rust crate has good documentation with more explanation (even if you don’t use Rust): https://crates.io/crates/crdts

Take a look, should be clear how they fit in to a distributed event based system w microservices.

I recommend this course:

Distributed data patterns in a Microservice architecture

http://chrisrichardson.net/virtual-bootcamp-distributed-data...

The boring question people tend to overlook is "How do I do database backup and restore?".

If you did everything with distributed transactions, and had a single data store that supported it, you could have all backups be at the same transaction checkpoint. When you restored, everything would be in a consistent state.

If everyone is managing their own data in different data stores, and just listening to Kafka or similar, you have to deal with the inconsistent state after a DB restore somehow.

I dont have an answer, I'm just curious about something. I thought that the point of an outbox was that it was local, and could therefore be updated atomically along with any local db changes. What would happen if the orchestrator was unable to update the remote shared cache outbox after a local exception/rolled back transaction?
> What would happen if the orchestrator was unable to update the remote shared cache outbox after a local exception/rolled back transaction

To be sure, I understand your question, you are referring to the following scenario:

- Start global transaction

- Create shared cache entry

- Add events to the shared cache

- Local transaction fails

- Orchestrators starts rolling back local transaction

- Attempts to delete shared cache entry <-- THIS ONE FAILS

To ensure we don't have stale events in the cache, I would add a TTL, which is configurable for each job (such that you can have long-living events). So in case the orchestrator fails to delete the cache entry, they would be invalidated at some point by exceeding its TTL.

Do you have any references to learn more? I've looked at [1]

[1] https://learn.microsoft.com/en-us/azure/architecture/best-pr...

Yes indeed. Attaching below a couple of references from my exploration:

Transactional outbox pattern:

- https://microservices.io/patterns/data/transactional-outbox....

- https://softwaremill.com/microservices-101

- https://brandur.org/job-drain

A pattern similar to the one I mention using cache:

- https://masstransit-project.com/articles/outbox.html

On the alternative solution, it's worth stressing, that it does not have to be a cache service. Any key/value store would do.