by MauranKilom 867 days ago
Two big question marks after reading this and the linked home page (partially pointed out in other comments):

- If there's a flaw in your application code that causes a crash (as is the motivating example in the essay), then restoring the entire program into the state it was in just before the crash happened would just cause it to crash again ad infinitum. Sure, this model helps against "my VM instance got preempted", but that's a pretty different category of "crash" (and also notably unrelated to supervision trees).

- "External" state (like an API endpoint being down/returning gibberish) can be part of the reason why your program got into a bad state. In fact, that's disproportionately likely, since external "weirdness" is comparatively hard to cover exhaustively in tests. In such a situation, the suggested computation model would never be able to recover even when restarted, because it would forever retain the (bad) API response. Effectively, this is just caching all the non-pure effects of your computation, and we all know about cache invalidation being a hard problem...

6 comments

Original article is now ironically crashed, but: my last job was working on a point-of-sale system which used this kind of append-only transaction system and a "crash and reboot on failure" model. Every button press got turned into one or more transactions. This had the nice property that if something failed, most of the time it was before anything got written, so the system would reboot and leave you back on the screen before your last button press. The state could also be shipped over to the developer's PC so you could repro that state under the debugger. There had to be a "detach account" getout clause for cases where the sequence of transactions caused a crash on load, which was rare but possible.

The hardest part was of course managing external state and journaling exactly where you had got to with external transaction APIs. Further backend reconciliation was available to flag this (and avoid Post Office scenarios).

Note that French NF525 almost mandates this design, at least for point-of-sale systems: every financial transaction has to be durably written for tax auditing purposes.

Hi! I'm the author of the essay.

Durable execution is meant to complement your application. You will never want to model everything with it. It solves the problem of needing to decide how often to manually snapshot some important state; snapshotting becomes implicit. Workflows in flawless can still fail: you could call the `panic` function or divide by zero. In the end it's arbitrary compute.

"External" state is one of the textbook examples for using durable execution. If you are interacting with 5 different services and calling 5 different API endpoints, you sometimes want transactional behaviour: leave all 5 systems in a consistent state after your interaction. You can't just call 2 and stop. Durable execution and patterns like saga [1] are among the most straightforward ways (for me) to solve this.

In flawless specifically, I try to give the user enough context about why things failed. It's very easy to reconstruct the whole computation from the log, and the user can then decide if they want to re-run the workflow. If you charge someone's credit card, but the call to extend their subscription fails (service down), you can't safely just re-run this. You have two choices: either you continue progressing and roll back the charge, or you fail and have someone manually look at it. In general, you want to use flawless in scenarios where the "called exactly once" guarantee is important. If you can just throw away the state and it's safe to re-run from the start, then you don't need flawless for this part of the app. The less state you have to care about, the better.
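
To make the "continue progressing and roll back the charge" option concrete, here is a minimal saga-style sketch in Python. It is not flawless's API; `run_saga`, `ServiceDown`, and the four step functions are hypothetical stand-ins, and the "services" are just an in-memory list.

```python
class ServiceDown(Exception):
    """Raised by a step whose downstream service is unavailable."""


def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action raises ServiceDown, undo the steps that already
    completed, newest first, and report failure.
    """
    done = []
    for action, compensate in steps:
        try:
            action()
        except ServiceDown:
            for undo in reversed(done):
                undo()
            return False
        done.append(compensate)
    return True


# Hypothetical in-memory stand-ins for a payment provider
# and a subscription service.
ledger = []


def charge_card():
    ledger.append("charge")


def refund_card():
    ledger.append("refund")


def extend_subscription():
    raise ServiceDown("subscription service unavailable")


def shorten_subscription():
    ledger.append("shorten")


ok = run_saga([
    (charge_card, refund_card),
    (extend_subscription, shorten_subscription),
])
```

Here the charge succeeds, the subscription call fails, and the only compensation that runs is the refund. A real saga also has to cope with the compensation itself failing, which this sketch ignores.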

EDIT: The alternative would be to manually construct a state machine with a database. "Check if the credit card was charged. Call Stripe. I finished charging the credit card, save this information. Call the subscription service, it failed, restart everything. Check if the credit card was charged ...". And depending on your workflow, this can be a very complicated process where 90% of your code is just dealing with possible failures. Especially if failures happen on the edge of some calls it can become very tricky.
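
A rough sketch of what a durable-execution runtime takes off your hands, assuming a hypothetical `Journal` class (not flawless's actual interface): each side effect is recorded under a step name, and a re-run replays recorded results instead of calling the service again.

```python
import json


class Journal:
    """Log of completed side effects, keyed by step name."""

    def __init__(self):
        # A real system would persist this durably; a dict is
        # enough to show the replay behaviour.
        self.entries = {}

    def step(self, name, effect):
        # Replay: a step that already completed returns its recorded result.
        if name in self.entries:
            return json.loads(self.entries[name])
        result = effect()
        self.entries[name] = json.dumps(result)
        return result


# Hypothetical services; `calls` counts real invocations.
calls = {"charge": 0, "extend": 0}
service = {"up": False}


def charge():
    calls["charge"] += 1
    return {"charge_id": "ch_1"}


def extend():
    calls["extend"] += 1
    if not service["up"]:
        raise RuntimeError("subscription service down")
    return {"extended": True}


def workflow(journal):
    receipt = journal.step("charge", charge)
    journal.step("extend", extend)
    return receipt


journal = Journal()
try:
    workflow(journal)  # first run: charge succeeds, extend fails
except RuntimeError:
    pass

service["up"] = True
receipt = workflow(journal)  # re-run: charge is replayed, not repeated
```

The card is charged once even though the workflow ran twice. What this sketch does not handle is exactly the tricky edge mentioned above: a crash after `effect()` returns but before the result is written to the journal.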

[1]: https://medium.com/cloud-native-daily/microservices-patterns...

I feel like this approach might still pose some challenges or issues with regards to time or stale data. A couple of problematic scenarios:

- Application requests a JWT token. It then crashes and gets restarted. It gets past the problematic point, but later when trying to make a request, it crashes due to the cached token being expired.

- Application interacts with the current time in a meaningful manner. Due to the log replay, it will always live in the past, and when switching from the cache-sourced time to the current time, some issues might occur, like deltas being larger than expected.

- Application goes through a webshop checkout flow. After restart, some of the items in its cart have already been sold, but the app doesn't check this, since it already went through a (cached) check and got the result.
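
The token scenario can be sketched like this. `ExpiringJournal` and its `max_age` parameter are made up for illustration (I don't know whether flawless offers anything similar), and the clock is passed in explicitly so the example is deterministic.

```python
class ExpiringJournal:
    """Replay log whose entries remember when they were recorded."""

    def __init__(self):
        self.entries = {}  # step name -> (recorded_at, value)

    def step(self, name, effect, now, max_age=None):
        if name in self.entries:
            recorded_at, value = self.entries[name]
            # Reuse the recorded value only while it is still fresh.
            if max_age is None or now - recorded_at <= max_age:
                return value
        # No entry, or the entry is stale: run the effect again.
        value = effect()
        self.entries[name] = (now, value)
        return value


issued = []


def fetch_token():
    # Hypothetical token endpoint; each real call issues a new token.
    issued.append(f"token-{len(issued) + 1}")
    return issued[-1]


journal = ExpiringJournal()
first = journal.step("token", fetch_token, now=0, max_age=300)
replayed = journal.step("token", fetch_token, now=100, max_age=300)
refreshed = journal.step("token", fetch_token, now=1000, max_age=300)
```

Without `max_age`, the replay at `now=1000` would hand back the long-expired first token, which is the failure described in the first bullet. Re-running the effect is only safe here because fetching a token is harmless to repeat; the same trick does not work for a charge.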

Funnily enough, this is actually a massive problem when working with cloud automation APIs. Terraform and the like kinda handle this problem by calculating / storing the “goal state” and then looking at the system’s current state, and coming up with a “plan” to reconcile it.

Unfortunately, cloud provider APIs are usually eventually consistent, and getting a full snapshot at scale is nigh impossible.

So, in order to work around this, I effectively built a write-ahead-log-style system atop Postgres. Something like Sagas would have been great, but as far as I can tell, there was no real pattern for multiple Sagas operating on global state doing coordination. This is where Postgres SSI came in handy: I could read the assumed state of the system, and if another worker came in and manipulated it, the write-ahead entry wouldn't get written, as the txn would fail to commit. The write-ahead entry would then get asynchronously processed by another worker, in case the first worker failed.
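
A toy illustration of the "intent is only written if the state I read is still current" idea. This is not Postgres SSI; it swaps in a simple version counter as an in-memory analogy, and `Store`, `commit_intent`, and `ConflictError` are hypothetical names.

```python
class ConflictError(Exception):
    """The state changed between reading it and writing the intent."""


class Store:
    def __init__(self):
        self.version = 0
        self.state = {"instances": 2}
        self.wal = []  # write-ahead entries awaiting a worker

    def snapshot(self):
        # Return the version alongside a copy of the state that was read.
        return self.version, dict(self.state)

    def commit_intent(self, read_version, intent):
        # Refuse the write if anyone committed since the snapshot.
        if read_version != self.version:
            raise ConflictError("state changed since it was read")
        self.wal.append(intent)
        self.version += 1


store = Store()

# Two workers read the same state and plan the same change.
v_a, seen_a = store.snapshot()
v_b, seen_b = store.snapshot()

store.commit_intent(v_a, {"op": "scale", "to": seen_a["instances"] + 1})
try:
    store.commit_intent(v_b, {"op": "scale", "to": seen_b["instances"] + 1})
    conflict = False
except ConflictError:
    conflict = True
```

Only the first worker's intent lands in the log; the second has to re-read and re-plan. In the real setup the database's serializable isolation does the conflict detection, so no explicit version column is needed.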

This sort of architecture often shows up in actors, e.g. https://www.microsoft.com/en-us/research/publication/orleans...

In that world, what you're generally looking at is

local state + incoming message -> new state + outgoing messages

Outgoing messages are sent only after persisting, and will be retried until success. Unique message IDs are used for idempotency (also for incoming messages).

Important: the actor runs with no side effects -- that's what makes rerunning things safe. In that kind of world, side effects are often achieved via e.g. sending a message that updates a materialized view somewhere (see e.g. Kafka).

With this setup, the source of badness is often isolated to the incoming message, and after a few failures an incoming message can be moved into a "dead letter queue" for ops to look at. In many scenarios, this actually works remarkably well.
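
A small Python sketch of that loop, with idempotency by message ID and a dead letter queue after repeated failures. All names (`Actor`, `handle`, `max_attempts`) are invented for the example and are not Orleans or beanstalkd APIs; persistence is faked with in-memory fields.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    id: str
    amount: int


def handle(state, msg):
    """Pure transition: (state, message) -> (new state, outgoing messages)."""
    if msg.amount < 0:
        raise ValueError("negative amount")
    new_state = state + msg.amount
    return new_state, [f"balance-updated:{msg.id}:{new_state}"]


@dataclass
class Actor:
    state: int = 0
    seen: set = field(default_factory=set)
    outbox: list = field(default_factory=list)
    dead_letters: list = field(default_factory=list)
    failures: dict = field(default_factory=dict)
    max_attempts: int = 3

    def deliver(self, msg):
        if msg.id in self.seen:
            return  # duplicate delivery, already handled
        try:
            new_state, outgoing = handle(self.state, msg)
        except Exception:
            attempts = self.failures.get(msg.id, 0) + 1
            self.failures[msg.id] = attempts
            if attempts >= self.max_attempts:
                # Park the poison message for ops and stop retrying it.
                self.dead_letters.append(msg)
                self.seen.add(msg.id)
            return
        # "Persist" the new state first, then release outgoing messages.
        self.state = new_state
        self.seen.add(msg.id)
        self.outbox.extend(outgoing)


actor = Actor()
actor.deliver(Message("m1", 10))
actor.deliver(Message("m1", 10))  # redelivery of the same message
for _ in range(3):
    actor.deliver(Message("m2", -5))  # fails every time
actor.deliver(Message("m3", 5))
```

Because `handle` has no side effects, retrying `m2` three times leaves the state untouched, and the actor keeps processing `m3` once `m2` is parked.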

https://pmatseykanets.github.io/beanstalkd-docs/protocol/#bu...

> then restoring the entire program into the state it was in just before the crash happened would just cause it to crash again ad infinitum.

Hopefully!

But, sadly, this is not always the case. It's sad because it's incredibly hard to debug crashes that don't happen deterministically.

About your first point, in a BEAM app with a supervision tree you don't necessarily need to restore your entire app state or the state "it was in before"; you can restart with just a "workable" state.

> cache invalidation being a hard problem

For those unaware: "There are only two hard things in computer science: cache invalidation and naming things." Plus variants like, "Oh, and off-by-one errors."