by p10jkle 742 days ago
I wrote two blog posts on this! It's a really hard problem.

https://restate.dev/blog/solving-durable-executions-immutabi...

https://restate.dev/blog/code-that-sleeps-for-a-month/

The key takeaways:

1. Immutable code platforms (like Lambda) make things much more tractable - keeping old code executable for 'as long as your handlers run' is the property you need. This can also be achieved in Kubernetes with some clever controllers.

2. The ability to make delayed RPCs and span time that way lets you keep your handlers very short-running while still taking action over very long periods. This is far better than just sleeping over and over in a loop - instead, you do delayed tail calls.
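To make point 2 concrete, here is a minimal sketch of a delayed tail call. It does not use the Restate SDK: `ToyScheduler`, `send_delayed` and `send_reminder` are hypothetical names, and time is simulated in memory rather than persisted. The handler does one unit of work, schedules its own next invocation, and returns straight away.

```python
import heapq
import itertools
import json

DAY = 24 * 60 * 60


class ToyScheduler:
    """In-memory stand-in for a durable runtime (hypothetical, simulated clock)."""

    def __init__(self):
        self.now = 0.0
        self.handlers = {}
        self._queue = []
        self._seq = itertools.count()  # tie-breaker for calls due at the same time

    def send_delayed(self, handler_name, args, delay):
        # Only the handler name and JSON-serialised arguments stay in flight.
        payload = json.dumps(args)
        heapq.heappush(
            self._queue, (self.now + delay, next(self._seq), handler_name, payload)
        )

    def run(self):
        while self._queue:
            due, _, name, payload = heapq.heappop(self._queue)
            self.now = due
            # The handler is looked up by name when the call is delivered,
            # so it runs whatever code is registered at that moment.
            self.handlers[name](self, **json.loads(payload))


sent = []


def send_reminder(sched, user, remaining):
    sent.append((sched.now / DAY, user))
    if remaining > 1:
        # Delayed tail call: schedule the next run and return immediately,
        # instead of sleeping inside this invocation.
        sched.send_delayed(
            "send_reminder", {"user": user, "remaining": remaining - 1}, delay=30 * DAY
        )


sched = ToyScheduler()
sched.handlers["send_reminder"] = send_reminder
sched.send_delayed("send_reminder", {"user": "alice", "remaining": 3}, delay=30 * DAY)
sched.run()
```

Each invocation here lasts only as long as one `send_reminder` call, yet the chain spans 90 simulated days.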

2 comments

> Immutable code platforms (like Lambda) make things much more tractable

My job is admittedly very old-school, but is that actually doable? I don't think my stakeholders would accept a version of "well we can't fix this bug for our current customers, but the new ones won't have it". That just seems like chaos nobody wants to deal with.

I don't personally believe this immutability property should be used for handlers that run for more than, say, 5 minutes. Any longer than that, I'd suggest using delayed calls, which explicitly serialise the handler arguments instead of saving the whole journal. I agree executing code that is even just an hour old is unacceptable in almost all cases.

Obviously you can still sleep for a month, but I really see no way to make such a handler safely updatable without editing the code to branch on versions, which can become a mess really quickly (but good for getting out of a jam!)
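For illustration, branching on versions could look roughly like the sketch below. `Ctx`, `step` and `onboarding` are hypothetical names for a toy journal-replay model, not a real SDK API. The handler pins the code version as its first journaled step, so an in-flight invocation replays along the branch it started on while new invocations take the new path.

```python
class Ctx:
    """Toy journal: replays recorded step results, records new ones (hypothetical)."""

    def __init__(self, journal=None):
        self.journal = list(journal or [])
        self._cursor = 0

    def step(self, name, action):
        if self._cursor < len(self.journal):
            recorded_name, result = self.journal[self._cursor]
            if recorded_name != name:
                # The code no longer matches what was recorded for this invocation.
                raise RuntimeError(
                    f"journal mismatch: recorded {recorded_name!r}, code ran {name!r}"
                )
        else:
            result = action()
            self.journal.append((name, result))
        self._cursor += 1
        return result


CODE_VERSION = 2  # version of the code currently deployed


def onboarding(ctx):
    # Pinned on first execution; replays read back the recorded value.
    version = ctx.step("version", lambda: CODE_VERSION)
    if version == 1:
        ctx.step("charge_card", lambda: "charged")
        ctx.step("send_email", lambda: "emailed")
    else:
        ctx.step("check_fraud", lambda: "ok")  # step added in version 2
        ctx.step("charge_card", lambda: "charged")
        ctx.step("send_email", lambda: "emailed")
    return version


# An invocation that started under version 1 and was suspended after charging.
old = Ctx([("version", 1), ("charge_card", "charged")])
old_version = onboarding(old)

# A fresh invocation started after the deploy.
new = Ctx()
new_version = onboarding(new)
```

Every further change adds another branch, which is where the mess comes from.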

ah! this took me a second to grok, but from #2 above: "we just want to send the email service a request that we want to be processed in a month. The thing that hangs around ‘in-flight’ wouldn’t be a journal of a partially-completed workflow, with potentially many steps, but instead a single request message."
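A rough sketch of that difference, with made-up handler names and message fields (none of this is Restate's actual wire format): the journal only makes sense to the code version that produced it, while the single request can be picked up by whatever version is deployed when it comes due.

```python
import json

# In flight with a handler that sleeps for a month: a journal of completed
# steps, tied to the code that recorded them. Shown for contrast only.
journal_in_flight = [
    {"step": "create_account", "result": "acct-1"},
    {"step": "sleep", "until_day": 30},
]

# In flight with a delayed call: one self-describing request message.
request_in_flight = json.dumps(
    {"handler": "send_followup_email", "args": {"account": "acct-1"}, "due_day": 30}
)


def send_followup_email_v1(account):
    return f"v1 email to {account}"


def send_followup_email_v2(account):
    return f"v2 email to {account} (bug fixed)"


deployed = {"send_followup_email": send_followup_email_v1}
# A bug fix is deployed while the request is still waiting.
deployed["send_followup_email"] = send_followup_email_v2


def deliver(raw):
    # Delivery resolves the handler by name at the time the request is due.
    msg = json.loads(raw)
    return deployed[msg["handler"]](**msg["args"])


result = deliver(request_in_flight)
```

In this toy model the existing customer gets the fixed code, which is the situation the "we can't fix this bug for our current customers" worry is about.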

I'll have to think through how much that solves, but it's a new insight for me - thanks!

I like that you're working on this. seems tricky, but figuring out how to clearly write workflows using this pattern could tame a lot of complexity.

It's always been a lively topic within Restate. The conversation goes a bit like this:

> Let users write code how they want, it's our job to make it work!

> Yes, but it's simply not safe to do this!

I think we need to offer our users a lot of stuff to get it right:

1. Tools so they know when a deploy puts in-flight invocations at risk, or maybe even in their editor, showing what invocations exist at each line of a handler

2. Nudges towards delayed call patterns wherever we can

3. Escape hatches if they absolutely have to change a long-running handler - ways to branch their code on the running version, clever cancellation tricks, 'restart as a new call' operation
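As a sketch of what the tooling in point 1 might check (an assumption about how such a tool could work, not a description of anything Restate ships): if a handler is modelled as a fixed, linear sequence of named steps, an in-flight invocation is at risk when the steps it has already recorded are not a prefix of the new code's steps. `at_risk` and the sample data are hypothetical.

```python
def at_risk(in_flight, new_steps):
    """Return ids of invocations whose recorded steps are not a prefix of new_steps."""
    risky = []
    for inv_id, recorded in in_flight.items():
        if recorded != new_steps[: len(recorded)]:
            risky.append(inv_id)
    return risky


# Steps already journaled by each in-flight invocation under the old code.
in_flight = {
    "inv-1": ["charge_card"],
    "inv-2": [],  # accepted but not started: safe under any version
    "inv-3": ["charge_card", "send_email"],
}

# Step order in the version about to be deployed.
new_steps = ["check_fraud", "charge_card", "send_email"]

risky = at_risk(in_flight, new_steps)
```

Real handlers branch and loop, so a real check would be much harder than this prefix comparison.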

Sadly no silver bullet. Delayed calls get you a lot of the way though :p