| HN Mirror



	by fvdessen 1407 days ago
	You need to split those jobs into smaller ones that read their parameters from a queue. Then they will fit in serverless and also be more reliable.

2 comments

sonium 1406 days ago

I'm not sure why this would be more reliable. It would probably fit, but at the cost of additional complexity.


fvdessen 1406 days ago

When you split the work into smaller jobs, you have to design them to work in the face of retries and parallel execution. It's a bit of complexity, but the end result is a scalable, self-healing system that can handle live code updates, all of which makes the full workflow inherently more reliable.
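A minimal sketch of what "work in the face of retries" can look like, assuming each job carries a unique id and that completed results are recorded somewhere durable. The names `completed`, `process` and `handle_job` are hypothetical, and the in-memory dict only stands in for a real store; truly parallel workers would also need an atomic check-and-write.

```python
# Hypothetical sketch: all names here are illustrative, not from the thread.
completed = {}  # job id -> result; stands in for a durable store


def process(params):
    """The actual unit of work; here it just sums some numbers."""
    return sum(params["values"])


def handle_job(job):
    """Safe to call more than once: a repeated delivery does no extra work."""
    job_id = job["id"]
    if job_id in completed:
        # Retry of a job that already finished: return the recorded result.
        return completed[job_id]
    result = process(job["params"])
    completed[job_id] = result
    return result


job = {"id": "chunk-0", "params": {"values": [1, 2, 3]}}
first = handle_job(job)
second = handle_job(job)  # simulated retry of the same message
```

Because the handler keys everything on the job id, the queue is free to redeliver a message after a crash or timeout without corrupting the output.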

If you have a big >1h job, you have to add locks, make sure deploys don't interrupt the job, handle retries of the whole job, maintain both serverless and non-serverless infrastructure, and then inevitably rewrite the whole thing when it takes too long to be viable. All in all, that is also a lot of work and complexity, and it is wasted on making a bad design work.


latchkey 1406 days ago

60+ minute jobs are already complex.


simiones 1407 days ago

And much harder to maintain and understand...


fvdessen 1406 days ago

We're doing that with cloud functions, pubsub and pulumi. The infra code to set that up is trivial, and it is actually a lot easier to maintain since it's fully serverless and you get retries and parallelism 'for free'. With cron jobs on VMs the job itself might be a bit easier to code, but everything around it is a lot harder. (What happens if your 5h job crashes in the middle, and who restarts it? How do you manage locks to prevent concurrent execution? How do you prevent that job from overloading the system? etc.)

Just to clarify our setup:
- 1 pubsub 'job' queue
- 1 cloud function, triggered by a scheduled event, that populates the job queue
- 1 idempotent cloud function to handle a job, triggered by events on the queue
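The three pieces above can be sketched locally as follows. This is only an illustration under stated assumptions: Python's standard `queue.Queue` stands in for the pubsub queue, a plain dict stands in for durable output storage, and `populate_queue`, `handle_job` and `drain` are hypothetical names, not the poster's code or any cloud provider's API.

```python
import queue

job_queue = queue.Queue()  # stands in for the pubsub 'job' queue
results = {}               # stands in for durable output storage


def populate_queue(items, chunk_size):
    """Scheduled function: split the work into small jobs and enqueue them."""
    for start in range(0, len(items), chunk_size):
        job_queue.put({
            "id": f"chunk-{start}",
            "items": items[start:start + chunk_size],
        })


def handle_job(job):
    """Idempotent worker: writing the same key twice leaves the same state."""
    results[job["id"]] = [x * 2 for x in job["items"]]


def drain():
    """Stand-in for the platform delivering each queued message to the worker."""
    while not job_queue.empty():
        handle_job(job_queue.get())


populate_queue(list(range(10)), chunk_size=4)
drain()
```

Each job is small and carries its own parameters, so a failed job can be redelivered on its own instead of restarting the whole run.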
