Hacker News
by fvdessen 1407 days ago
You need to split those jobs into smaller ones that read their parameters from a queue. Then they will fit in serverless and also be more reliable.
2 comments

I'm not sure why this would be more reliable. It would probably fit, but at the cost of additional complexity.
When you split the work into smaller jobs, you have to design them to work in the face of retries and parallel execution. It's a bit of complexity, but the end result is a scalable, self-healing system that can handle live code updates. Those features make the full workflow inherently reliable and scalable.
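A minimal sketch of a job that tolerates retries and parallel execution, assuming each job carries a unique id and its output can be written as a keyed upsert. The `store` dict is a hypothetical stand-in for a durable key-value store, and the thread pool only simulates parallel, at-least-once delivery:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a durable key-value store (assumption, not from the thread).
store = {}

def handle_job(job):
    """Idempotent: the write is keyed by job id, so running it twice leaves the same state."""
    store[job["id"]] = job["n"] * job["n"]

jobs = [{"id": f"job-{n}", "n": n} for n in range(5)]

# Simulate at-least-once delivery: the first two jobs are delivered twice.
deliveries = jobs + jobs[:2]

# Simulate parallel execution of the deliveries.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(handle_job, deliveries))
```

Because the handler overwrites a deterministic value under a fixed key, a duplicate or retried delivery is harmless, which is the property the queue relies on.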

If you have a big >1h job, you have to add locks, make sure deploys don't interrupt the job, handle retries of the whole job, maintain both serverless and non-serverless infrastructure, and then inevitably rewrite the whole thing when it takes too long to be viable. All in all, that is also a lot of work and complexity, and it is wasted on making a bad design work.

60+ minute jobs are already complex.
And much harder to maintain and understand...
We're doing that with cloud functions, pubsub and pulumi. The infra code to set that up is trivial, and it is actually a lot easier to maintain since it's fully serverless and you get retries and parallelism 'for free'. With cronjobs on VMs the job itself might be a bit easier to code, but everything around it is a lot harder. (What happens if your 5h job crashes in the middle, and who restarts it? How do you manage locks to prevent concurrent execution? How do you prevent that job from overloading the system? etc.)

just to clarify our setup:
- 1 pubsub 'job' queue
- 1 cloud function triggered by a scheduled event populates the job queue
- 1 idempotent cloud function to handle a job, triggered by events on the queue.
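The three pieces of that setup can be sketched locally. This is only an in-memory illustration under stated assumptions, not the actual cloud functions or Pub/Sub code: `queue.Queue` stands in for the job queue, `populate_jobs` for the scheduled function, `handle_job` for the idempotent handler, and `drain` for the queue's redelivery behaviour. All of these names, and the simulated failure, are hypothetical.

```python
import queue

job_queue = queue.Queue()  # stand-in for the pubsub 'job' queue (assumption)
results = {}               # stand-in for wherever job output is stored
attempts = {}              # per-chunk delivery counter, for illustration only

def populate_jobs(n_chunks):
    """Plays the scheduled function: splits the big job into small parameter messages."""
    for i in range(n_chunks):
        job_queue.put({"chunk": i})

def handle_job(msg):
    """Plays the idempotent handler: output is keyed by chunk, so a retry rewrites the same value."""
    chunk = msg["chunk"]
    attempts[chunk] = attempts.get(chunk, 0) + 1
    if chunk == 2 and attempts[chunk] == 1:
        # Simulated transient failure on the first delivery of chunk 2.
        raise RuntimeError("simulated transient failure")
    results[chunk] = chunk * 10

def drain():
    """Plays the queue's delivery loop: a message whose handler fails is redelivered."""
    while not job_queue.empty():
        msg = job_queue.get()
        try:
            handle_job(msg)
        except RuntimeError:
            job_queue.put(msg)  # put it back so it is retried later

populate_jobs(4)
drain()
```

In this sketch a crash affects only one small message, which is simply delivered again, rather than forcing a restart of the whole multi-hour job.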