When you split the work up into smaller jobs, you have to design them to work in the face of retries and parallel execution. That adds a bit of complexity, but the end result is a scalable, self-healing system that can handle live code updates, and those properties make the full workflow inherently reliable.
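To make "designed for retries and parallel execution" concrete, here is a minimal sketch, not our actual code. The names (`split`, `handle_job`, `completed`, `results`) are made up for illustration, and the in-memory set and dict stand in for whatever durable store you would really use. The idea is that each small job has a deterministic ID and writes its result by overwriting, so running it twice is harmless.

```python
# Hypothetical in-memory stand-ins for a durable store (e.g. a database table).
completed: set[str] = set()
results: dict[str, int] = {}


def split(items: list[int], size: int) -> dict[str, list[int]]:
    """Split one big job into small jobs with deterministic IDs.

    Deterministic IDs mean a re-run of the splitter produces the same
    jobs, so duplicates can be detected by the handler.
    """
    return {
        f"chunk-{i // size}": items[i:i + size]
        for i in range(0, len(items), size)
    }


def handle_job(job_id: str, payload: list[int]) -> None:
    """Process one small job; safe to call again on retry or duplicate delivery."""
    if job_id in completed:
        return  # already done, nothing to do
    # Overwrite rather than append, so a half-finished earlier attempt
    # cannot leave a double-counted result behind.
    results[job_id] = sum(payload)
    completed.add(job_id)
```

Calling `handle_job` twice with the same `job_id` leaves `results` unchanged, which is what lets the queue retry freely.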
If you have one big >1h job, you have to add locks, make sure deploys don't interrupt the job, handle retries of the whole job, maintain both serverless and non-serverless infrastructure, and then inevitably rewrite the whole thing when it takes too long to be viable. All in all, that is also a lot of work and complexity, and it is wasted on making a bad design work.
We're doing this with Cloud Functions, Pub/Sub and Pulumi. The infra code to set that up is trivial, and it is actually a lot easier to maintain since it's fully serverless and you get retries and parallelism 'for free'. With cron jobs on VMs the job itself might be a bit easier to code, but everything around it is a lot harder. (What happens if your 5h job crashes in the middle? Who restarts it? How do you manage locks to prevent concurrent execution? How do you prevent that job from overloading the system? etc.)
Just to clarify our setup:
- 1 Pub/Sub 'job' queue
- 1 cloud function, triggered by a scheduled event, that populates the job queue
- 1 idempotent cloud function that handles a single job, triggered by events on the queue
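
The three pieces above can be simulated locally in a few lines. This is only a sketch under assumptions: a `deque` stands in for the Pub/Sub queue, a dict stands in for durable job state, and every name here (`scheduled_populate`, `handle_job`, `drain`, the `report-...` IDs) is hypothetical rather than taken from our code or from any Google Cloud API. It shows a failed job being redelivered and a duplicate message being ignored.

```python
from collections import deque

queue: deque[dict] = deque()    # stands in for the Pub/Sub 'job' queue
processed: dict[str, str] = {}  # stands in for durable job state


def scheduled_populate(customer_ids: list[str]) -> None:
    """Stands in for the scheduler-triggered function: one message per small job."""
    for cid in customer_ids:
        queue.append({"job_id": f"report-{cid}", "customer": cid})


def handle_job(message: dict) -> None:
    """Stands in for the queue-triggered function; idempotent by job_id."""
    job_id = message["job_id"]
    if job_id in processed:
        return  # duplicate delivery: skip
    processed[job_id] = f"done:{message['customer']}"


def drain(fail_once: set[str]) -> int:
    """Deliver every message at least once; requeue on failure.

    `fail_once` lists job IDs that should crash on their first attempt.
    Returns the total number of deliveries.
    """
    deliveries = 0
    while queue:
        msg = queue.popleft()
        deliveries += 1
        try:
            if msg["job_id"] in fail_once:
                fail_once.discard(msg["job_id"])
                raise RuntimeError("simulated crash")
            handle_job(msg)
        except RuntimeError:
            queue.append(msg)  # retried later, like an unacknowledged message
    return deliveries
```

In the real setup the queue does the redelivery for you; the only part you have to write carefully is the idempotent handler.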