That's what I thought as well, but now I do have some long-running jobs that exceed GCF's 60min limit.
So I'm stuck with docker on Compute Engine, where GCP treats you like a 2nd class citizen as the OP found out.
I've worked on systems that did that, and it was a huge mess, especially as the company grew. When jobs run that long, any failure means they have to start over and you lose all that time. Even worse, it stacks up: one ETL job feeds into the next and the whole thing becomes a house of cards.
It is better to design things from the start to cut things up into smaller units and parallelize as much as possible. By doing that, you solve the problem I mention... as well as the problem you mention. Two birds.
When you split the work into smaller jobs, you have to design them to work in the face of retries and parallel execution. It's a bit of complexity, but the end result is a scalable, self-healing system that can handle live code updates, all of which makes the full workflow inherently more reliable.
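To make "work in the face of retries" concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `completed` stands in for whatever durable store you use, the sum is a placeholder for real work, and the function names are made up. The idea is that each chunk gets a deterministic ID, so a retry or a duplicate run of the same chunk does no extra work.

```python
from typing import Callable

# Hypothetical stand-in for a durable store (a database table, a bucket, ...).
completed: dict[str, int] = {}


def split_into_chunks(start: int, stop: int, size: int) -> list[tuple[int, int]]:
    """Cut the half-open range [start, stop) into chunks of at most `size` items."""
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]


def process_chunk(chunk: tuple[int, int]) -> int:
    """Idempotent unit of work: running it twice leaves the same stored result."""
    key = f"{chunk[0]}-{chunk[1]}"  # deterministic ID derived from the input
    if key in completed:  # a retry or a duplicate: skip the work
        return completed[key]
    result = sum(range(chunk[0], chunk[1]))  # placeholder for the real work
    completed[key] = result
    return result


def run_with_retries(task: Callable[[], int], attempts: int = 3) -> int:
    """Retry one failed unit; the other chunks are not affected."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except RuntimeError:
            if attempt == attempts:
                raise  # out of attempts: surface the error
    raise ValueError("attempts must be at least 1")
```

A failure now costs you one chunk's worth of time instead of the whole run, and since chunks don't depend on each other they can also be processed in parallel.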
If you have a big >1h job, you have to add locks, make sure deploys don't interrupt the job, handle retries of the whole job, maintain both serverless and non-serverless infrastructure, and then inevitably rewrite the whole thing when it takes too long to be viable. All in all, that is also a lot of work and complexity, and it is wasted on making a bad design work.
We're doing that with Cloud Functions, Pub/Sub and Pulumi. The infra code to set that up is trivial, and it is actually a lot easier to maintain since it's fully serverless and you get retries and parallelism 'for free'. With cron jobs on VMs the job itself might be a bit easier to code, but everything around it is a lot harder. (What happens if your 5h job crashes in the middle? Who restarts it? How do you manage locks to prevent concurrent execution? How do you prevent that job from overloading the system? Etc.)
Just to clarify our setup:
- 1 Pub/Sub 'job' queue
- 1 cloud function, triggered by a scheduled event, that populates the job queue
- 1 idempotent cloud function that handles a job, triggered by events on the queue.
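
For anyone who wants to see the shape of that setup, here is a rough in-memory simulation in plain Python. It is not our actual code and it does not use the real Pub/Sub or Cloud Functions APIs: `queue.Queue` stands in for the job queue, a thread pool stands in for parallel function invocations, and the names `schedule_jobs`, `handle_job` and `drain` are made up for the sketch. The one real property it relies on is that Pub/Sub delivers at least once, so the handler has to tolerate duplicates.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

job_queue: queue.Queue = queue.Queue()  # stand-in for the 'job' queue
done: dict[str, str] = {}  # stand-in for durable "already processed" state
done_lock = threading.Lock()


def schedule_jobs(job_ids: list[str]) -> None:
    """Scheduled function: publish one message per job."""
    for job_id in job_ids:
        job_queue.put({"job_id": job_id})


def handle_job(message: dict) -> bool:
    """Idempotent handler: returns True if it did the work, False on a duplicate."""
    job_id = message["job_id"]
    with done_lock:  # check-and-record atomically so parallel duplicates are safe
        if job_id in done:
            return False
        done[job_id] = f"processed {job_id}"  # placeholder for the real work
    return True


def drain(workers: int = 4) -> list[bool]:
    """Deliver every queued message to the handler, several at a time."""
    messages = []
    while not job_queue.empty():
        messages.append(job_queue.get())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_job, messages))


if __name__ == "__main__":
    schedule_jobs(["2024-01-01", "2024-01-02", "2024-01-03"])
    schedule_jobs(["2024-01-02"])  # simulate a duplicate delivery
    print(drain())
```

In the real thing the platform does the draining and the retrying for you; the only part you really have to get right yourself is the duplicate check in the handler.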