| HN Mirror



	by sonium 1452 days ago
	That's what I thought as well, but now I have some long-running jobs that exceed GCF's 60-minute limit. So I'm stuck with Docker on Compute Engine, where GCP treats you like a second-class citizen, as the OP found out.

2 comments

latchkey 1451 days ago

I've worked on systems that did that and it was a huge mess, especially as the company grew. When jobs run that long, any failure means they have to start over and you lose all that time. Even worse, it stacks up: one ETL job feeds into the next and it becomes a house of cards.

It is better to design things from the start to cut things up into smaller units and parallelize as much as possible. By doing that, you solve the problem I mention... as well as the problem you mention. Two birds.


fvdessen 1452 days ago

You need to split those jobs into smaller ones that read their parameters from a queue. Then they will fit in serverless and also be more reliable.
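A minimal sketch of the splitting step, assuming the big job is a pass over a known number of rows. The names `split_job` and `enqueue_chunks` are hypothetical, and the standard-library `Queue` only stands in for a real message queue:

```python
from queue import Queue


def split_job(total_rows: int, chunk_size: int) -> list[dict]:
    """Cut one big job into independent chunk descriptions (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        {"start": start, "end": min(start + chunk_size, total_rows)}
        for start in range(0, total_rows, chunk_size)
    ]


def enqueue_chunks(q: Queue, total_rows: int, chunk_size: int) -> int:
    """Put every chunk on the queue; each worker later reads its parameters from it."""
    chunks = split_job(total_rows, chunk_size)
    for chunk in chunks:
        q.put(chunk)
    return len(chunks)
```

Each message carries only the parameters of one small unit of work, so any single unit can stay well under a per-invocation time limit.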


sonium 1451 days ago

I'm not sure why this would be more reliable. It would probably fit, but at the cost of additional complexity.


fvdessen 1451 days ago

When you split up into smaller jobs, you have to design them to work in the face of retries and parallel execution. It's a bit of complexity, but the end result is a scalable, self-healing system that can handle live code updates. Those properties are what make the full workflow inherently reliable.
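To illustrate what "work in the face of retries" can mean, here is a small sketch of an idempotent handler. It assumes each chunk is identified by its `start` and `end` parameters, and that `results` stands in for some durable store; `process_chunk` is a hypothetical name, and summing the range is a placeholder for the real work:

```python
def process_chunk(job: dict, results: dict) -> bool:
    """Process one chunk. Returns True if work was done, False if it was skipped.

    Running this twice for the same job leaves `results` unchanged, so a
    retry or a duplicate delivery is harmless.
    """
    key = (job["start"], job["end"])
    if key in results:
        # Already done by an earlier attempt: skip instead of redoing it.
        return False
    # Placeholder for the real work on this slice of the data.
    results[key] = sum(range(job["start"], job["end"]))
    return True
```

Because the result is keyed by the job's parameters, two workers that pick up the same chunk write the same value, which is what makes parallel execution safe in this toy version.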

If you have a big >1h job, you have to add locks, make sure deploys don't interrupt the job, handle retries of the whole job, maintain both serverless and non-serverless infrastructure, and then inevitably rewrite the whole thing when it takes too long to be viable. All in all, that is also a lot of work and complexity, and it is wasted on making a bad design work.


latchkey 1451 days ago

60+ minute jobs are already complex.


simiones 1452 days ago

And much harder to maintain and understand...


fvdessen 1451 days ago

We're doing that with Cloud Functions, Pub/Sub and Pulumi. The infra code to set that up is trivial, and it is actually a lot easier to maintain since it's fully serverless and you get retries and parallelism 'for free'. With cron jobs on VMs the job itself might be a bit easier to code, but everything around it is a lot harder. (What happens if your 5h job crashes in the middle? Who restarts it? How do you manage locks to prevent concurrent execution? How do you prevent that job from overloading the system? etc.)

Just to clarify our setup:
- 1 pubsub 'job' queue
- 1 cloud function triggered by a scheduled event populates the job queue
- 1 idempotent cloud function to handle a job, triggered by events on the queue.
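The three pieces above can be sketched in memory as follows. This is not the actual Cloud Functions or Pub/Sub code: a `deque` stands in for the queue, `populate_queue` for the scheduled function, `handle_job` for the queue-triggered function, and `drain` for the platform's redelivery of failed messages. All names, the `fail_ids` crash simulation and the `max_attempts` limit are assumptions made for illustration:

```python
from collections import deque


def populate_queue(queue: deque, total: int, chunk: int) -> None:
    """Stand-in for the scheduled function: fills the job queue."""
    for start in range(0, total, chunk):
        end = min(start + chunk, total)
        queue.append({"id": f"{start}-{end}", "start": start, "end": end, "attempts": 0})


def handle_job(job: dict, store: dict, fail_ids: frozenset) -> None:
    """Stand-in for the idempotent, queue-triggered function."""
    if job["id"] in store:
        return  # already processed, nothing to do
    if job["id"] in fail_ids and job["attempts"] == 0:
        raise RuntimeError("simulated crash on first attempt")
    store[job["id"]] = sum(range(job["start"], job["end"]))  # placeholder work


def drain(queue: deque, store: dict, fail_ids=frozenset(), max_attempts: int = 3) -> None:
    """Stand-in for the platform: delivers jobs and redelivers failed ones."""
    while queue:
        job = queue.popleft()
        try:
            handle_job(job, store, fail_ids)
        except RuntimeError:
            job["attempts"] += 1
            if job["attempts"] < max_attempts:
                queue.append(job)  # only the failed chunk is retried


jobs = deque()
finished = {}
populate_queue(jobs, total=10, chunk=4)
drain(jobs, finished, fail_ids=frozenset({"4-8"}))
```

In this toy run the chunk `4-8` crashes once and is redelivered on its own, while the other chunks are unaffected, so a failure costs one small unit of work rather than the whole job.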
