Hacker News
by aidos 3202 days ago
Ours are actually user-generated, and the running time of each task varies (from a few minutes to an hour). Users can dump anywhere between 1 and 200 tasks on us at a time.

The way we have it set up is:

- a simple job queue with RQ (Redis)

- monitoring watches the queue and pumps a metric into CloudWatch (there are a few different types of job, and it calculates a single aggregate value for "queue pressure")

- autoscaling then sets the desired capacity of a fleet of r4.2xlarge machines (somewhere between 1 and 20)

- the autoscaling config protects all of those machines from scale-in, so the autoscaler never terminates them itself; they have to be shut down some other way

- each of those machines runs a cron job on boot that records its start time

- that start time lets a second cron job run just before the end of each hour of uptime. If the machine isn't doing anything at that point, it shuts itself down.

- the machines are set to terminate on shutdown, so they die completely rather than just stopping

- additionally, we've hacked RQ so that workers that are closer to death move themselves to the back of the queue (of workers waiting for the next job) more frequently. This gives them a better chance of being idle at the end of the hour, so they can be shut down.
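A minimal sketch of the "queue pressure" metric, assuming each job type gets a weight roughly proportional to its typical run time. The job type names, the weights, the namespace and the metric name are all hypothetical; the comment doesn't say how the aggregate is actually calculated. The second function only builds the arguments for CloudWatch's PutMetricData call, so the sketch runs without AWS credentials.

```python
from typing import Mapping

# Hypothetical weights: relative cost of one queued job of each type.
JOB_WEIGHTS = {"render": 4.0, "export": 1.0, "thumbnail": 0.25}


def queue_pressure(counts: Mapping[str, int],
                   weights: Mapping[str, float] = JOB_WEIGHTS) -> float:
    """Collapse per-type queue lengths into one "queue pressure" value.

    `counts` maps job type -> number of queued jobs, which could be read with
    rq.Queue(name, connection=redis_conn).count for each queue.
    Job types missing from `weights` count with weight 1.0.
    """
    return float(sum(n * weights.get(job_type, 1.0)
                     for job_type, n in counts.items()))


def metric_request(pressure: float) -> dict:
    """Keyword arguments for CloudWatch PutMetricData.

    With boto3 this would be sent as
    boto3.client("cloudwatch").put_metric_data(**metric_request(pressure)).
    """
    return {
        "Namespace": "Workers",              # hypothetical namespace
        "MetricData": [{
            "MetricName": "QueuePressure",   # hypothetical metric name
            "Value": pressure,
            "Unit": "Count",
        }],
    }
```

Publishing one aggregate value, rather than one metric per job type, keeps the scaling policy down to a single alarm or target.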
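The comment doesn't say which scaling policy turns that metric into a desired capacity. As a rough illustration only, a simple proportional rule clamped to the fleet bounds of 1 to 20 machines might look like this; `pressure_per_machine` is a hypothetical tuning constant, not something from the original setup.

```python
import math

MIN_MACHINES = 1   # fleet bounds from the comment
MAX_MACHINES = 20


def desired_capacity(pressure: float, pressure_per_machine: float) -> int:
    """Map aggregate queue pressure onto a fleet size within the bounds.

    pressure_per_machine (hypothetical) is how much pressure one
    r4.2xlarge is expected to absorb.
    """
    if pressure_per_machine <= 0:
        raise ValueError("pressure_per_machine must be positive")
    wanted = math.ceil(pressure / pressure_per_machine)
    # Never drop below one machine or exceed the fleet maximum.
    return max(MIN_MACHINES, min(MAX_MACHINES, wanted))
```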
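The two cron jobs can be sketched as follows, assuming the boot-time job writes an hourly crontab entry that fires a few minutes before each anniversary of the boot time, and that entry calls a script which checks whether the worker is idle. The command path, the 5-minute lead and the `busy` flag are hypothetical. On a machine set to terminate on shutdown, the script would finish with something like `shutdown -h now`.

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)


def crontab_line(boot_time: datetime, command: str, lead_minutes: int = 5) -> str:
    """Hourly crontab entry firing `lead_minutes` before each uptime hour ends.

    boot_time should be expressed in the same timezone cron uses.
    """
    minute = (boot_time.minute - lead_minutes) % 60
    return f"{minute} * * * * {command}"


def time_left_in_hour(boot_time: datetime, now: datetime) -> timedelta:
    """Time until the next whole hour of uptime, counted from boot."""
    return HOUR - (now - boot_time) % HOUR


def should_shut_down(boot_time: datetime, now: datetime, busy: bool,
                     window: timedelta = timedelta(minutes=5)) -> bool:
    """Shut down only if idle and inside the last `window` of an uptime hour."""
    return not busy and time_left_in_hour(boot_time, now) <= window


if __name__ == "__main__":
    boot = datetime(2017, 1, 1, 10, 7, tzinfo=timezone.utc)
    print(crontab_line(boot, "/usr/local/bin/maybe-shutdown"))
    print(should_shut_down(boot, boot + timedelta(minutes=57), busy=False))
```

Re-checking the time left inside the script, instead of trusting that cron fired on time, means a delayed run can't shut down a machine that has just started a fresh hour.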
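One way such an RQ hack could work (the comment doesn't say how it was actually done): RQ's blocking dequeue is built on Redis `BLPOP`, and when several clients are blocked on the same key, Redis hands the next item to the one that has been waiting longest. A worker whose `BLPOP` times out and immediately blocks again therefore rejoins at the back of that line. The sketch gives workers a shorter listen timeout as their hour runs out, plus a toy model of Redis's waiting line to show the effect. `listen_timeout`, its constants and `WaitingWorkers` are hypothetical.

```python
from collections import deque
from typing import Optional


def listen_timeout(seconds_left: float, base: float = 300.0,
                   minimum: float = 5.0) -> float:
    """Hypothetical policy: BLPOP timeout shrinks linearly with time left in the hour.

    A shorter timeout means the worker gives up waiting, and re-blocks at the
    back of Redis's line of waiting clients, more often.
    """
    fraction = max(0.0, min(1.0, seconds_left / 3600.0))
    return max(minimum, base * fraction)


class WaitingWorkers:
    """Toy model of the FIFO of clients blocked on BLPOP for one queue key."""

    def __init__(self) -> None:
        self._line = deque()  # worker names, longest-waiting first

    def block(self, worker: str) -> None:
        # A worker that times out and re-issues BLPOP goes to the back.
        if worker in self._line:
            self._line.remove(worker)
        self._line.append(worker)

    def push_job(self) -> Optional[str]:
        # Redis serves the client that has been blocked the longest.
        return self._line.popleft() if self._line else None
```

With this schedule, a worker with five minutes left re-blocks every 25 seconds instead of every 5 minutes, so a worker early in its hour is usually ahead of it in line when a new job arrives.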