| Ours are actually user generated, and the running time of each task is variable (a few minutes to an hour). Users can dump anywhere between 1 and 200 tasks on us at a time. The way we have it set up is: - simple job queue with RQ (Redis) - monitoring watches the queue and pumps a metric into CloudWatch (there are a few different types of job, and it calculates a single aggregate value for "queue pressure") - autoscaling then sets the desired capacity for a fleet of r4.2xlarge machines (somewhere between 1 and 20) - the autoscaling config protects all those machines from scale-in, so they have to be shut down externally - each of those machines records its start time on boot, which lets a cron job run just before the end of each hour; if the machine isn't doing anything at that time, it shuts itself down - the machines are set to terminate on shutdown, so they die completely - additionally, we've hacked RQ so that workers that are closer to death move themselves to the back of the queue more frequently. This gives us a higher chance that a machine is idle, and so can be shut down, at the end of the hour. |
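
The end-of-hour check described above could look roughly like the sketch below. It is not our actual script: the function names, the five-minute window, and the assumption that hours are counted from the machine's boot time are all made up for illustration. How you find out whether the worker is busy is left to the caller.

```python
from datetime import datetime, timedelta

# Assumed values for illustration; neither comes from the setup described above.
BILLING_PERIOD = timedelta(hours=1)
SHUTDOWN_WINDOW = timedelta(minutes=5)


def time_left_in_hour(boot_time: datetime, now: datetime) -> timedelta:
    """Time remaining in the current hour, with hours counted from boot."""
    elapsed = now - boot_time
    # timedelta % timedelta gives how far we are into the current hour.
    return BILLING_PERIOD - (elapsed % BILLING_PERIOD)


def should_shut_down(boot_time: datetime, now: datetime, busy: bool) -> bool:
    """True only when the machine is idle and its current hour is nearly over."""
    if busy:
        return False
    return time_left_in_hour(boot_time, now) <= SHUTDOWN_WINDOW
```

A cron job on each machine would call `should_shut_down` with the recorded boot time and, if it returns `True`, trigger a normal OS shutdown. Because the instances are set to terminate on shutdown, that is enough to remove them from the fleet.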