| HN Mirror

As noted by another HNer, we are using one cluster and up to 10 nodes as the top end, with only one running 90% of the time and only up to 3 covering the other 9%. We set 10 to the upper limit to ensure we're not going to have borked runs in case some crazy random ML model isn't going to take everything out and force killed jobs. The vast majority of workloads running on the cluster are HIGHLY variable with most running on Monday morning/afternoon and Month or Quarter starts + 3 days.

We are running many hundreds of jobs on those peak days with only a handful running on any other day. While many bring up examples where 24/7 infrastructure from a single box is more than plenty, we find that we can run micro VMs in this configuration and not have to worry about resource contention as our jobs run.

Pre-GKE, we were managing the timing manually, which was fine until we started to scale, but we found this to be a far better situation. Particularly because we simply don't have to think about it.

YMMV.