This is a very good point. Large clusters are typically shared resources, which is a big problem for the US forecasts, since we have to balance CPU time against the timeliness of results.
Well, production runs on one of two identical machines. This kind of redundancy is not uncommon for weather sites. One machine runs the production version of the forecast while the other runs version n+1. We flip between the two semi-regularly, so that if one machine has a major issue, production is unaffected when it is forced to move to the other. Smaller weather sites may have a single machine partitioned into two.