|
Not much, but in our setup the image is not something which can evolve or change over time. This practice has some very practical reasons though. Scientific applications can be very picky about the libraries they use or need, down to the minor version, since the results they produce are very, very precise. Even if the results are not very accurate, you need to know the inaccuracy. An optimization in a math library can change this, and it's not something we want. Also, program verification and certification generally include the versions of the libraries used. Piecewise upgrades are a no-go too. Your cluster generally can't work well in heterogeneous configurations (due to library mismatches), and draining a node is not a straightforward task (due to the length of the jobs). If your cluster has a steady stream of incoming jobs, reducing resources also means queue bloat, and recovering from it is sometimes not easy. If you want to drain the whole cluster, it takes about 2-3 weeks, so you lose ~1 month of productivity. When you start an empty cluster to churn through its queues, saturation takes time, so it doesn't go to 11 directly. Also, worker nodes are highly isolated from the user's point of view. No users can log in, only known people submit jobs, etc. Unless there's a rogue academic trying to do nefarious things, the place is pretty safe and worry-free. In the past 15 years, we got two rootkit infections due to a server which is world-accessible by design. Other than that, nothing ever got infected. At the end of the day, this approach has some valid reasons to be alive. It's not that we're a bunch of lazy academics who refrain from applying good system administration practices. :D Addendum: The images generally get updated when new hardware is added, since new processors tend to work better with newer kernels. Also, sometimes we bite the bullet and update the whole cluster at once. xCAT helps a lot in this space.
If your image is sane, you can install batches of 150+ servers in 15 minutes while sipping your coffee. |