by nitrogen 4018 days ago
Having load average exceed core count isn't necessarily a sign of an imminent cascade failure. IIRC it's just a count of the number of processes that could run, but are waiting for a timeslice. A particular server and application might be perfectly fine with a load average that is 10 times the number of cores, as long as the average remains stable and the server is meeting response requirements.

Actually, since you have no idea what is causing the load (it could be waiting on network IO), I don't think running a production system in that state is recommended. Out of curiosity, in what situation is it OK to have significantly more things waiting to run than you have capacity for? Seems like bad capacity planning to me. Anyway, this is how it was done (keeping the normalized load around 1) at my previous gig, where we had ~5000 nodes, and it worked fine. I work on Hadoop clusters nowadays, and any time we hit a load of 100+ there is severe degradation in the service: timeouts and so on. In my experience, a high normalized load sustained over time (not 1-minute spikes) should be avoided.
But a load of 100+ is considerably more than a normalised load of 1, unless we're talking about 100+ core machines.
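For anyone following the arithmetic: "normalized load" in this thread appears to mean the load average divided by the core count. Here is a minimal Python sketch under that assumption. The function names are made up for illustration, and the 16-core figure is a hypothetical, not something either poster stated.

```python
import os


def normalized_load(load_average: float, cores: int) -> float:
    """Load average divided by core count (the thread's 'normalized load')."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def current_normalized_load() -> float:
    """Normalized 1-minute load of this machine (Unix-like systems only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return normalized_load(one_minute, cores)


# Hypothetical example: a load of 100 on an assumed 16-core node.
sixteen_core = normalized_load(100, 16)

# The same load only normalizes to 1 on a 100-core machine.
hundred_core = normalized_load(100, 100)
```

So a load of 100 on a 16-core box would be a normalized load of 6.25, well above the "around 1" target mentioned above. Calling `current_normalized_load()` gives the same ratio for the machine you are on.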