by istvan__ 4014 days ago
I don't think that a load of 172 is a good idea. I know this is a benchmark measuring how fast you can ideally go, but in production the question is how fast you can go while keeping latency within the SLA. As a general rule you want to run your boxes at a normalized load (load / number of CPU cores) of around 1. The rest of the article is pretty nice.
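For anyone who wants to check this on their own box, here is a minimal sketch of the normalized load calculation in Python. The function names are made up for illustration, and it assumes a Unix-like system where `os.getloadavg()` is available.

```python
import os


def normalized_load(load_avg: float, cpu_count: int) -> float:
    """Return the load average divided by the number of CPU cores."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


def current_normalized_load() -> tuple:
    """Normalized 1-, 5- and 15-minute load averages for this machine.

    os.getloadavg() is only available on Unix-like systems.
    """
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return tuple(normalized_load(avg, cores) for avg in os.getloadavg())
```

As an example with an assumed core count (the article's number is not given here): a load of 172 on an 8-core machine gives `normalized_load(172, 8) == 21.5`, far above 1.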
2 comments

Having load average exceed core count isn't necessarily a sign of an imminent cascade failure. IIRC it's just a count of the number of processes that could run, but are waiting for a timeslice. A particular server and application might be perfectly fine with a load average that is 10 times the number of cores, as long as the average remains stable and the server is meeting response requirements.
Actually, since you have no idea what is causing the load (it can be a wait on network IO), I think running your production system in that shape is not recommended. Out of curiosity, in what situation is it OK to have significantly more things waiting to run than your actual capacity? Seems like bad capacity planning to me. Anyway, this is how it was done (keeping the normalized load around 1) at my previous gig, where we had ~5000 nodes, and it was working fine. I work on Hadoop clusters nowadays, and any time we run into a load of 100+ there is severe degradation in the service: timeouts etc. happen. In my experience, a high normalized load sustained over time (not talking about 1-minute spikes) should be avoided.
But a load of 100+ is considerably more than a normalised load of 1, unless we're talking about 100+ core machines.
Remember that processes in (D)isk Wait state count towards the load average even though they are not running or runnable.
This++
As well as network IO. This is why you can't tell what is going on from the load alone, and this is exactly why I don't like it too much in production. A box should have a smooth 15-minute normalized load over time. (If your workload changes you can do autoscaling; I think we used the normalized load as the metric for scaling up and down.)
At least according to the Linux documentation, network IO is NOT part of the load average, only processes waiting for disk IO and runnable processes.
Processes blocked on a network socket should be Sleeping and not contributing to the load average. However, in the case of network filesystems, processes may well be in Disk Wait while waiting for the server to respond.
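To see which of these states is actually behind a high load on a given box, something like the following rough Python sketch can help. It is Linux-only, the helper names are made up, and it reads the state field from `/proc/<pid>/stat`. It only looks at one entry per process, while the kernel counts individual tasks (threads), so treat the totals as an approximation.

```python
import os
from collections import Counter


def state_from_stat(stat_line: str) -> str:
    """Extract the state letter (R, S, D, ...) from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' rather than on whitespace.
    """
    after_comm = stat_line[stat_line.rfind(")") + 1:]
    return after_comm.split()[0]


def counts_toward_load(state: str) -> bool:
    """Running/runnable (R) and Disk Wait (D) count; Sleeping (S) does not."""
    return state in ("R", "D")


def scan_proc(proc_root: str = "/proc") -> Counter:
    """Count processes by state letter (Linux only)."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[state_from_stat(f.read())] += 1
        except (OSError, IndexError):
            continue  # process exited or stat was unreadable
    return counts
```

If `scan_proc()` shows a large `D` count and a small `R` count, the load is mostly processes stuck in Disk Wait (possibly on a network filesystem) rather than CPU contention.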