| This analysis of steal time is not entirely correct. Steal time exists to fix a specific problem. Without it, when a hypervisor pre-empts a running guest and later resumes it, the guest has no way to tell that it was paused: as far as it can see, the process that was running at the moment of pre-emption ran the entire time. CPU usage reporting inside the guest then becomes badly wrong, with some processes showing much higher usage than they actually received. This affects fairness and can cause a range of knock-on problems. Steal time is simply a way to tell a guest that it was pre-empted, so that the guest OS can correct its usage accounting and preserve fairness. It is not, however, a general indication of overcommit. When a guest idles a VCPU, that VCPU is put on the scheduler queue. An event may arrive that would normally wake the VCPU, but if the system is overcommitted, it may take much longer for the VCPU to actually be woken up. Most clouds are also designed to allow multiple VCPUs per physical CPU, and there is certainly capping in place, so you can see steal time even though you are getting your full share. Let me give an example: 1) You are capped at 50%. You run for your full 50% and go idle; the hypervisor sees that you've exhausted your slice and doesn't schedule you until the next one. No steal time is reported. 2) You are capped at 50%, and you have a neighbor attempting to use their full time slice. Instead of letting you run for the first half of the slice and the neighbor for the second half, the hypervisor carves the slice into 10 slots and schedules the two of you in alternating slots. Both guests see 50% steal time. You get the same performance in both scenarios even though the reported steal time is different. |
It's derived from stress testing and production use across a variety of virtualisation platforms, and it has generally proven pretty accurate.
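
For anyone who wants to see the two capped scenarios from the comment above as numbers, here is a minimal Python sketch. It uses a toy accounting model that is my own assumption, not any real hypervisor's implementation: one scheduling period is split into 10 slots, and a slot counts as steal only when the guest wanted to run but was not scheduled. The names `account`, `SLOTS`, and the `scenario_*` lists are hypothetical and exist only for illustration.

```python
SLOTS = 10  # assumed: one scheduling period carved into 10 equal slots


def account(demand, schedule):
    """Return (run_slots, steal_slots) for one guest over one period.

    demand[i]   -- True if the guest wanted to run in slot i
    schedule[i] -- True if the hypervisor actually ran the guest in slot i

    Toy model (an assumption): a slot is steal time only when the guest
    was runnable but not scheduled. Idle slots never count as steal.
    """
    if len(demand) != len(schedule):
        raise ValueError("demand and schedule must cover the same slots")
    run = sum(1 for d, s in zip(demand, schedule) if d and s)
    steal = sum(1 for d, s in zip(demand, schedule) if d and not s)
    return run, steal


# Scenario 1: the guest runs for the first half of the period, then idles.
scenario_1_demand = [i < SLOTS // 2 for i in range(SLOTS)]
scenario_1_schedule = [i < SLOTS // 2 for i in range(SLOTS)]

# Scenario 2: the guest is always runnable, but only gets alternating slots
# because a neighbor is scheduled in the slots in between.
scenario_2_demand = [True] * SLOTS
scenario_2_schedule = [i % 2 == 0 for i in range(SLOTS)]

for name, demand, schedule in (
    ("scenario 1", scenario_1_demand, scenario_1_schedule),
    ("scenario 2", scenario_2_demand, scenario_2_schedule),
):
    run, steal = account(demand, schedule)
    print(f"{name}: ran {run}/{SLOTS} slots, steal {100 * steal // SLOTS}%")
```

Under these assumptions both scenarios run for 5 of the 10 slots, but the first reports 0% steal and the second 50%, which is the point the commenter is making: same CPU share, different steal figure.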