

by jsolson 4291 days ago

(note: I work on GCE networking)

Yes, certainly, it's always possible that hardware will fail with no notice, and any architecture needs to account for that (this is equally true if you're running your own infrastructure, of course). For some applications, accepting downtime in these situations may be the right tradeoff; others will need to be architected assuming always-on redundancy. In terms of data durability, though, local SSD gets you a performance win by being local, and the flip side is that if there's a catastrophic failure of that machine, the data is gone.

What live migration buys you, the prospective cloud customer, is transparent avoidance of (a) Google's planned hardware maintenance windows (for example, network or power infrastructure maintenance) and (b) outages due to hardware failures which can be detected and migrated away from before they cause data loss.

The second category includes any number of situations which, if you can't migrate your workload, require taking a machine out of service, replacing some gear, bringing it back up, and hoping the replacement fixes the issue[0]. If we can instead migrate your VM away from the issue (say, for example, it's a bad root hard disk -- we need to replace the drive as lots of things in the VM hosting environment depend on it, but your workload is otherwise completely unaffected), we are able to service the affected machine with zero downtime for your service.

[0]: Think about the number of things that could manifest as a flaky network connection. Could be any (or all) of: bad cable, bad NIC, bad motherboard, bad CPU, or even bad RAM (since NICs use bus-mastering DMA from host memory in most cases). After migrating the workload away, we can take as long as necessary to reliably diagnose and fix the machine. Huge win for us and for our customers.
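
To make the "migrate first, diagnose later" flow concrete, here is a minimal toy sketch in Python. It is not how GCE is implemented: `Host`, `drain`, and the capacity model are hypothetical names invented for illustration, and a real live migration (copying memory and device state while the VM keeps running) is reduced here to moving a name between lists. It only shows the ordering: place every VM elsewhere, then pull the suspect machine out of service.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical model of a machine that hosts VMs."""
    name: str
    capacity: int                 # maximum number of VMs this host can run
    healthy: bool = True          # set to False once a fault is suspected
    in_service: bool = True
    vms: list = field(default_factory=list)


def drain(host, fleet):
    """Move every VM off `host`, then take `host` out of service.

    Returns a dict mapping VM name -> destination host name.
    Raises RuntimeError before moving anything if spare capacity is short.
    """
    targets = [h for h in fleet if h is not host and h.healthy and h.in_service]
    spare = sum(h.capacity - len(h.vms) for h in targets)
    if spare < len(host.vms):
        raise RuntimeError(f"not enough spare capacity to drain {host.name}")

    placements = {}
    for vm in list(host.vms):
        # Choose the least-loaded target that still has room.
        candidates = [h for h in targets if len(h.vms) < h.capacity]
        dest = min(candidates, key=lambda h: len(h.vms))
        dest.vms.append(vm)       # stand-in for the actual live migration
        host.vms.remove(vm)
        placements[vm] = dest.name

    # Only now is the machine pulled; diagnosis can take as long as needed.
    host.in_service = False
    return placements


if __name__ == "__main__":
    fleet = [
        Host("a", capacity=2, vms=["vm1", "vm2"]),
        Host("b", capacity=2, vms=["vm3"]),
        Host("c", capacity=2),
    ]
    fleet[0].healthy = False      # e.g. a flaky NIC was detected on host "a"
    print(drain(fleet[0], fleet))
```

The capacity check runs before any VM moves so that a drain either completes fully or does nothing, which is one simple way to avoid leaving a half-emptied suspect machine in service.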