by JanMa 1397 days ago
How does this setup handle maintenance of the underlying hypervisor host? As far as I know, the VM will be migrated to a new hypervisor and all data on the local SSDs is lost. Can the custom RAID0 array of local SSDs handle this, or does it have to be manually rebuilt on every maintenance?

On GCP, live migration moves the data on the local-ssd to the new host as well.
Oh nice, that's really cool. I am pretty sure last time I checked this was not the case (~2 years ago)
From the article:

> GCP provides an interesting "guarantee" around the failure of Local SSDs: If any Local SSD fails, the entire server is migrated to a different set of hardware, essentially erasing all Local SSD data for that server.

I wonder how md handles reads during the rebuild, and how long it takes to replicate the persistent store back onto the raid0 mirror.
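
On the "how long does it take" half of that question: while md is rebuilding, /proc/mdstat carries a progress line with the percentage done and an estimated finish time. Here is a minimal Python sketch that pulls those numbers out. The function name and the sample text are made up for illustration (the sample follows the usual /proc/mdstat layout but is not output from a real machine), and it says nothing about how reads are served in the meantime.

```python
import re

# Matches the progress line md prints while an array is resyncing or recovering,
# e.g. "recovery = 12.5% (131072/1048576) finish=3.5min speed=4096K/sec".
PROGRESS_RE = re.compile(
    r"(?P<kind>resync|recovery)\s*=\s*(?P<pct>\d+(?:\.\d+)?)%"
    r"\s*\((?P<done>\d+)/(?P<total>\d+)\)"
    r"(?:\s+finish=(?P<finish>\d+(?:\.\d+)?)min)?"
)


def rebuild_progress(mdstat_text):
    """Return a dict describing the first rebuild in progress, or None."""
    m = PROGRESS_RE.search(mdstat_text)
    if m is None:
        return None
    return {
        "kind": m.group("kind"),
        "percent": float(m.group("pct")),
        "blocks_done": int(m.group("done")),
        "blocks_total": int(m.group("total")),
        "minutes_left": float(m.group("finish")) if m.group("finish") else None,
    }


# Hypothetical sample text, hand-written for this example.
SAMPLE = """\
Personalities : [raid0] [raid1]
md1 : active raid1 md0[2] sdb[1]
      1048576 blocks super 1.2 [2/1] [_U]
      [==>..................]  recovery = 12.5% (131072/1048576) finish=3.5min speed=4096K/sec
"""

if __name__ == "__main__":
    print(rebuild_progress(SAMPLE))
```

On a real box you would pass in the contents of /proc/mdstat instead of SAMPLE and poll it until the function returns None.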

I wonder how this looks from the host's perspective. Does the SSD disappear (from a simulated SATA bus that supports hot plug) and reappear? Does it just temporarily return read errors before coming back to life, but with the underlying blocks silently changed (I hope not)? Etc.
I assume the host moves, not the disks. My next assumption is that the host moving would involve downtime for the host, so no need to bother simulating some hotplugs.

(I know that live migrations are at least in theory possible, but I don’t know why GCP would go through all the effort)

(I’m also making a lot of assumptions about things I am not an expert in)

The disks are physically attached to the host. The VM running on that host moves from one host to another. GCP live-migrates every single VM running on GCP roughly once per week, so live migration is definitely seamless. Standard OSS hypervisors support live migration.

When hardware fails, the instance is migrated to another machine and behaves like the power cord was ripped out. It's possible they go down this path for failed disks too, but it's feasible that it is implemented as the disk magically starting to work again but being empty.

You can read more about GCP live migrations here: https://cloud.google.com/compute/docs/instances/live-migrati...

When a local disk fails in an instance, you end up with an empty disk upon live migration. The disk won't disappear, but you'll get IO errors, and then the IO errors will go away once the migration completes but your disk will be empty.
> you'll get IO errors, and then the IO errors will go away once the migration completes but your disk will be empty.

This seems extremely dangerous as nothing notifies the OS to unmount the filesystem and flush its caches, leading to trashing of the new disk as well. The only way to recover would be to manually unmount, drop all IO caches, then reformat and remount.

When standard filesystems like ext4 and xfs hit enough I/O errors, they unmount the filesystem. I find that this happens pretty reliably in AWS at least, and I can't imagine the filesystem possibly continuing to do very much when 100% of the underlying data has disappeared.

That said, from further reading of the GCP docs, it does sound like if they detect a disk failure they will reboot the VM as part of the not-so-live migration.

That's the failure mode for a bad disk, but are you saying that in the normal case of live migration (e.g. a BIOS update needs to be applied to the host machine), the data on the local SSD is also moved to the new host, seamlessly and invisibly to the guest VM?
Yes, under a graceful live migration with no hardware failure, the data is seamlessly moved to a new machine. The problem of moving local data is ultimately no different than live migrating the actual RAM in the machine. Performance does degrade briefly during the migration, but typically this is a very short time window.

You can read more about GCP live migrations here: https://cloud.google.com/compute/docs/instances/live-migrati...
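
To make the "no different than migrating RAM" point concrete, here is a toy Python simulation of iterative pre-copy, the general technique open-source hypervisors use for live-migrating memory. I am not claiming this is how GCP implements it for local SSDs; the function name, the parameters, and the random-write workload are all invented for illustration.

```python
import random


def precopy_migrate(source, write_rate, max_rounds=10, stop_threshold=8, seed=0):
    """Simulate iterative pre-copy of `source` (a list of integer block values).

    write_rate is how many block writes the guest issues per copy round while
    it keeps running (a made-up workload model). Returns (dest, rounds, paused)
    where `paused` is the number of blocks copied during the final brief pause.
    """
    rng = random.Random(seed)
    dest = [None] * len(source)
    dirty = set(range(len(source)))  # first round: every block still needs copying
    rounds = 0
    # Keep copying while the guest runs, until the dirty set is small enough
    # or we give up trying to converge.
    while len(dirty) > stop_threshold and rounds < max_rounds:
        to_copy, dirty = dirty, set()
        for i in to_copy:
            dest[i] = source[i]
        # The guest kept writing in the meantime; those blocks must be re-sent.
        for _ in range(write_rate):
            i = rng.randrange(len(source))
            source[i] += 1
            dirty.add(i)
        rounds += 1
    # Final phase: pause the guest, copy what is still dirty, resume on the new host.
    for i in dirty:
        dest[i] = source[i]
    return dest, rounds, len(dirty)


if __name__ == "__main__":
    blocks = list(range(1000))
    dest, rounds, paused = precopy_migrate(blocks, write_rate=4)
    print(dest == blocks, rounds, paused)
```

The only part the guest would notice is the final pause, and that only has to move the handful of blocks still dirty at that point, which fits with the short performance dip mentioned above.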