by t_sawyer
50 days ago
I've run an OpenStack cloud. NVMe drives local to the host and attached directly to VMs are unbeatable. All clouds offer this. But that storage is ephemeral, and it was when I implemented it in OpenStack too. There's not enough redundancy. You could RAID 1 those NVMe drives before they get attached to a VM, and that helps with hardware failures, but you get fewer of them to attach. Even if you RAID them, there's no good way to move that VM to another host if there's a RAM, CPU, or other hardware issue on that host. VMs with directly attached NVMe drives basically have to be treated as bare metal servers, and you have to do redundancy at the application layer (like database replication).
But again, all of the major cloud services offer these types of machines if you NEED NVMe IO speed. There are quirks though. For example, in Azure it seems like you have to expect the VM to be moved whenever Azure feels like it, and expect that ephemeral data to be wiped. Whereas in OpenStack, we would do local block-level migrations if we HAD to move the VM to another host. That block-level migration required the VM to be turned off, but it did copy the local NVMe data to the other host. When this happened it was all planned, and the particular application had app-level redundancy built in, so it was not a problem. If the host crashed, that particular VM would just be down until the host was fixed and came back online.
|
The trick is building a block storage system that treats the local disk as a write-back cache with async replication to networked storage, like the blog post says they'll be doing.
The async replication has some integrity/recovery concerns for sure, but it's the trick that enables local speeds. And people have been happy with async replication for their databases for a very long time. You just need good observability for the durability delay.
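To make the write-back idea concrete, here's a rough toy sketch in Python. Everything in it is made up for illustration (the `WriteBackCache` class, the dicts standing in for the local NVMe and the networked storage, the `durability_lag` metric); it's not how any particular cloud does it. Writes are acknowledged once they hit the local copy, a background step replicates the oldest dirty blocks, and the "durability delay" is the age of the oldest write that only exists locally.

```python
import time
from collections import OrderedDict


class WriteBackCache:
    """Toy block device: dicts stand in for local NVMe and networked storage."""

    def __init__(self, clock=time.monotonic):
        self.local = {}             # block number -> bytes (fast local disk)
        self.remote = {}            # block number -> bytes (durable networked storage)
        self.dirty = OrderedDict()  # block number -> time of oldest unreplicated write
        self.clock = clock          # injectable so the lag can be tested deterministically

    def write(self, block, data):
        # Acknowledge as soon as the local copy is written; replication comes later.
        self.local[block] = data
        # Keep the timestamp of the first unreplicated write to this block.
        self.dirty.setdefault(block, self.clock())

    def read(self, block):
        # Serve from the local cache, falling back to networked storage on a miss.
        if block not in self.local:
            self.local[block] = self.remote[block]
        return self.local[block]

    def replicate(self, max_blocks=1):
        # Background step: push the oldest dirty blocks to networked storage.
        for _ in range(min(max_blocks, len(self.dirty))):
            block, _ = self.dirty.popitem(last=False)
            self.remote[block] = self.local[block]

    def durability_lag(self):
        # Age in seconds of the oldest write that exists only on the local disk.
        if not self.dirty:
            return 0.0
        oldest = next(iter(self.dirty.values()))
        return self.clock() - oldest
```

If the host dies, anything still in `dirty` is lost, so `durability_lag()` is the number you'd want on a dashboard and in alerts.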
Once you have that, you can do live VM migration if you're careful enough about dirty data. The new node just starts out with an empty cache.
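A minimal sketch of that migration step, again in Python with invented names (`Node`, `migrate`) and a shared dict standing in for networked storage. It assumes writes to the source are paused while the last dirty blocks drain; a real live migration would keep the VM running and only pause for the final few.

```python
class Node:
    """Toy host: a local cache in front of shared networked storage."""

    def __init__(self, remote):
        self.remote = remote  # shared dict standing in for networked storage
        self.cache = {}       # block number -> bytes on the local disk
        self.dirty = set()    # blocks not yet written to networked storage

    def write(self, block, data):
        self.cache[block] = data
        self.dirty.add(block)

    def read(self, block):
        if block not in self.cache:
            # Cold miss: fetch over the network and keep a local copy.
            self.cache[block] = self.remote[block]
        return self.cache[block]

    def flush(self):
        # Drain every dirty block to networked storage.
        for block in sorted(self.dirty):
            self.remote[block] = self.cache[block]
        self.dirty.clear()


def migrate(source, remote):
    # Assumption: the caller has already paused writes to the source node.
    source.flush()
    if source.dirty:
        raise RuntimeError("dirty data left on source; not safe to switch over")
    # The destination starts with an empty cache and warms up on reads.
    return Node(remote)
```

The first reads on the new node run at network speed until the cache warms up, which is the price of not copying the local disk across.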
It's not exactly trivial, but it's also probably not the biggest challenge if you're genuinely building a brand new cloud and going to compete against the hyperscalers. (Hell, hire me and I can write it for you. It'll take time and CPU hours to get stable, but the magic required is only mildly arcane.)
For example: https://dl.acm.org/doi/10.1145/3492321.3524271