by devhwrng 2377 days ago
The technical details about the E2 instance class are really interesting:

https://cloud.google.com/blog/products/compute/understanding...

Rather than a guaranteed core and RAM as with N1/N2, resources for the underlying host can be dynamically balanced through live migrations, which GCP has already been using for years. Cool solution, and should work to save money for most workloads.

2 comments

I wonder when we will get instances that can scale dynamically at runtime!

That would be so cool, just adding cores if the load goes up!

You would have to make sure your code has enough threads ready to fill those cores though! (if you use non-blocking async stuff)

Or is this what they mean it already has?

Edit: thinking more about this, it must be really hard and require kernel fixes?

I mean how would linux behave when you add/remove cores and RAM f.ex.?
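
For what it's worth, Linux exposes the set of currently online CPUs at `/sys/devices/system/cpu/online` as a range list such as `0-3,5`, so a process can re-read it and resize its worker pool. A minimal sketch, assuming that file is readable; `desired_workers` and its one-worker-per-CPU rule are made up for illustration, not taken from any library:

```python
def parse_cpu_list(text):
    """Parse a Linux cpulist string such as "0-3,5" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or stray comma
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def online_cpus(path="/sys/devices/system/cpu/online"):
    """Read the CPUs the kernel currently reports as online (Linux only)."""
    with open(path) as f:
        return parse_cpu_list(f.read())


def desired_workers(cpu_ids, per_cpu=1, minimum=1):
    """Hypothetical sizing rule: per_cpu workers per online CPU, never below minimum."""
    return max(minimum, len(cpu_ids) * per_cpu)
```

Calling `desired_workers(online_cpus())` periodically would let a pool grow or shrink as cores come and go. Whether the application copes with the pool changing size is a separate problem.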

It’s a lot easier and safer to scale hosts horizontally than vertically. You can predict the limits and behavior of each host, the VMs/processes on each host don’t need to deal with fundamental resources changing, etc. For services I own that are high availability, require GC tuning, etc., these hosts with dynamic resource adjustments (also T2/T3 in AWS) are a nightmare because the behavior can change at runtime under load, exactly when I want it to behave predictably.
Sure but:

1) Some things can't scale on more hosts, like say an action MMO with no sharding.

2) Scaling dynamically does not necessarily mean you have to do it unpredictably.

Are you running them as unlimited or standard?

Oh definitely there are valid use cases for these, was just sharing my experience with them for my use cases.

We moved off of T2s and back to C instances because of the unpredictable behavior under load. IIUC, T3s by default just bill you more instead of throttling the CPU, which is a bit better for our use cases, but we haven’t tried them yet.

Aha, thanks for that very valuable information!

T3 looks cheaper and better than E2 then; my only problem is region placement, where Iowa and Taiwan are more central than anything AWS offers (still no central US region!?).

I'm in the MMO business, so very specific requirements.

Disclosure: I work on Google Cloud.

T3 is pretty different (even in unlimited mode) than E2. As an example, t3.xlarge (4 vCPU, 16 GB, $.167/hr, so $.042/hr/vCPU roughly) only has a baseline performance of 40% (so 1.6 vCPU). If you cross that threshold in unlimited mode you pay an additional $.05/vCPU/hr (so more than doubling your cost). By comparison an e2-standard-4 is $.134/hour even if you run it flat out.
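
To put the numbers above side by side, here is a rough sketch of the arithmetic. It is a simplification: it ignores CPU credit balances and just bills every vCPU-hour above the baseline at the surplus rate. The function and parameter names are made up for illustration, and the prices are the ones quoted in this comment.

```python
def t3_unlimited_hourly(base_price, vcpus, baseline_fraction, surplus_price, avg_utilization):
    """Approximate hourly cost of a burstable instance in unlimited mode.

    Simplified model: usage above the baseline is billed per vCPU-hour at
    surplus_price; accrued CPU credits are ignored.
    """
    baseline_vcpus = vcpus * baseline_fraction
    used_vcpus = vcpus * avg_utilization
    surplus_vcpus = max(0.0, used_vcpus - baseline_vcpus)
    return base_price + surplus_vcpus * surplus_price


# Figures quoted above for t3.xlarge and e2-standard-4 (USD per hour).
T3_XLARGE = dict(base_price=0.167, vcpus=4, baseline_fraction=0.40, surplus_price=0.05)
E2_STANDARD_4_HOURLY = 0.134

flat_out = t3_unlimited_hourly(avg_utilization=1.0, **T3_XLARGE)
print(f"t3.xlarge flat out: ${flat_out:.3f}/hr vs e2-standard-4: ${E2_STANDARD_4_HOURLY:.3f}/hr")
```

Under this model a t3.xlarge running flat out uses 2.4 vCPUs above its 1.6 vCPU baseline, so it pays about $0.12/hr on top of the $0.167/hr base price.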

We take on the statistical multiplexing over the datacenter and move VMs around, instead of pushing it to you as an economic or performance-throttling risk when you need it most. If you want a burstable type, we do have an e2-{micro, small, medium} that only guarantees you 12.5%, 25% and 50% of your 2 guest-visible vCPUs. But that's more fit for dev workstations and so on.
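
As a quick illustration of those shared-core guarantees, reading the percentages as applying to each of the 2 guest-visible vCPUs (the table and function names here are just for this sketch):

```python
GUEST_VISIBLE_VCPUS = 2

# Guaranteed share of each guest-visible vCPU, as quoted above.
GUARANTEED_FRACTION = {
    "e2-micro": 0.125,
    "e2-small": 0.25,
    "e2-medium": 0.50,
}


def guaranteed_vcpus(machine_type):
    """Total vCPU time guaranteed for a shared-core type, in whole-vCPU units."""
    return GUEST_VISIBLE_VCPUS * GUARANTEED_FRACTION[machine_type]


for name in GUARANTEED_FRACTION:
    print(name, guaranteed_vcpus(name))
```

So the guaranteed floor works out to 0.25, 0.5 and 1.0 vCPU respectively. Anything above that is burst.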

CPU hotplug has been supported for a long time. I once managed some Sun boxes that allowed replacing/upgrading CPUs without shutting down... They don't build em like that anymore.
Disclosure: I work on Google Cloud.

Yes, but most workloads are fairly unprepared for this sadly. And they're really not ready for memory unplug. (I also miss the days of my multi socket boxes and plugging in CPUs and memory).

> And they're really not ready for memory unplug.

What do VM-guest memory-balloon drivers do right now when the host suddenly attempts to reserve more memory than the guest has free? I'd presume the kernel would just consider itself to be in an OOM condition, and start killing processes to free up the memory until it can return OK to the balloon driver, no?

Because, from what I understand, that's closer to the scenario we're talking about here: you're not abruptly yanking DIMMs (like physical memory hotplug); rather, you (the hypervisor) are gracefully letting the guest know that some memory is about to go away, and since you (the hypervisor) have your own virtual TLB, you can let the guest OS decide which "physical" memory (from its perspective) is going away, before it happens.
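
A toy model of the cooperative handoff described above, purely to make the bookkeeping concrete. It is not how any real balloon driver is implemented: `ToyGuest` and its methods are hypothetical, and it sidesteps the OOM question by granting fewer pages than requested, which is just one possible policy.

```python
from dataclasses import dataclass


@dataclass
class ToyGuest:
    """Guest memory accounting in pages: total = used + balloon + free."""
    total_pages: int
    used_pages: int
    balloon_pages: int = 0

    @property
    def free_pages(self):
        return self.total_pages - self.used_pages - self.balloon_pages

    def inflate(self, requested):
        """Hypervisor asks for pages; the guest hands over only pages it has free."""
        granted = max(0, min(requested, self.free_pages))
        self.balloon_pages += granted
        return granted

    def deflate(self, pages):
        """Hypervisor returns pages; they become free for the guest again."""
        released = max(0, min(pages, self.balloon_pages))
        self.balloon_pages -= released
        return released


guest = ToyGuest(total_pages=1000, used_pages=700)
granted = guest.inflate(500)  # only 300 pages are free, so only 300 are granted
```

The point is that the guest, not the hypervisor, chooses which of its pages leave, so nothing in use gets yanked out from under a running process.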

Yep! I was just responding to the explicit "how come you don't do hotplug" :).
Linux and Windows have both supported it, but it tends to be used at the fringes, on mainframe/datacenter machines that are validated for it, so those paths aren't tested across a very wide variety of hardware and running applications. And adding CPUs and memory is one thing, but removing them is another.
CPU cores being hotplugged on & off was actually super common for a few years, and still is in a lot more devices than you'd expect.

It used to be a cornerstone of power management on mobile devices. The Nexus 5, for example, would regularly run with just a single core online, hotplugging the other 3 off until it was hit with load and then bringing cores back online one by one as needed.

That behavior still exists in some corners of the mobile world, but increasingly less so.

So the CPU hotplug path is, as a result, actually a lot more battle-hardened than you'd expect, and a lot more consumer software than you'd think ran just fine in that setup without noticing.
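
For anyone who wants to poke at this on a Linux box: the kernel exposes a per-CPU `online` file under `/sys/devices/system/cpu/cpuN/`, and writing `0` or `1` to it takes that core offline or brings it back. A minimal sketch, assuming root privileges and a kernel built with CPU hotplug support; the helper names are made up, and the `sysfs_root` parameter is only there so it can be pointed at a scratch directory instead of the real sysfs:

```python
from pathlib import Path

DEFAULT_SYSFS_ROOT = "/sys/devices/system/cpu"


def set_cpu_online(cpu, online, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Write 1 or 0 to cpu<N>/online to bring a core online or take it offline."""
    control = Path(sysfs_root) / f"cpu{cpu}" / "online"
    control.write_text("1" if online else "0")


def is_cpu_online(cpu, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Return True if cpu<N>/online currently reads 1."""
    control = Path(sysfs_root) / f"cpu{cpu}" / "online"
    return control.read_text().strip() == "1"
```

Not every CPU necessarily has this file (the boot CPU may not be removable on some systems), so real code would want to check that it exists first.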

It's been supported with VMware for a while too, maybe a whole decade.
> I mean how would linux behave when you add/remove cores and RAM f.ex.?

This is already possible on ESXi with Linux guests for years now, so it’s certainly a solved problem in some capacity.

> This is already possible on ESXi with Linux guests for years now, so it’s certainly a solved problem in some capacity.

And has been possible on KVM (e.g. virt-manager, RHV/RHEV/oVirt) for years too.

Xen 3 also supported online memory increases (but I don't think CPU).

I presume that this means that E2 instances won't have access to local scratch NVMe, since making use of local scratch NVMe disks currently prevents any feature that requires a live migration, like auto-migration on host maintenance, or modifying the VM's specs while stopped (as you can't stop VMs with local storage, only terminate them permanently.)
You can still get migrated with a local SSD:

"Compute Engine can also live migrate instances with local SSDs attached, moving the VMs along with their local SSD to a new machine in advance of any planned maintenance." [1]

[1] https://cloud.google.com/compute/docs/instances/live-migrati...