Hacker News
by jsolson 3753 days ago
(Disclaimer: I work on the hypervisor that lives under Google Compute Engine)

1. PerfKitBenchmarker includes meaningful benchmarks for things like Redis, Aerospike, Memcache, etc. We expect GCE to score well on these when measured in terms of performance/$, and a large part of why we expect that is superior network performance. Even small instance sizes tend to saturate their provisioned network long before they saturate provisioned CPU; GCE provisions more network (up to 2 Gbps/vCPU per our public docs).

This also applies to custom VM shapes, which allows workloads like memcache (which typically require very little CPU per request) to be provisioned on small instances that still have relatively beefy networks and oodles of RAM, with costs proportioned appropriately.
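As a rough illustration of the "network saturates before CPU" argument, here is a back-of-the-envelope sketch comparing the request rate each resource could sustain. The per-request CPU cost and payload size are hypothetical numbers chosen for illustration, and the 2 Gbps/vCPU figure is the upper bound quoted above, not a guaranteed rate.

```python
# Toy model: which resource caps a workload first? All inputs are
# illustrative assumptions, not measurements of any real platform.

GBPS_PER_VCPU = 2.0  # assumed: the "up to 2 Gbps/vCPU" upper bound quoted above


def max_requests_per_sec(vcpus, cpu_us_per_request, bytes_per_request):
    """Return (cpu_limited_rps, network_limited_rps) for one instance."""
    # Each vCPU supplies 1,000,000 microseconds of CPU time per second.
    cpu_limited = vcpus * 1_000_000 / cpu_us_per_request
    # Network budget in bits/s divided by the bits moved per request.
    network_bps = vcpus * GBPS_PER_VCPU * 1e9
    network_limited = network_bps / (bytes_per_request * 8)
    return cpu_limited, network_limited


def bottleneck(vcpus, cpu_us_per_request, bytes_per_request):
    """Name the resource that saturates first: "network" or "cpu"."""
    cpu, net = max_requests_per_sec(vcpus, cpu_us_per_request, bytes_per_request)
    return "network" if net < cpu else "cpu"


if __name__ == "__main__":
    # Hypothetical memcache-like request: ~10 us of CPU, 4 KiB on the wire.
    print("cache:", bottleneck(2, 10, 4096))
    # Hypothetical HTML-rendering request: ~5 ms of CPU, ~50 KB response.
    print("render:", bottleneck(2, 5000, 50_000))
```

Because both limits scale linearly with vCPUs in this toy model, the bottleneck depends only on per-request CPU cost and payload size; real platforms may apply additional per-instance caps, which the sketch ignores.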

2. GCE handles instance failures differently from EC2. Certainly both platforms will have instance failures that cannot be solved with migration; this is absolutely something software stacks must work around. Live migration allows us to drive down the number of failure modes that cause a discontinuity in an instance's lifecycle, but obviously they cannot be eliminated entirely.

That said, when an instance in GCE fails it is by default restarted as quickly as possible (possibly on another host). To the guest this appears as an unplanned reboot. My understanding is that you can accomplish the same on EC2 by 'recovering' an instance[0], and further that you can automate this recovery with CloudWatch, but none of that is required on GCE.

I think we're in full agreement in terms of automating ops; I'm just of the (obviously strongly biased) opinion that GCP is ahead in terms of automating things on behalf of customers "out of the box".

[0]: I previously worked at Amazon, but in Retail at a time when the deployment tools for EC2 were... somewhat exotic. I lack experience with what the general best practices recommended to external customers are.


1. Thanks Jon, this is exactly the sort of comment I was looking for. Yes, I totally agree: if you have a memcache use case, you are going to hit network limitations before you hit CPU limits. I was just pointing out that HTML rendering is different from running memcache or a distributed, disk-persisted key-value store. Amazon recognized the need for different use cases and introduced R3 instance types with few cores, a large amount of memory, and enhanced networking support. This is why I found it a little bit unfortunate to make general statements like "4 core instance has better networking on GCP". It depends on which instance type you are using.

https://aws.amazon.com/about-aws/whats-new/2014/04/10/r3-ann...

2. Agreed, making it easier for the customers is always better.

Heh, I was working there when Retail moved to EC2, much fun! :)

Google Cloud Platform offers Custom Machine Types specifically to help you configure the optimal CPU/RAM combination:

https://cloud.google.com/custom-machine-types/

Quizlet's post alludes to Google's attitude as well. With the exception of GPU instances, Google's VMs are generic: you can get incredibly fast SSDs, best-in-class networking, etc., on just typical instances. The benefits are that pricing is simpler, the spot instance/preemptible VM market is simpler, and you get much more architectural flexibility.

(Disclaimer - work on Big Data @ Google Cloud)

That should probably be emphasised a bit more in both the article & in general. It's fairly common to have wasted RAM or CPU or whatever because you had to pick a particular instance type in AWS ("I need better networking, so I'll have to pick a larger instance ... pity I don't need those extra cores").
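To make the wasted-resources point concrete, here is a minimal sketch that picks the cheapest fixed instance type meeting a requirement, reports the surplus, and compares it with an exactly sized custom shape. The catalog, prices, and per-unit rates are entirely hypothetical and do not reflect real AWS or GCP pricing; bandwidth for the custom shape is assumed to scale at the 2 Gbps/vCPU figure quoted earlier in the thread.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    name: str
    vcpus: int
    ram_gb: float
    net_gbps: float
    hourly_usd: float


# Hypothetical fixed catalog in which network bandwidth grows with size.
CATALOG = [
    Shape("small", 2, 8, 1.0, 0.10),
    Shape("medium", 4, 16, 2.5, 0.20),
    Shape("large", 8, 32, 5.0, 0.40),
    Shape("xlarge", 16, 64, 10.0, 0.80),
]

# Hypothetical custom-shape pricing and per-vCPU bandwidth.
USD_PER_VCPU_HOUR = 0.03
USD_PER_GB_HOUR = 0.004
GBPS_PER_VCPU = 2.0


def cheapest_fixed(vcpus, ram_gb, net_gbps):
    """Cheapest catalog shape that meets every requirement."""
    fits = [
        s for s in CATALOG
        if s.vcpus >= vcpus and s.ram_gb >= ram_gb and s.net_gbps >= net_gbps
    ]
    if not fits:
        raise ValueError("no fixed shape satisfies the requirements")
    return min(fits, key=lambda s: s.hourly_usd)


def custom_shape(vcpus, ram_gb, net_gbps):
    """Smallest custom shape: add vCPUs only when needed for bandwidth."""
    needed_vcpus = max(vcpus, math.ceil(net_gbps / GBPS_PER_VCPU))
    cost = needed_vcpus * USD_PER_VCPU_HOUR + ram_gb * USD_PER_GB_HOUR
    return Shape("custom", needed_vcpus, ram_gb, needed_vcpus * GBPS_PER_VCPU, cost)


if __name__ == "__main__":
    # Memcache-like need: little CPU, modest RAM, lots of network.
    need = dict(vcpus=2, ram_gb=8, net_gbps=5.0)
    fixed = cheapest_fixed(**need)
    custom = custom_shape(**need)
    print(fixed.name, "wastes", fixed.vcpus - need["vcpus"], "vCPUs and",
          fixed.ram_gb - need["ram_gb"], "GB RAM at", fixed.hourly_usd, "USD/h")
    print("custom:", custom.vcpus, "vCPUs at", round(custom.hourly_usd, 3), "USD/h")
```

The sketch ignores any constraints a real platform places on custom shapes; it only shows how a fixed catalog forces you to buy extra cores and RAM to reach a bandwidth tier.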