by thesandlord 3745 days ago
Disclaimer: I work for Google Cloud

> "Google probably has the best networking technology on the planet." How do we quantify this?

In the article they did a bunch of tests. Quote: GCP does roughly 7x better for the comparison of 4-core machines, but for the largest machine sizes networking performance is roughly equivalent.

There is also https://github.com/GoogleCloudPlatform/PerfKitBenchmarker if you want to benchmark things yourself.

Seriously, try it yourself. I think you will be pleasantly surprised.

> I would much rather create a service that can tolerate single node outages than relying on "live migrations".

Services should tolerate node failure even on GCP; live migration does not really help with that. It's more about reducing ops. With AWS, you have to manually reboot your machines when an infra upgrade happens. With GCP it is automatic.

> I am not sure what he meant by the SSD comparison, Amazon EBS that can be SSD but still it is a network mounted storage.

I'm not too sure what your question is?

> Discarding Azure was purely arbitrary

Agreed, I would love to know more about why they didn't consider Azure.


Disclaimer: I used to work for Amazon and do not own any AMZN anymore.

1. AWS explicitly tells you up front that smaller instance sizes come with smaller network throughput. This is well known and well communicated, even when you browse the instance offerings. Doing 7x better on a 4-core instance is hardly relevant (depending on the actual CPU type, though): saturating your pipe would probably consume much of your CPU time, and you could hardly do anything else on the box. You can prove me wrong on this one. Synthetic benchmarks are not really relevant for production use cases.

A good read on the subject: http://www.brendangregg.com/activebenchmarking.html

2. On reducing OPS. You are implying that these OPSy things are not automated. You should ask your SRE co-workers about this one. For running a website at this scale, you absolutely need to automate the case where a server is rebooted. Meaning, on shutdown it needs to remove itself from the load balancer or from the resource pool, and when it comes back it has to put itself back. In the worst case you can just terminate the instance and let auto-scaling do its job. All of these operations need no human attention in most cases, but I do understand that some smaller customers are not so advanced with automation, and GCP might be optimizing for those clients.

3. I do not have a question. I pointed out that in the article the author is talking about EBS, while it might appear to the reader that he is talking about some sort of local SSD.

4. Great! I would like to know it too! We should petition together. :)
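
To make the reboot handling in point 2 concrete, here is a minimal sketch of an instance that removes itself from a pool on shutdown and puts itself back on boot. Everything in it is an assumption for illustration: `ResourcePool` and `Instance` are hypothetical in-memory stand-ins, not any cloud provider's API, and a real setup would call the load balancer's own registration endpoints from boot and shutdown hooks.

```python
class ResourcePool:
    """Hypothetical in-memory stand-in for a load balancer's backend pool."""

    def __init__(self):
        self.members = set()

    def register(self, instance_id):
        self.members.add(instance_id)

    def deregister(self, instance_id):
        # discard() is a no-op if the instance was never registered
        self.members.discard(instance_id)


class Instance:
    """Hypothetical server that manages its own pool membership."""

    def __init__(self, instance_id, pool):
        self.instance_id = instance_id
        self.pool = pool

    def on_boot(self):
        # Put itself back into rotation once it is up
        self.pool.register(self.instance_id)

    def on_shutdown(self):
        # Take itself out of rotation before going down
        self.pool.deregister(self.instance_id)

    def reboot(self):
        self.on_shutdown()
        self.on_boot()


pool = ResourcePool()
web1 = Instance("web-1", pool)
web2 = Instance("web-2", pool)
web1.on_boot()
web2.on_boot()

web1.on_shutdown()            # infra upgrade forces a reboot of web-1
during_reboot = set(pool.members)
web1.on_boot()                # web-1 comes back and re-registers
after_reboot = set(pool.members)
```

During the reboot only `web-2` receives traffic, and afterwards both instances are back in the pool with no human involved.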

(Disclaimer: I work on the hypervisor that lives under Google Compute Engine)

1. PerfKitBenchmarker includes meaningful benchmarks for things like Redis, Aerospike, Memcache, etc. We expect GCE to score well on these when measured in terms of performance/$, and a chunk of why we expect that is superior network performance. Even small instance sizes tend to saturate their provisioned network long before they saturate provisioned CPU; GCE provisions more network (up to 2 Gbps/vCPU per our public docs).

This also applies to custom VM shapes. This allows workloads like memcache (which require very little CPU per request, typically) to be provisioned on small instances that still have relatively beefy networks with oodles of RAM with costs proportioned appropriately.

2. GCE handles instance failures differently from EC2. Certainly both platforms will have instance failures that cannot be solved with migration; this is absolutely something software stacks must work around. Live migration allows us to drive down the number of failure modes that cause a discontinuity in instance lifecycle, but obviously they cannot be eliminated entirely.

That said, when an instance in GCE fails it is by default restarted as quickly as possible (possibly on another host). To the guest this appears as an unplanned reboot. My understanding is that you can accomplish the same on EC2 by 'recovering' an instance[0], and that further you can automate this recovery with CloudWatch, but none of that is required on GCE.

I think we're in full agreement in terms of automating OPS, I'm just of the (obviously strongly biased) opinion that GCP is ahead in terms of automating things on behalf of customers "out of the box".

[0]: I previously worked at Amazon, but in Retail at a time when the deployment tools for EC2 were... somewhat exotic. I lack experience with what the general best practices recommended to external customers are.

1. Thanks Jon, this is exactly the sort of comment I was looking for. Yes, I totally agree: if you have a memcache use case you are going to hit network limitations before you hit CPU. I was just pointing out that HTML rendering is different from running memcache or a distributed disk-persisted key-value store. Amazon figured out the need for different use cases and introduced R3 instance types with few cores, a large amount of memory and enhanced networking support. This is why I found it a little bit unfortunate to make general statements like "4 core instance has better networking on GCP". It depends on which instance type you are using.

https://aws.amazon.com/about-aws/whats-new/2014/04/10/r3-ann...

2. Agreed, making it easier for the customers is always better.

Heh, I was working there when Retail moved to EC2, much fun! :)

Google Cloud Platform offers Custom Machine Types specifically to help you configure the optimal CPU/RAM combination:

https://cloud.google.com/custom-machine-types/

Quizlet's post alludes to Google's attitude as well. With the exception of GPU instances, Google's VMs are generic. You are able to get incredibly fast SSDs, best-in-class networking, etc., on just typical instances. The benefits: pricing is simpler, the spot instance/preemptible VM market is simpler, and you get much more architectural flexibility.

(Disclaimer - work on Big Data @ Google Cloud)

That should probably be emphasised a bit more in both the article & in general. It's fairly common to have wasted RAM or CPU or whatever because you had to pick a particular instance type in AWS ("I need better networking, so I'll have to pick a larger instance ... pity I don't need those extra cores").

1. For the micro instance and the 32-core instance, the difference between AWS and Google is not a big deal. For the rest of the instances, Google Cloud is 4 - 7 times faster. That's not a "synthetic" benchmark.

2. Yes, everyone should automate Ops, but if the Cloud provider takes away some of the pain, it's a win.

1. I am not arguing with that. I am arguing that it is not relevant for most use cases, because you reach the conclusion based on a synthetic benchmark. Real-life example: when running a service that renders HTML, you use most of your CPU time for the actual rendering and some for the communication, so you are not network bound even on a 4-7 times slower network. Again, you might find a use case that uses very little CPU and all of the network IO. In that case it is relevant that GCP is 4-7 times faster.

2. Sure, and this is hardly relevant to me because I automate most of my work. For small customers with less automation it is more relevant, as I pointed out.
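
The CPU-bound versus network-bound argument in point 1 can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: the `bottleneck` helper is hypothetical, and every number fed into it (CPU time per request, response size, link speed) is a made-up assumption, not a measurement from either cloud.

```python
def bottleneck(cores, cpu_ms_per_request, bytes_per_response, network_gbps):
    """Return (cpu_limit_rps, network_limit_rps, limiting_resource).

    cpu_limit_rps:     requests/s the cores can process
    network_limit_rps: requests/s the link can carry (responses only)
    """
    cpu_limit_rps = cores * 1000.0 / cpu_ms_per_request
    network_limit_rps = network_gbps * 1e9 / 8 / bytes_per_response
    limiting = "cpu" if cpu_limit_rps <= network_limit_rps else "network"
    return cpu_limit_rps, network_limit_rps, limiting


# Hypothetical HTML rendering service: heavy CPU per request, 50 KB pages.
html = bottleneck(cores=4, cpu_ms_per_request=20,
                  bytes_per_response=50_000, network_gbps=1.0)

# Hypothetical cache-like service: almost no CPU per request, 1 KB values.
cache = bottleneck(cores=4, cpu_ms_per_request=0.02,
                   bytes_per_response=1_000, network_gbps=1.0)

for name, result in (("html", html), ("cache", cache)):
    cpu_rps, net_rps, limiting = result
    print(f"{name}: cpu limit {cpu_rps:.0f} rps, "
          f"network limit {net_rps:.0f} rps, bound by {limiting}")
```

With these assumed inputs the rendering service runs out of CPU long before it fills the link, so a faster network changes nothing for it, while the cache-like service hits the network limit first and would benefit directly.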