by StreamBright 3756 days ago
I am not sure about the technical merit of this link. Best of show:

"Google probably has the best networking technology on the planet."

How do we quantify this?

"This is important for several reasons. On EC2, if a node has a hardware problem, it will likely mean that you'll need to restart your virtual machine."

I would much rather create a service that can tolerate single-node outages than rely on "live migrations". I am not sure what he meant by the SSD comparison: Amazon EBS can be SSD-backed, but it is still network-mounted storage.

"Most of GCP's technology was developed internally and has high standards of reliability and performance."

Guess what AWS was developed for.

I like hand-wavy articles as much as the next guy, but it seems to me they picked GCP and wrote an article to justify it, and cooked up some numbers with single-dimension comparisons to make it look scientific. I wish I were working on single-dimension problems in real life, but it is always more complex than that. I am more interested in worst-case scenarios and SLAs than micro-benchmark results when comparing cloud vendors. Discarding Azure was purely arbitrary. In fact, Azure is more than happy running Linux or other non-Windows operating systems, and I am not sure where he got the idea of a "Linux-second cloud".

https://azure.microsoft.com/en-us/blog/running-freebsd-in-az...


Disclaimer: I work for Google Cloud

> "Google probably has the best networking technology on the planet." How do we quantify this?

In the article they did a bunch of tests. Quote: GCP does roughly 7x better for the comparison of 4-core machines, but for the largest machine sizes networking performance is roughly equivalent.

There is also https://github.com/GoogleCloudPlatform/PerfKitBenchmarker if you want to benchmark things yourself.

Seriously, try it yourself. I think you will be pleasantly surprised.

> I would much rather create a service that can tolerate single-node outages than rely on "live migrations".

Services should tolerate node failure even on GCP; live migration does not really help with that. It's more about reducing ops. With AWS, you have to manually reboot your machines when an infra upgrade happens. With GCP it is automatic.

> I am not sure what he meant by the SSD comparison: Amazon EBS can be SSD-backed, but it is still network-mounted storage.

I'm not too sure what your question is?

> Discarding Azure was purely arbitrary

Agreed, would love to know more about why they didn't consider Azure

Disclaimer: I used to work for Amazon and do not own any AMZN anymore

1. AWS explicitly tells you up front that smaller instance sizes come with lower network throughput. This is well known and well communicated, even when you just browse the instance offerings. Doing 7x better on a 4-core instance is hardly relevant (depending on the actual CPU type, though): saturating your pipe would probably consume much of your CPU time, and you could hardly do anything else on the box. You can prove me wrong on this one. Synthetic benchmarks are not really relevant for production use cases.

A good read on the subject: http://www.brendangregg.com/activebenchmarking.html

2. On reducing ops. You are implying that these opsy things are not automated. You should ask your SRE co-workers about this one. To run a website at this scale, you absolutely need to automate the cases where a server is rebooted. Meaning, on shutdown it needs to remove itself from the load balancer or from the resource pool, and when it comes back it has to put itself back. Worst case, you can just terminate the instance and let auto-scaling do its job. All of these need no human attention in most cases, but I do understand that some smaller customers are not so advanced with automation, and GCP might be optimizing for those clients.
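
A minimal sketch of that shutdown/boot lifecycle, assuming a hypothetical in-memory `LoadBalancerPool` in place of a real load balancer API. The class, the hook names, and the instance IDs are all made up for illustration; real hooks would call the provider's API instead.

```python
from dataclasses import dataclass, field


@dataclass
class LoadBalancerPool:
    """Hypothetical in-memory stand-in for a load balancer's backend pool."""

    backends: set = field(default_factory=set)

    def register(self, instance_id: str) -> None:
        self.backends.add(instance_id)

    def deregister(self, instance_id: str) -> None:
        # discard() is idempotent, so a repeated shutdown hook is harmless.
        self.backends.discard(instance_id)


def on_shutdown(pool: LoadBalancerPool, instance_id: str) -> None:
    """Shutdown hook: stop receiving traffic before the machine goes away."""
    pool.deregister(instance_id)


def on_boot(pool: LoadBalancerPool, instance_id: str, healthy: bool) -> bool:
    """Boot hook: rejoin the pool only if the local health check passed."""
    if healthy:
        pool.register(instance_id)
    return healthy
```

The point of the sketch is that both hooks are idempotent and run without a human in the loop, so a planned reboot and an unplanned one look the same to the pool.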

3. I do not have a question. I was pointing out that the author of the article is talking about EBS, while it might appear to the reader that he is talking about some sort of local SSD.

4. Great! I would like to know it too! We should petition together. :)

(Disclaimer: I work on the hypervisor that lives under Google Compute Engine)

1. PerfKitBenchmarker includes meaningful benchmarks for things like Redis, Aerospike, Memcache, etc. We expect GCE to score well on these when measured in terms of performance/$, and a chunk of why we expect that is superior network performance. Even small instance sizes tend to saturate their provisioned network long before they saturate provisioned CPU; GCE provisions more network (up to 2 Gbps/vCPU per our public docs).

This also applies to custom VM shapes. This allows workloads like memcache (which require very little CPU per request, typically) to be provisioned on small instances that still have relatively beefy networks with oodles of RAM with costs proportioned appropriately.

2. GCE handles instance failures differently from EC2. Certainly both platforms will have instance failures that cannot be solved with migration; this is absolutely something software stacks must work around. Live migration allows us to drive down the number of failure modes that cause a discontinuity in the instance lifecycle, but obviously they cannot be eliminated entirely.

That said, when an instance in GCE fails it is by default restarted as quickly as possible (possibly on another host). To the guest this appears as an unplanned reboot. My understanding is that you can accomplish the same on EC2 by 'recovering' an instance[0], and that you can further automate this recovery with CloudWatch, but none of that is required on GCE.

I think we're in full agreement in terms of automating ops. I'm just of the (obviously strongly biased) opinion that GCP is ahead in terms of automating things on behalf of customers "out of the box".

[0]: I previously worked at Amazon, but in Retail at a time when the deployment tools for EC2 were... somewhat exotic. I lack experience with what the general best practices recommended to external customers are.

1. Thanks Jon, this is exactly the sort of comment I was looking for. Yes, I totally agree: if you have a memcache use case you are going to hit network limitations before you hit CPU. I was just pointing out that HTML rendering is different from running memcache or a distributed disk-persisted key-value store. Amazon figured out the need for different use cases and introduced R3 instance types with few cores, a large amount of memory and enhanced networking support. This is why I found it a little unfortunate to make general statements like "4 core instance has better networking on GCP". It depends on which instance type you are using.

https://aws.amazon.com/about-aws/whats-new/2014/04/10/r3-ann...

2. Agreed, making it easier for the customers is always better.

Heh, I was working there when Retail moved to EC2, much fun! :)

Google Cloud Platform offers Custom Machine Types specifically to help you configure the optimal CPU/RAM combination:

https://cloud.google.com/custom-machine-types/

Quizlet's post alludes to Google's attitude as well. With the exception of GPU instances, Google's VMs are generic. You are able to get incredibly fast SSDs, best-in-class networking, etc., on just typical instances. The benefits are that pricing is simpler, the spot instance/preemptible VM market is simpler, and you get much more architectural flexibility.

(Disclaimer - work on Big Data @ Google Cloud)

That should probably be emphasised a bit more in both the article & in general. It's fairly common to have wasted RAM or CPU or whatever because you had to pick a particular instance type in AWS ("I need better networking, so I'll have to pick a larger instance ... pity I don't need those extra cores").
1. For the micro-instance and the 32-core instance, the difference between AWS and Google is not a big deal. For the rest of the instances, Google Cloud is 4-7 times faster. That's not a "synthetic" benchmark.

2. Yes, everyone should automate ops, but if the cloud provider takes away some of the pain, it's a win.

1. I am not arguing with that. I am arguing that it is not relevant for most use cases, because you draw the conclusion from a synthetic benchmark. Real-life example: when running a service that renders HTML, you use most of your CPU time for the actual rendering and some for the communication, so you are not network bound even on a 4-7 times slower network. Again, you might find a use case that uses very little CPU and all of the network IO. In that case it is relevant that GCP is 4-7 times faster.
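
A back-of-the-envelope sketch of that argument. Every number in it (CPU milliseconds per request, response sizes, link speeds) is an illustrative assumption, not a measurement of either cloud, and `bottleneck` is a hypothetical helper.

```python
def bottleneck(cores, cpu_ms_per_request, response_kb, network_gbps):
    """Return which resource caps throughput, and the cap in requests/s.

    Assumes requests parallelize perfectly across cores and that the
    response body is the only traffic on the link (both simplifications).
    """
    cpu_rps = cores * 1000.0 / cpu_ms_per_request
    bytes_per_request = response_kb * 1000
    net_rps = network_gbps * 1e9 / 8 / bytes_per_request
    limit = "cpu" if cpu_rps <= net_rps else "network"
    return limit, min(cpu_rps, net_rps)


# Assumed HTML rendering workload: 20 ms of CPU and a 50 KB page per request.
print(bottleneck(cores=4, cpu_ms_per_request=20, response_kb=50, network_gbps=1.0))
# The same workload on a link assumed to be 7x slower.
print(bottleneck(cores=4, cpu_ms_per_request=20, response_kb=50, network_gbps=1.0 / 7))
# Assumed memcache-like workload: 0.02 ms of CPU and a 1 KB value per request.
print(bottleneck(cores=4, cpu_ms_per_request=0.02, response_kb=1, network_gbps=1.0))
```

Under these assumptions the rendering workload stays CPU bound on both links, while the memcache-like workload is network bound, which is the case where the faster network matters.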

2. Sure and this is hardly relevant to me because I automate most of my work. For small customers with less automation it is more relevant as I pointed out.

If you're interested in datacenter networking, can't miss ONS. Amin Vahdat, the guy in charge of networking at Google, has keynoted the past two years, with hardware in 2015[1] and software in 2014[2]. Mark Russinovich also spoke about Azure's SDN, but it doesn't come close to Amin's presentation.

Every once in a while you'll hear some random spec from one of these companies, and it's always pretty surprising, but Amin's team has achieved 5 Petabit/s in bisection bandwidth. It's more than surprising.

[1] https://www.youtube.com/watch?v=FaAZAII2x0w

[2] https://www.youtube.com/watch?v=n4gOZrUwWmc

[3] https://www.youtube.com/watch?v=RffHFIhg5Sc

Thanks for the links! This is a pretty interesting topic and I am going to watch these videos.
We did a performance analysis of Google Cloud vs AWS. The results are in line with what is published in the post. The biggest thing that we cannot quantify is "ease of use". Google Cloud is a pleasure to work with. AWS feels so clunky compared to Google Cloud. Don't take my word for it. Create a VM and log in to it on both AWS and Google Cloud, and you will change your opinion about what a good cloud is.
If you're using the GUI to manage your resources rather than going the Infrastructure As Code route, you're probably doing it wrong. You should be using a tool like Terraform, which lets you use multiple cloud providers (https://www.terraform.io/docs/providers/), and can actually tell you if there are any immediate errors before attempting to launch a resource, so is friendly with Jenkins or any other CI tool you prefer to use as a result as well.
We don't use the GUI to manage our resources. We use CloudFormation for AWS and Deployment Manager for Google. Let me tell you a couple of things about those services. In AWS some resources are zonal, some regional and some global. It's a mess to work with. For example, the same AMI image has different IDs in different regions. You need to create maps and stuff to make your code work across regions. Come to Google Cloud: no more zonal/regional/global fuss. An image is a global resource. It's available under the same ID in all regions. Your infrastructure template looks much cleaner. Combined with the power of Jinja, you can create far more powerful templates and evaluate them on the fly. AWS has "three" queuing systems and "two" storage solutions with different APIs and different quirks. Google just has one and it nails the use cases for queuing and storage. AWS micro-instances go poof without any notice. Their NATs are known for being unreliable. Load balancers can't scale. In every service that I looked into, Google is way better than AWS.
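
To make the AMI point concrete, here is a rough sketch of the kind of per-region map being described, next to the global-image case. The IDs and the image name are placeholders, not real resources, and both helper functions are hypothetical.

```python
# Placeholder IDs for illustration only; these are not real AMIs.
REGION_TO_AMI = {
    "us-east-1": "ami-aaaa1111",
    "eu-west-1": "ami-bbbb2222",
}


def ami_for_region(region: str) -> str:
    """Per-region lookup: every new region needs its own map entry."""
    try:
        return REGION_TO_AMI[region]
    except KeyError:
        raise ValueError(f"no AMI mapped for region {region!r}") from None


# Hypothetical name of an image that is visible under one ID everywhere.
GLOBAL_IMAGE = "my-base-image"


def image_for_region(region: str) -> str:
    """Global-image case: the region does not affect the identifier."""
    return GLOBAL_IMAGE
```

The first helper fails loudly for any region missing from the map, which is the maintenance burden the comment is complaining about; the second has nothing to maintain.
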
AWS definitely feels like a product that has grown organically and needs some house cleaning. GC was able to look at AWS, take those lessons, and improve from the start.
Being late to the party has its perks. They also need to work harder at convincing people to move over, and they do a good job of that with Spotify and other companies who talk openly about moving to GCP.
> AWS has "three" queuing systems and "two" storage solutions with different APIs and different quirks. Google just has one and it nails the use cases for queuing and storage.

This is not currently true. Google has: Datastore, Cloud SQL, Bigtable, BigQuery and Cloud Storage [1]. Each is intended for a different use case, as are Amazon's offerings.

[1] https://cloud.google.com/datastore/docs/concepts/overview#da...

(disclaimer: I work on Google Compute Engine)

For queuing AWS has at least SQS and SNS, both of which solve roughly half of what's commonly desired from a queuing system. Google Cloud PubSub coalesces both of these behind a single API that provides clear support for common queuing patterns (1:1, 1:n, n:1, n:n).
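
As a toy illustration of how one topic/subscription model can cover those patterns (this is an in-memory sketch with made-up names, not the Cloud Pub/Sub client API): each subscription receives a copy of every published message, and workers pulling from the same subscription share its messages.

```python
from collections import deque


class Topic:
    """Toy in-memory topic; illustrative only, not a real client library."""

    def __init__(self):
        self._subscriptions = {}

    def subscribe(self, name):
        # Subscribing twice under the same name returns the same queue,
        # so several workers can share one subscription.
        return self._subscriptions.setdefault(name, deque())

    def publish(self, message):
        # Fan-out: every subscription gets its own copy of the message.
        for queue in self._subscriptions.values():
            queue.append(message)


def pull(subscription):
    """Take the next message from a subscription, or None if it is empty."""
    return subscription.popleft() if subscription else None
```

One publisher with one subscription gives 1:1, several subscriptions give 1:n, and several publishers calling `publish` on the same topic give n:1 or n:n, all through the same two verbs.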

In terms of storage, I think what the OP was referring to was S3 versus Glacier when compared against Cloud Storage (which offers competitors to both S3 and Glacier within the same API: just mark a bucket as Nearline and pay less for cold-stored objects).

If you count all of the additional AWS services that are logical equivalents to the Google ones mentioned you have SimpleDB, RDS, DynamoDB, and Redshift. So yes, many options for many different use cases, but Google coalesces things under a single API where the "verbs" are the same (as in the case of blob storage).

For me, GCP comes with unquantifiable existence risk. As in, how do I know that it won't get shut down in 5 years when some VP sees that it's not bringing in as much money as it should? I trust Amazon more in this regard, and their offering is not "so bad" that I feel a need to switch.
I understand your point, but GCP does address this in the terms of service (disclaimer: I work on Google Cloud):

7. Deprecation of Services

7.1 Discontinuance of Services. Subject to Section 7.2, Google may discontinue any Services or any portion or feature for any reason at any time without liability to Customer.

7.2 Deprecation Policy. Google will announce if it intends to discontinue or make backwards incompatible changes to the Services specified at the URL in the next sentence. Google will use commercially reasonable efforts to continue to operate those Services versions and features identified at https://cloud.google.com/terms/deprecation without these changes for at least one year after that announcement, unless (as Google determines in its reasonable good faith judgment):

(i) required by law or third party relationship (including if there is a change in applicable law or relationship), or

(ii) doing so could create a security risk or substantial economic or material technical burden.

The above policy is the "Deprecation Policy."

It's not just the whole platform, but also specific APIs / services being deprecated / shut down. Amazon isn't untouched by this problem (see also: VPC migration from EC2 Classic), but I agree that given their reputation, I don't trust Google to keep even very useful, widely loved stuff around forever.
Existence risk here is HUGE. If GCP doesn't move the needle, Google will shut it down. AWS is a much more living organism and I can't see Amazon shutting it down before their drones take over ...
To shut down GCP, they would be shutting down the same services and infrastructure that power their own services. The dogfooding memo from years back is being taken to heart, and you're seeing more and more exposure of internal services and infrastructure.

The scary thing would be if it ends up reducing the rate of innovation because they worry about changing APIs or services too much; obsolescence accumulates quickly when you're serving large numbers of people, because most businesses want to write once and run till it's dead. But this is true of any service that exposes anything but an extreme abstraction.

Right after they sealed the huge deal with Spotify? Seems like a near-term shutdown is unlikely.
Yes, unfortunately, GCE's reputation is tarnished by Google's approach to consumer-level services. I hope this changes over time - we need more competitors in this space.
Long-time user of AWS. I've never had a NAT service fail, nor a micro-instance disappear. I haven't had much exposure to GCE, but the removal of zones etc is interesting. How do you guarantee that your servers aren't sitting in the same data centre?
Terraform has worked incredibly well for us so far. Definitely deserves a look by anyone.
I have created more than a single VM on AWS. If I add together all of the companies that I used to work for, it is close to 8000 instances (5000+3000). I am not sure what I am doing wrong by not running into clunky stuff, but I guess it is automation that makes the difference. With projects like Ansible, Terraform or even the aws cli, creating and managing these large clusters is a breeze. I understand that you are using the UI and having trouble with the UX, but it does not mean that every user experiences that or shares the same sentiments or conclusions.
> Guess what AWS was developed for.

EC2 being developed for internal use is more myth than fact. The original idea was for internal use, but didn't exist much beyond a "short paper"[1] until it was green-lighted (by Bezos) as an external/sellable service.

[1] http://blog.b3k.us/2009/01/25/ec2-origins.html

Author provided a lot of information. I wouldn't call it a "hand wavy" article at all.
Well, providing lots of information vs. providing meaningful in-depth analysis are very different. I see your point though.
Best networking is a dubious claim. On the bigger AWS instances you can bypass the hypervisor with SR-IOV. AFAIK you still can't do that with GCP. So if you really need maximum network performance AWS will likely win, especially on latency.
On the contrary, this is one of the factors where GCP wins hands-down. In addition to the benchmark shown in the OP, check out one I posted on slide 14 of [1]. GCP achieves Gbit speeds on almost all instance types, and has higher speeds than AWS's biggest machines.

[1] https://docs.google.com/presentation/d/1B1jvWWh0ACaDv4ryEzLl...

I stand corrected. It also appears from those charts that the bigger GCP machines have more than a 10 Gbps connection. It looks like 2x10 Gbps. That would explain why instances of all sizes are able to push more network traffic.

Mind you, the benchmarks are for bulk transfer with a 9001 MTU. With more jittery workloads with lots of small packets, like a webserver has to deal with, you see the benefit of SR-IOV. So AWS may still have the advantage on some workloads and some measures (maybe latency, maybe CPU usage per packet). However, it's clear that if Google can support SR-IOV in the future they will mop the floor with AWS on networking, because their network infrastructure is obviously superior.

Anyone know where one can go to get a more quantitative, price/performance comparison of the various cloud services?

I know, lots of dimensions for that comparison, but probably picking a few dimensions, or letting users select a few, could give a reasonable ranking of services and prices.

Answering my own question, this looks pretty cool (no affiliation): https://www.cloudorado.com/cloud_server_comparison.jsp