by vicpara 985 days ago
Just moved our infra from GCP to AWS. Kubernetes clusters, LB, storage, lambdas, KMS and all of it.

Google runs their tech stack as if it's a startup that builds their CV. Everything is immature, tons of hacks, undocumented features. If you are on their k8s there are tons of upcoming new versions and features that force you to revisit key hacks you put in your infra because of their shortcomings. Our infra team keeps tinkering around our infra and it never ends. It's 50:50. 50% of the time making sure we are prepared for their shit and 50% on our ambitious infra plans. Good luck with that.

With AWS our bill is 60% of what GCP used to be running 3 k8s clusters.

AWS support is so nice, you can't believe it.

Nah, I don't trust Google with anything. It's a scam. Google's support is horrendous. They refer you to idiots that drag you through calls until your will to live dies. And you're back at the mercy of some lost engineer that may comment on a GitHub issue you opened 20 days ago. We have a bug reported back in 2020 that got closed recently without any action because it became stale and the API changed so much it doesn't really matter. It's that bad.

The billing day is a monthly reminder you're paying entitled devs to do subpar work other companies do a lot better.

No, we don't miss them already.

19 comments

Interesting, if you swap GCP and AWS in your post then that's exactly my experience.

I wonder what makes us different. I work in Europe on video games; AWS’s handling of me when I was at Ubisoft left a really sour taste. When I moved to Tencent/Sharkmob I tried really hard to love AWS, as it was the de facto industry standard, and instead I was left with a feeling that most of it is inconsistent garbage papered over with lambda functions. I referred to these weird gotchas as “3am topics”: things that I don't have the mental capacity to deal with at 3am. I convinced the studio to switch to GCP, which, incidentally, they are still extremely grateful to me for doing.

> I was left with a feeling that most of it is inconsistent garbage papered over with lambda functions

This sounds more like an indictment of the system design than the cloud provider.

What are some of these “3am” topics that made GCP a better choice?

Small examples included (I’m on my phone so these are from memory and you’ll have to forgive the lack of great detail):

1) having the project/account you're in visible at the top at all times.

We used SSO for “accounts”, which is AWS’s way of completely separating resources. The long string that is shown is not unique at the start and the remainder is cut off, so all accounts/projects looked the same and it was impossible to tell at a glance if you were in dev, staging or prod.

2) Autoscaling groups that had human-readable incrementing “names”. In AWS, instances have hex slugs as instance names and you can give an instance a special “Name” label, but any new machines created with an ASG will just reuse the same name label, making them hard or impossible to tell apart.

The AWS official solution for this is to have a lambda function hook on the scale event and give your new node an incremented name label. Given that AWS is pricy to save me time: I do not personally consider this an elegant solution.

3) having all regions on one page.

We spent ~€6,000 on a database we didn't know about until we started digging into the bill. Being able to see what resources exist at a glance feels pretty basic to me tbh.

4) the network implementation overall; in Google you can just make a network and it will work without having to mess with zone routing and its configuration, which is otherwise put on the user.

If it’s on the user, it’s a variable that has to be checked during an outage; it is terraform code that has to be grokked and so on.
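
For illustration, the Lambda workaround described in point 2 could look roughly like the following. This is a minimal sketch, not AWS's official code. It assumes the function is triggered by the Auto Scaling "EC2 Instance Launch Successful" EventBridge event and that a boto3 EC2 client is passed in; `next_name` and `name_new_instance` are hypothetical helper names. It ignores pagination and does nothing about two instances launching at the same moment.

```python
import re


def next_name(prefix, existing_names):
    """Return '<prefix>-<N>' where N is one above the highest suffix in use."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)$")
    used = [int(m.group(1)) for name in existing_names if (m := pattern.match(name))]
    return f"{prefix}-{max(used, default=0) + 1}"


def name_new_instance(ec2, event):
    """Give a freshly launched ASG instance an incremented Name tag.

    `ec2` is assumed to be a boto3 EC2 client (or anything with the same
    describe_instances / create_tags methods).
    """
    detail = event["detail"]
    group = detail["AutoScalingGroupName"]
    instance_id = detail["EC2InstanceId"]

    # Instances launched by an ASG carry the aws:autoscaling:groupName tag.
    response = ec2.describe_instances(
        Filters=[{"Name": "tag:aws:autoscaling:groupName", "Values": [group]}]
    )
    names = [
        tag["Value"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
        for tag in instance.get("Tags", [])
        if tag["Key"] == "Name"
    ]

    # Assumption: the ASG name doubles as the name prefix.
    new_name = next_name(group, names)
    ec2.create_tags(Resources=[instance_id], Tags=[{"Key": "Name", "Value": new_name}])
    return new_name
```

In a real Lambda the handler would create the client with `boto3.client("ec2")` and call `name_new_instance(client, event)`. Passing the client in keeps the naming logic testable without an AWS account.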

“2) Autoscaling groups that had human-readable incrementing “names”. In AWS, instances have hex slugs as instance names and you can give an instance a special “Name” label, but any new machines created with an ASG will just reuse the same name label, making them hard or impossible to tell apart. The AWS official solution for this is to have a lambda function hook on the scale event and give your new node an incremented name label. Given that AWS is pricy to save me time: I do not personally consider this an elegant solution”

Why were you even messing with the instance name? This is a ridiculously simple problem to solve with tags on your ASG. And AWS even did the courtesy of propagating those tags across the ASG and all its instances.

https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-au...
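
As a rough sketch of the tag approach suggested here: boto3's Auto Scaling client has `create_or_update_tags`, and each tag can set `PropagateAtLaunch`. The helper name `propagate_tags` and the example tag keys are hypothetical, and the client is passed in as an assumption so the snippet runs without AWS credentials. Note that propagation only applies to instances launched after the tag is set, and every instance gets the same values, so this distinguishes groups rather than individual instances.

```python
def propagate_tags(autoscaling, group_name, tags):
    """Attach tags to an ASG so that newly launched instances inherit them.

    `autoscaling` is assumed to be a boto3 Auto Scaling client.
    """
    specs = [
        {
            "ResourceId": group_name,
            "ResourceType": "auto-scaling-group",
            "Key": key,
            "Value": value,
            # Copy the tag onto instances the group launches from now on.
            "PropagateAtLaunch": True,
        }
        for key, value in tags.items()
    ]
    autoscaling.create_or_update_tags(Tags=specs)
    return specs
```

Usage would be something like `propagate_tags(boto3.client("autoscaling"), "web", {"Stage": "prod", "Team": "backend"})`.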

> 1) having the project/account you're in visible at the top at all times.

I agree that this is an annoying issue in the AWS web console.

I assume this is something that could be fixed on your end by a little bit of CSS.

I believe the solution is to give the account an alias.

https://docs.aws.amazon.com/IAM/latest/UserGuide/console_acc...

The company I'm working at currently uses a Token Vending Machine.

pros: cannot get accounts mixed up.

cons: All sessions are actually 12hr sessions (ASIA not AKIA) and no access to perm keys for the CLI; security, I suppose. It's not too bad though, as TVM gives creds for various use cases.

https://aws.amazon.com/blogs/apn/tag/token-vending-machine/

We fix that internally by having names for accounts and having stages for accounts in a meta tool. There's a Tampermonkey script that pulls that in and shows it on screen, with a red banner if it's prod. Could be a JSON file in a GitHub repo. And yes, it could be a console feature, but everyone's got different concepts of prod. I think a ton of companies use like 2 total accounts as well.
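
A minimal sketch of the lookup behind such a banner, assuming the registry is a JSON file mapping account IDs to a name and a stage. The real tool described above is a Tampermonkey (JavaScript) script; this only shows the logic in Python. The account IDs, field names, and `banner_for` are all made up for illustration.

```python
import json

# Hypothetical registry; in practice this could live in a repo as accounts.json.
REGISTRY_JSON = """
{
  "111111111111": {"name": "game-dev", "stage": "dev"},
  "222222222222": {"name": "game-prod", "stage": "prod"}
}
"""


def banner_for(account_id, registry):
    """Return the banner text and color to show for an account ID."""
    entry = registry.get(account_id)
    if entry is None:
        # Unknown accounts get their own warning instead of silently passing.
        return {"text": f"UNKNOWN ACCOUNT {account_id}", "color": "orange"}
    color = "red" if entry["stage"] == "prod" else "green"
    return {"text": f"{entry['name']} ({entry['stage']})", "color": color}


registry = json.loads(REGISTRY_JSON)
print(banner_for("222222222222", registry))
```

What counts as "prod" is just a field in the file, which fits the point that every company defines it differently.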
It's amazing how people complain about GCP. We run a massive deployment across 100+ regions cross-cloud (GCP, Azure, AWS) and oh boy. GCP has good support if you are big enough. Azure, though, which has a much bigger share than GCP, is horrendous. Absolutely garbage all around. Good luck ever getting anyone in Engineering even if you are paying for support. AWS on the other hand: amazing. We have Ent Support so those guys are in our Slack channel. The TAMs are amazing. Need to get hold of someone in Route53? No problem, they are on the call this week. Feature request for EKS? OK, talk to the Product Manager this afternoon.

Azure is a dumpster fire from the ground up.

Can you give Azure specifics? As you know, Azure has a massive offering.

My experience has been the opposite, though not without issues. Azure has some of the best corporate and security features of any cloud and it's only getting better. The zero trust model fits in so nicely with their identity platforms it's a sight to behold compared to other cloud providers which likely use some form of AAD or AD DS anyway.

Their support is responsive and they seem to know what they're talking about. (AKS)

Please provide some specifics on your experience?

Azure has had repeated, significant security failures that impact numerous customers. I don't understand how anyone can defend their security except through willful ignorance.

I have friends forced to use Azure and they routinely report issues with provisioning resources, things taking a very long time to spin up or simply being rejected because Azure doesn't have any capacity.

A memorable example: when we ran a heavy Azure Functions workload on our App Service Plan, the hosts would devour themselves.

Functions use containers under the hood. Each invocation created a new container, and when enough of them ran long enough, the host disk would fill up. (Pretty sure our workload wrote almost nothing to disk.)

An internal Azure disk clean-up routine kicked in, which deleted image layers for running Functions. This deleted the filesystems for containers that were still running, yanking them out from underneath the running processes. It also meant the host couldn't launch new instances of our Functions.

At this point the host was poisoned and couldn't launch any new work, even after the workload was reduced. It had to be terminated and replaced, after we detected the problem manually.

Azure support never seemed to take the problem seriously, and after we migrated our workload off of Functions they decided the problem must be resolved since we weren't complaining anymore.

> AWS support is so nice, you can't believe it.

This reminds me of the fond days of having weekly customer calls. We develop AWS services, and we answered our customer-support calls directly. No middle man. Just techies to techies. And we made promises to customers on the fly, and customers sometimes project managed us.

Sounds like hell...
The last part is a bit over the top, but direct access to customers as a dev is a plus, not a minus.
Customers aren’t that bad, really.
We have an old "quiet part out loud" corporate story. It's about how one arm of Google was using our service and wondering why it had so much downtime, only for us to point at their GAE arm and say "when they're down, we're down". They went and talked to GAE and, funny enough, were able to correlate the downtime they observed with GAE downtime.

GAE uptime improved, for a little while. Yeah, we're on AWS now too.

Does Google run on GCP?
From my understanding they don't dogfood a lot of GCP products internally. That's how you end up with janky integrations between their products. It's really frustrating at times to see their cloud architects pitch some grouping of technologies that you should use, only to find out the integrations aren't well tested at scale. For example, they pushed for Pub/Sub to be used with Dataflow for near-real-time processing, only for us to figure out that at scale global Pub/Sub has high latency, above 1 minute and sometimes 5 minutes, on 1% of messages.
Yes in the sense that they use all the services and infrastructure that GCP is built on, but no in the sense of using the vanilla GCP interface.

Instead many aspects of GCP's management console are handled by different internal tools, often command line driven. IME they are often far more unwieldy than GCP.

Sometimes this makes sense (far tighter access controls and configuration change controls than a typical company), and sometimes it's just because of legacy ways of doing things.

I worked on a team at Google that used the internal GCP to serve some code/content for a specific feature, and in some ways it was more frustrating than using either the normal internal systems or just vanilla GCP.

Parts, yes. In reference to the specifics mentioned in here though, those services run on Infra Spanner, not Cloud Spanner, but they're the same stack. The main reason things like Gmail, Ads, etc. haven't swapped into GCP is the internal tooling built up around Infra Spanner for those services, which is specific to Google and doesn't make sense in Cloud Spanner.
It's way WAY more than just Infra Spanner vs Cloud Spanner. Cloud spanner doesn't support protobuf, which is annoying, but that's not a dealbreaker; it's still just a DB. The issue is really all the various internal frameworks (such as Apps Framework for Java), deployment systems (Server Platform, AKA Boq/Pod/Urfin), and so forth.
Of course, I was simplifying. It's always more complicated doing a migration. :)
Not just migrations are hard, either; Google Cloud has put (almost?) zero effort into making it easy to use Cloud from systems running on Borg.

My old team was building a system that was half-GCP and half-Borg, and we had to write our own (extremely bad) Cloud Spanner fake for use in tests. In contrast, Infra Spanner is extremely well supported for tests. Same with BigQuery vs Dremel and many other systems.

this is maybe the most ill-informed post about google I've ever seen on this site, wow
Short answer: No
Borg still? Supposing you can say. Don’t reply if it’s still Borg and you can’t.
Mostly not.
no
> AWS support is so nice, you can't believe it.

This! They even custom-coded their support portal better than those off-the-shelf vendors like Zendesk. I say this as a Zendesk paying customer.

GCP, on the other hand, is F-tier in support. It almost feels like I need to beg them to get any level of help.

At one point, Google reached out to me to try and tempt us over from AWS. I had bad experiences with Google support in the past, but liked their AI stuff and was keen to give them another go.

We booked a follow up call in the calendar, I spent good time preparing my notes and requirements for the meeting... and then nobody on their side showed up or contacted me again.

A GCP issue was the only time I had human contact with Google, and they did well. However, high-scale, low-touch is in their DNA and you can tell.
Wouldn't Zendesk be one of those software things that has had too many features bolted on without oversight and/or a unified philosophy behind it?

I'm of the opinion that focused products created by smaller teams are better.

Much of the time GCP feels like a science project, and not a real business. AWS (and Azure) seem to be driven by customer requests, instead of Google, which feels very engineering-centric.

Which is on brand with Google. They have no problem launching stuff, and no problem killing stuff. But man, then just get out of the cloud business and focus on what you're good at.

> AWS support is so nice, you can't believe it.

It's actually sort of ridiculous. AWS has the best support I have ever interacted with. I mean, our org certainly pays enough for it but it's so completely unusual in tech, or really any sector to get great support even when you're paying for it.

Every time I run into an issue I’m reluctant to reach out to AWS support, because of my default expectation that it will be a terrible waste of time.

Every time, I am also proven wrong as someone competent on their side both actually understands my issue and finds a resolution.

> We have a bug reported back in 2020 that got closed recently without any action because it became stale

One of my pet hates is the (ab)use by repo maintainers of the auto-close-when-stale feature on GitHub.

What useful purpose does it serve beyond making the repo maintainers look good because they have a low number of open issues?

It doesn't actually address the issue. It's the virtual equivalent of brushing it under the carpet.

I worked in a Digital team 4 years back where the team was building voice channel apps for our customers on both Amazon Alexa and Google Dialogflow. The Alexa NLP engine was less sophisticated; we had to give it hundreds of prompts and intents. The Dialogflow NLP engine required a handful of prompts for the same thing. But when it came to integration with backend APIs and support, Alexa was far ahead. Despite us having Dialogflow enterprise, Google support would suggest asking on Stack Overflow. Amazon support on the other hand was excellent. We needed support for mTLS with the backend APIs; Amazon supported it as they understood enterprise. Google just shooed us away, their support wouldn’t even escalate this.
I don't know. I like GCP. I have been in an Azure centric corporation for close to two years now and I dearly miss GCP almost every day.

My team has a sort of a sandbox where we can use almost any Azure product we want (our IT is supportive and permissive as far as that sandbox goes, which is a blessing), but even then it's just painful in comparison.

AWS is probably better though.

Azure seems like a nightmare, regardless of whether GCP or AWS is better.
There is no way this is true. Only explanation is you work for AWS :-). GCP's strength is its cost. Yes, maybe the support could be better. But can you explain what "hacks" you are talking about? And the claim that K8s (from Google) is better on AWS than GCP is absolutely false.
We are a small startup, 12 strong.

Our reason for going all in with GCP was the k8s. We've been using GCP for 2+ years. The trouble we have is with stability and so many of the features being constantly rolled out.

Our experience was that K8s cost more on GCP than AWS.

Just on LoadBalancers alone, you have tons of tricks that are specific to GCP implementation. And we needed a few extra because you couldn't run all the features we wanted on 1-2 per cluster. For example, we have a 3rd party that required all our requests to always originate and respond back from a fixed IP address. We could only pick one not a range, not a list. This was a hard requirement. The service was important so we had to do it.

It took our team several days to find how to do it using online documentation and support. Tech support was useless. We had one guy in our team that spent 2 days on the phone with a paid, local GCP implementation partner trying to get this problem sorted. Nothing came out of it other than being pitched on our dime a lot of services and architecture we didn't need. Eventually we figured it out on our own. I don't even remember speaking about this when we transitioned to AWS.

Matches my experience. GCP has many better services than AWS but I am not going to run production workloads with them after 2 years of experience at my previous company. There are so many undocumented quirks that many times you could find a better solution from some random person on Stack Overflow than from the highest tier of paid support.
That was my experience, too - a couple of things which were better than AWS but this constant stream of paper cuts hitting all of the problems which weren’t cool enough to get someone promoted.
Doesn't help when GKEngine has constant issues and the recent upgrade feature is on a 10-day strike.

I was recently woken up by an alert about DNS resolution issues.

GCP rolled out a new version of SkyDNS and NodeLocalDNS, SkyDNS reported 99% miss, and I had to quickly hack it.

This is not the “out-of-the-box” experience you want to have.

I generally like GCP, however their sales and customer support just aren't any good. And some services like Vertex AI are extremely buggy while it's hard to actually report these bugs.

I think Google Cloud needs someone like Jeff Bezos as their head: look at what your customers actually want and need and understand their requirements. And they usually want good customer support and a competent key account manager as well.

When we were looking to migrate our analytics database from on-premise to a cloud alternative we were looking at BigQuery and Snowflake. BigQuery is a great product and we were already deeply invested in GCP as well. However the GCP sales team just couldn't sell BigQuery - they just don't know what old corporations want to hear in a sales pitch. So we went with Snowflake in the end. Not because it's the better product but because their sales team is better.

I'm not sure if the cloud business is actually a priority at Google. If it is then I think they don't understand the mistrust Google is facing when it comes to stable long term support of their products.

The horror stories of Google support, across all of their products, are enough for me to never trust GCP. Even if someone told me today "GCP is the exception, they have great support" I probably wouldn't care - they are so organizationally incapable of providing good support that, even if they did so today, I wouldn't believe that it could last.
My experience too, GCP is frustrating. However, there is nothing like BigQuery to me, I love that DB.
Support-wise, GCP is a joke run by entitled people. I had an issue some time ago with a VPN and after doing a lot of troubleshooting and having them agree the problem was on their end (packets would go into their VPN Gateway from the VPC, nothing would come out), the solution was to update the configuration on my end to work around whatever they did because "it is how is going to be"...

TL;DR: they broke something and wouldn't fix it.

Doing business with Google is a liability.
Did you consider your own bare-metal?