by mlthoughts2018 1981 days ago
I also have significant experience in all 3 and I couldn’t disagree more. GCP support & documentation alone is a dramatic reason to avoid GCP. GCS CLI utilities are supposed to be S3 API-compatible, but they are not. GCP keyfile-based access is a horrid anti-pattern, and the rules for human IAM user vs service account vs impersonation are not uniform across all products (e.g., if you need developers to have ad hoc non-console access to both GCE VMs and Dataproc clusters, you have to manage two very different approaches to identity-based access).

GCP’s region-level SLAs are poor for most products, and over a window of a few years they don’t actually meet them. GCP has all kinds of nasty legalese about “beta” features that aren’t supported by the SLAs, and if you use them, you forfeit your right to claim credits after SLA-violating outages. For GKE in particular, Google’s rules basically exclude every aspect of Kubernetes you need to actually use it in production, which is a blatant attempt to force users into Anthos.

In machine learning in particular, GCP has horrible offerings that are massively over-priced and/or are 100% hype-driven (TPUs are a good example, but also things like running Kubeflow or Feast).

Google Cloud Functions and Google Cloud Run have such severe limitations on resource sizing, especially memory, that they are irrelevant, whereas by comparison Fargate is excellent for ML workloads. There really is no equivalent in GCP, since Cloud Run can’t handle large Docker containers needing high RAM, so you’ll just be rerouted to GKE, where because of the SLA legalese you can’t actually use any of the tools you want. And then on top of this, configuring any type of hybrid open internet / internal data center service with Cloud Functions or Cloud Run is miserable. You need a full networking team solely to manage Cloud Functions or Cloud Run service access; it is absolutely nowhere close to self-service for normal backend teams.

GCP is a miserable, miserable choice for cloud vendor. It is typically chosen solely due to being cheap in the short term and allowing bulk deals on GSuite, Ads credits and other deal sweeteners. It’s so stupid to choose GCP for these short-term deals, because Google absolutely will lock you in and raise prices for their garbage tools and poor customer service.

For my money, both Azure and AWS are still light-years ahead of GCP, and I would gladly pay a premium to use either just to avoid GCP.

1 comments

> For GKE in particular, Google’s rules basically exclude every aspect of Kubernetes you need to actually use it in production

I’ve been using GKE in prod for ~4 years, and have never needed beta features. What beta features do you think are required?

I’ve also always been able to assign permissions to a user, group, or service account. When have you not been able to do so?

I think you are probably confused. There are many beta features in Kubernetes and they are enabled by default. For a long time, very critical features could remain beta for years, such as all of Ingress and all of CronJob.

- https://kubernetes.io/blog/2020/08/21/moving-forward-from-be...
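To make the beta point concrete, here is a minimal sketch that flags manifests using beta API versions. It relies only on the Kubernetes version naming convention (`v1`, `v1beta1`, `v1alpha1`); the helper names `stability` and `beta_kinds` are made up for illustration, and the sample `apiVersion` values are the ones Ingress and CronJob carried before they graduated to GA.

```python
import re

# Kubernetes version strings look like v1, v1beta1, or v2alpha1.
_VERSION = re.compile(r"^v\d+(?:(alpha|beta)\d+)?$")


def stability(api_version: str) -> str:
    """Return 'alpha', 'beta', or 'ga' for a Kubernetes apiVersion string."""
    # Core-group resources use a bare version ("v1"); others use "group/version".
    version = api_version.rsplit("/", 1)[-1]
    match = _VERSION.match(version)
    if match is None:
        raise ValueError(f"unrecognized apiVersion: {api_version!r}")
    return match.group(1) or "ga"


def beta_kinds(manifests):
    """List the kinds, from parsed manifests (dicts), that use beta APIs."""
    return [m["kind"] for m in manifests if stability(m["apiVersion"]) == "beta"]


# Hypothetical sample manifests, reduced to the two fields that matter here.
manifests = [
    {"apiVersion": "networking.k8s.io/v1beta1", "kind": "Ingress"},
    {"apiVersion": "batch/v1beta1", "kind": "CronJob"},
    {"apiVersion": "apps/v1", "kind": "Deployment"},
]
print(beta_kinds(manifests))  # prints ['Ingress', 'CronJob']
```

This only inspects API versions in manifests. It says nothing about which GKE release channel a cluster is on or what the SLA actually covers.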

One of the big issues with the GKE SLA is that your organization must only consume Kubernetes from the Stable channel, but given the large number of enabled-by-default, critical beta features, many (probably most) production deployments of Kubernetes intentionally rely on non-stable channels for the sake of critical beta features that have been de facto production features. In my company, for example, we must run a slightly older version of Kubernetes and upgrades are very slow, so we are way behind the Stable channel with no way to upgrade fast unless the Stable channel supports all sorts of enabled-by-default beta features and older versions. So we could never run a hybrid cloud with GKE; it would violate the SLA restrictions from first principles. This has created a nasty, painful rift in my org between the on-prem Kubernetes and the cloud (essentially useless) GKE Kubernetes.

Beyond this, there are other features that are critical and uncovered, like multi-region ingress. We operate some very large data ingestion services for customers and we absolutely need a higher SLA uptime on them than what a single region offers in GCP. So we have to operate multi-region ingress, but all of the non-Anthos solutions are no longer supported by GCP, and they void the individual region SLAs. It’s madness.
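As a rough illustration of why multi-region matters for uptime, the sketch below computes the availability of N regions where the service is up as long as at least one region is up. It assumes region failures are independent, which real outages often are not, and the 99.5% figure is a made-up example, not a number from any GCP SLA.

```python
def combined_availability(single_region: float, regions: int) -> float:
    """Availability when at least one of N regions must be up.

    ASSUMPTION: region failures are statistically independent.
    """
    return 1.0 - (1.0 - single_region) ** regions


def monthly_downtime_minutes(availability: float, days: int = 30) -> float:
    """Expected downtime in minutes over a month of the given length."""
    return (1.0 - availability) * days * 24 * 60


single = 0.995  # hypothetical single-region availability, for illustration only
for n in (1, 2, 3):
    a = combined_availability(single, n)
    print(n, f"{a:.6f}", f"{monthly_downtime_minutes(a):.2f} min/month")
```

Under these assumptions, a second region takes the allowed downtime from hours per month to around a minute, which is why losing supported multi-region ingress hurts so much.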

On top of all this, Google does not actually publish clear lists of features that are or are not covered by the SLA. The wording relies solely on the Kubernetes alpha / beta / GA channels, but nothing actually ties Google legally to that. They can arbitrarily define the SLA terms to mean whatever they want them to mean at any time. While you likely can’t avoid a cloud provider with that freedom, at least you could expect them to actually publish and document it.

> I’ve also always been able to assign permissions to a user, group, or service account. When have you not been able to do so?

Please check my comment again. I mentioned specific examples (user-based, not service account-based, workflows in Dataproc, for example) where it’s not possible in GCP, as in the product itself disallows it. It’s not an issue of me or you or anyone being able to create IAM policy or service accounts. It’s an issue that different products within GCP fundamentally disallow some auth workflows (like Dataproc cluster creation being fundamentally disallowed for user-based auth workflows), which then forces you to manage multiple different auth flow patterns even within the same user workflow (for example, user-based auth flows for GCE VMs but impersonating service accounts for the exact same steps for a Dataproc cluster), leading to much more overhead, more inscrutable errors, and more round trips through security approval. The issue is the poor product design, not some general inability for a user to figure out a service account.

I was assuming you meant GKE beta features, not k8s beta, since the latter is ridiculous. Can you point me to the bit of the SLA you’re referring to here?