(I work for Azure, so I don't really have a dog in this fight.)
Google's cloud seems pretty capable; they're just going through teething pains. They got to live migration extremely fast, and they seem to have nailed BigQuery and Spanner (albeit not the pricing model). They will get more reliable.
Internally, though, Google was (and maybe still is) a decade ahead. With Borg they deployed their entire infrastructure as containers, running at insanely high utilization via Heracles, and built homogeneous, highly scalable networking. They invented synchronous georeplication with Spanner. By all accounts, it's a paradise.
It just turned out they were too greedy about keeping their innovations to themselves and too egotistical to believe the rest of the world would catch up, and their infra was too difficult to productize.
For instance, their containers can only rely on API filters for security, so they couldn't bring Borg to the masses, other than cutting it down dramatically to make Kubernetes and requiring it to run on VMs. It looks like Spanner is also priced by the core, which implies it also can't be safely/compliantly commingled with their internal instances.
Still, for a company playing catch-up and needing to completely change its mindset (remember that initially they didn't even offer VMs), they're doing darn well, especially now that they've pivoted to GKE/big data/ML and are targeting multi-cloud or niche clients rather than enterprises.
Google deserves fame for its infra; it just didn't have good business sense when it came to cloud.
The road to higher reliability is a slow one when we are talking about cloud-level 9s. I get the feeling they are still in the early stages of adopting a B2B culture and everything that comes with that.
For example, the June global network outage was caused by a software bug that had global implications. AWS moved away from building software that crossed region boundaries years ago, I would guess because they had an issue like this one. It feels like GCP is many mistakes behind AWS.
Cloud reliability isn't due to genius; it's just software and infrastructure that have been improved again and again in response to new failure modes. I think it will be several years before GCP's reliability is comparable to what AWS has today.