by brandon 1945 days ago
> Nobody at Google has a need to raise a ticket with some ops group in Bengaluru to partition a Kafka topic, renew a certificate, bridge two VPCs, or any of that type of thing.

Except when your team wanted to initially onboard with GOOPS and your request sat in Buganizer for 2 weeks waiting for someone to triage. Uh oh — we're turning down this service next quarter, you will need to go start this onboarding process again with its replacement.

Or when you needed quota in a cell where your product area didn't have Flex. Maybe you can set up a VC with your PARM? Does next week work for your launch plan? Hopefully they can do something for you!

Or when your logs access request sat in GUTS for a month because both of the approvers were on vacation and no, there's not an escalation path.

Or when you needed to change a firewall rule for a project your team inherited which for some reason runs on GCE. Make sure you bring your Ariane link when you open your request. Have ISE reviewed your code? No? ISE currently have a quarter-long backlog, so we're not sure we can grant your firewall exception.

None of these examples are contrived; the weight of the operational bureaucracy is staggering. It may well be that this stuff is felt more on the SRE/Security side around production launches than on the SWE side for experimentation or iterative development, but I struggle with the idea that Google is nimble.

3 comments

Registered an account to reply here, because your complaints feel one-sided to me.

Most of what you described I felt as well _sometimes_ for security-related stuff, like dedicated machines in that one cluster or an ISE review on short notice--but security-related work is also somewhat out of the norm, and considering that, Google does a great job.

For "normal" services, what you described does not match my experience at all. Even for medium-sized infrastructure services, mostly everything just works (IME).

Never had a GUTS ticket that was not answered within a business day, but obviously just n=1 sample--imo support staff is mostly amazing.

Sure, things get hairy when you go off the beaten path, but day-to-day infrastructure is not the issue. As a user of Google products, I don't care as much about developer velocity as I do about them shipping Swiss cheese products security-wise. If I have to wait a few months more for some new feature, I'll take that trade-off.
Right, the slothful approval process for log retention and access is a feature, not a bug. It's part of the reason why Google's technical privacy story is incomparable.
That was some T7-9 whining right there. Do you think it's easier to get unplanned compute capacity at some other company?
Well, yes. I was provisioning a new service last week and it took me half an hour of clicking buttons in AWS. Without knowing anything about Google, I would have assumed they'd overprovison compute capacity to save developer time at least for smallish requests, since they literally run their own data centres.
They do. When GP says stuff about not having flex in a cell, that essentially means "has not provisioned any quota whatsoever in that zone". Once you do the baseline work to provision some quota, generally speaking you have a somewhat over-provisioned pool to use for whatever.

The need to run in a particular cell is unusual.

More usual is "I need to run in at least three cells in region R". Thankfully, I never faced the "you need to turn up in cell EX tomorrow" without TPM support.
They do.