Hacker News
by tpolm 2004 days ago
it is interesting that in both cases (the recent gmail one and this one) it was a "migration":

"As part of an ongoing migration of the User ID Service to a new quota system"

"An ongoing migration was in effect to update this underlying configuration system"

it was not a new feature, not a massive hardware failure, it was migrating part of the working product due to some unclear reason of "best practices".

both of those migrations failed with symptoms suggesting that whoever was performing them did not have deep understanding of systems architecture or safety practices and there was no one to stop them from failing. Signs of slow degradation of engineering culture at Google. There will be more to come. Sad.

4 comments

> it was migrating part of the working product due to some unclear reason of "best practices".

I want to push back on that. Of course, the reasons are unclear to an outsider.

Migrations are unavoidable in any system that is still evolving (i.e. not dead). Old designs turn out to be too limited or too slow for an evolved use case, so you migrate them to a new service or a new data structure.

If you try to avoid migrations by building The Perfect Things[tm] upfront, you get lost in overengineering instead.

In my own work, I do migrations with some regularity, and they all have a clear goal; it's never what you call 'some unclear reason of "best practices"'.

I think you're on to something. One of the challenges that teams at Google have is service dependencies. In theory, Google is one big, happy family and everyone is responsible for everyone's code. In practice, teams have focuses, software interdepends and interoperates, and mistakes get made at the margin where the linkages between two software systems are neither team's direct responsibility---or the responsibility of both teams.

It's not malice, it's incentives and information flow. Integrating with a service that one is not responsible for, one can get tripped up by unknown unknowns that the team that maintains the service has failed to document. And while a migration mistake is embarrassing, software engineering teams are generally rewarded for task completion, not for the time spent preparing for a failure that doesn't occur.

Really, these are universal problems that all large corporations and bureaucracies are vulnerable to. Overlaps or gaps in responsibility (breeding plausible deniability), lack of communication, and management issues (hence dysfunctional incentives) are difficult to root out and cost a lot of money to track down and fix.

It's to be expected that the outage was due to a migration: the majority of tests are designed to cover new features and bug fixes.

A migration is way more complex. It can touch multiple different components, and at Google scale that means different teams. It can be impossible to test this kind of migration without a testing platform as big as prod, and if Google did that, maybe all of GCP would not be enough.

This "unclear" reason can hide a bigger issue like a security bug fix, or just a migration that is needed to get somewhere.

It looks like COVID-19 hit everything. Since it began, attacks have increased a lot and security has needed to step up really fast. Some managers have trouble handling the fully remote situation, and some engineers do too. All of that combined can create small holes, and so some outages.

How many migrations way more complex than this one do they do without issue?

I don't think engineering culture is degrading, but shit happens, and in extreme situations we see problems more easily.

If this outage had been due to a feature or a "massive hardware" failure, we could be very worried about it.

>both of those migrations failed with symptoms suggesting that whoever was performing them did not have deep understanding of systems architecture or safety practices and there was no one to stop them from failing.

Can any single person at Google have a full understanding of all the dependencies for even a single system? I have no idea, as I've never worked there, but I would imagine that there is a lot of complexity.

somehow they managed to build complex systems like gmail, continuously develop new features there and not have massive outages due to "migrations" - which suggests that something they were doing right, they are no longer able to do

I'm pretty sure Google has had occasional severe outages for their whole history.

A single event is not data.

there were more than two incidents recently:

- YouTube outage in November 2020

- August 2020 outage of Google Suite including Gmail

in both cases no postmortems were published

Afaik, Google doesn't publish public PMs for non-paid offerings, so youtube doesn't get a public pm.

For the August outage, I believe there was a public pm. That said I can't find it now (I think there was some link rot somewhere, and I've escalated about that).

Two events in a single day.

To answer your question, the answer is yes. Some people do understand the dep stack. Takes years but hey there are lifers.