by dpryden 1692 days ago
Non-Googler: What do all those words mean?

Noogler: Haha, this video is so funny!

L4 SWE: (Crying because the video is so true)

L5 SWE: Haha, this video is so funny! I should show it to my interns, this will be a good training for them.

L6+ SWE: Why do people think this is funny? This Broccoli Man guy makes some really good points...


Where does "these are really good points, but why don't we have tooling which sets everything up automatically?" fit on the scale?
That would be 'Xoogler' because Google's engineering and broader corporate culture does not reward work like that and so when you realize that, you leave.

In general, Googlers have very little idea how far behind the rest of the industry they are when it comes to tooling.

I am a Xoogler.

I got the impression, based on a blog post by Eric Lawrence [1], that Google's developer tooling was top-notch (except for devs working on open-source projects like Chromium). Did it get worse since 2017, or are you talking about a different kind of tooling?

[1]: https://textslashplain.com/2017/02/01/google-chrome-one-year...

Or: These are really good points for a visibly-user-facing post-alpha service, but isn't it a bit overengineered for an experimental internal service whose clients can tolerate the risk of occasional downtime?
L5 Xoogler who left for a startup.
Yeah, that was my reaction. I get the need for all this reliability/failover, but it's a horrible failure of abstraction/separation of concerns.

There's no reason the serving team should have to learn how to do all of those things on the checklist, since it can be done by anyone who's already learned the infra. You're expecting them to learn all kinds of stuff outside of their specialty, when they should be able to kick the app over the wall and let infra ensure that the app is deployed in two separate PCR zones with the failover plan etc, which should itself be mostly automated.

Mega-Caps suffer from the following problem:

1. There are more engineers making more divergent architectural solutions such that there is never a single place where you can make changes across the group.

2. Failures keep happening, so process is instituted with many checkboxes for engineers to work through.

3. Engineers on the small scale stuff get stack ranked against the engineers on the big scale stuff. Everyone needs to show that they can do the work and are "fungible". This leads to small internal systems having the same operational standard as large public facing systems.

I don't see what that's replying to. Nothing in that list would justify demanding that the app's team have knowledge or a preference about which PCR zones to pick, a choice that will just have to be corrected when they inevitably pick the wrong one.
The point is that every team gets to set their own failure modes. I know of multiple tier-1 services which diverge from at least one best practice.

Think of the scenario where a cloud provider needs to evacuate an AZ. There is no API which would allow the compute team to force-migrate tens of thousands of apps and guarantee that they both are not affected and maintain their redundancy guarantees.

Internal services at Google are in the same boat. However, Google knows about the hard edges and forces everyone to deal with all of that complexity - there is no API which the serving team could plug into which will avoid this overhead.

That still at no point requires the application's team to decide which two PCR zones to pick and which cells within them to pick. That decision can still be cleanly abstracted away, and pushing it onto the app team would still be a mixing of unrelated concerns, so your comments are still orthogonal to the point I was bringing up here.

Edit: It might help to check out my comment here, where I clarify what a dev should vs shouldn't have to worry about: https://news.ycombinator.com/item?id=29085638

While what you say is true, I think GP is ultimately correct. You can have a system define a convention and allow bypassing it, instead of forcing everyone to start from scratch. In fact, this is the approach that pretty much any modern service at Google will use.
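The "convention with a bypass" idea can be sketched in a few lines of Python. This is only an illustration: `DeployConfig`, its fields, and `resolve_config` are hypothetical names, not any real Google or GCP API.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class DeployConfig:
    # Hypothetical platform-wide defaults (the "convention").
    replicas: int = 2
    zones: Tuple[str, ...] = ("auto", "auto")  # "auto" = let the platform pick
    alerting: bool = True


PLATFORM_DEFAULTS = DeployConfig()


def resolve_config(**overrides) -> DeployConfig:
    """Start from the convention and apply only what the team overrides.

    Unknown field names raise TypeError, so a typo cannot silently
    bypass a default.
    """
    return replace(PLATFORM_DEFAULTS, **overrides)


# Most teams take the defaults; a team with special needs overrides one field.
default_cfg = resolve_config()
custom_cfg = resolve_config(replicas=5)
```

A team that never thinks about zones still gets a two-replica, alerting-enabled deployment, while a team that does care can override a single field without starting from scratch.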
> when they should be able to kick the app over the wall and let infra ensure that the app is deployed in two separate PCR zones with the failover plan etc, which should itself be mostly automated

Not entirely - the developers should actively participate in designing the actual failover scenario and making sure the application can handle that (anything from being okay with some downtime due to the failover happening to designing an actual multi-region multi-master application). Making assumptions like 'infra will handle it' is a great way to not only get unexpected outages (because the developers assumed there would be no downtime because failover is magic, or that writes will never be lost) but to also introduce tensions between teams (because you now have an outside team having to wrangle an application into reliability when the original authors don't give a crap about it).

I get and agree with your point, the tooling and processes should definitely be simplified/automated when possible, and developers deserve a working platform that just works. The whole point of a platform team is to abstract away the mundane to let people do their job. But reliability is everyone's job, not just the infra team's, and developers must understand the tradeoffs and technology involved in order to not design broken systems.

If that's the point:

A) It's doing a horrible job conveying it. A dev does need to be concerned with how to handle failover, but only at a certain abstraction level. They should be required to specify something in the form "given server A fails and has to pass to B, what do you do?" That does not require you to know the terminology about PCRs and how to make decisions about which cells (or whatever) to pick on deployment, or avoiding the "gotcha" about making sure the two servers are in different PCR zones.

At that point, it's just following a checklist that needs no knowledge of the specifics of the app, and, to the extent that it's accurately representing how Google was, is indicative of bad processes.

B) Many things should be infra's job, as they're cleanly orthogonal to what devs are doing. For example, how to apply a security patch to a DB. That's unrelated to the operation of the app.

I do get your point though, and I wouldn't say something like this about e.g. testing (which was the short, "reasonable" part of the video!) -- the devs have intimate knowledge of what counts as passing and failing and should be writing tests, and not 100% passing it over to QA. But that's precisely because such concerns are deeply tied in to the thing they are concerned with. "SQL 3.4.1 vs 3.4.2" is not.
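The split described in point A above might look roughly like the Python sketch below. It assumes a made-up interface: `FailoverPolicy`, `DrainThenPromote`, `place_replicas`, and `run_failover` are all hypothetical names for illustration, not how Google's infra actually works.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class FailoverPolicy(ABC):
    """What the app team owns: 'given A fails and hands off to B, what do you do?'"""

    @abstractmethod
    def on_failover(self, failed: str, replacement: str) -> List[str]:
        ...


class DrainThenPromote(FailoverPolicy):
    # App-specific steps; only the app team knows these are the right ones.
    def on_failover(self, failed: str, replacement: str) -> List[str]:
        return [
            f"stop writes to {failed}",
            f"replay log on {replacement}",
            f"promote {replacement}",
        ]


def place_replicas(available_zones: List[str]) -> Tuple[str, str]:
    """What infra owns (hypothetical): pick two *distinct* zones.

    The 'both servers in the same zone' gotcha is handled here, once,
    instead of on every team's checklist.
    """
    distinct = sorted(set(available_zones))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct zones")
    return distinct[0], distinct[1]


def run_failover(policy: FailoverPolicy, available_zones: List[str]) -> List[str]:
    primary, standby = place_replicas(available_zones)
    return policy.on_failover(primary, standby)


steps = run_failover(DrainThenPromote(), ["zone-b", "zone-a", "zone-a"])
```

The app team writes only `DrainThenPromote`; zone names never appear in their code.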

Yeah, it seems like we agree :).
Because you have to get it working before you can make it better. Abstraction is quite secondary.
Yes, but the video is in the context of a mega-scale mega-corp that should have been able to set up clean abstraction boundaries by now.
They already have done that. This video is 11 years old; at that point Google was half the age it is now and a fraction of the size.
Google was still huge in 2010. Everyone seems to think that everything was a hundred percent different just <small number> of years ago...
imho the Google interview process selects for people who thrive on organizational challenges.
I think that was more or less the intended response. And ten years on, most of these things are automated. This video was a kick in the pants internally.
L9+
T7-T9 Vision
Is there a page that documents this anecdote? I’d like to link to it next time I use this phrase. Ironically, googling for it doesn’t turn up anything relevant.
Google has officially apologized. The person in question had to take their blog offline due to bad behavior from readers (unrelated to Google AFAIK). Overall, this is a dark chapter. It's also been scrubbed internally. Not obliterated, but you won't accidentally bump into it as a Noogler.

As much as I think people should take responsibility for their own actions, it's probably for the better to let this one rest now. Who caused it is irrelevant at this point, though. We (Xooglers, and Googlers) can take responsibility for our actions, and not continue perpetuating it.

I use the phrase to describe how leveling is not about what junior people think it’s about. I’m not clear about the responsibility you’re talking about.
> Where does "these are really good points, but why don't we have tooling which sets everything up automatically?" fit on the scale?

My guess is that it fits nowhere, because the L5s don't have the ability to automate it, and the L6s think it's trivial and, since it's done sparingly, not worth the work to do things differently.

And this is why we can't have nice things.

And yet it's been a decade since this video and practically everything it mentions is a non-problem now.

No one is spinning up new borgmon instances. Spanner is replicated by default. Only very low level services need to care about PCRs. If you use one of the approved frameworks it will set up practically all the production configuration for you. Basic alerting for your service is automated (just turn it on), picking cells to run in is automated, scaling your service is automated, etc.

Actually getting quota remains a problem... :-p

Anyway I would argue we can and do have nice things, and that has happened precisely through the efforts of a huge number of people at all levels.

Edit to add: of course, there are always new problems to complain about! It's the march of progress after all.

Yes. If someone were to make this video today, it wouldn't be about production jobs and PCRs, it would be about privacy reviews and branding approvals.

But the quota issues haven't changed a bit.

More like you aren't going to get promoted for automating someone else's toil. Also, now who's going to support it, better deprecate it since the library changed / got deprecated / it's Tuesday.
> More like you aren't going to get promoted for automating someone else's toil.

Lots of people were promoted for automating these things. They built easy to use services, got extra headcount since they became important and climbed the ranks. So not sure why you'd think that.

It may be different at other companies, but at Google building stuff that many other engineers depend on is a major way to get promoted. Of course if you automate something and nobody uses your automation tooling then you won't get promoted, but if your work gets used by basically every new engineer you'll climb the ranks quickly.

L7+ SWE

My life is a waste but the money is too good...

This applies to every level, particularly the lower levels.
it ain't much, but it's honest crying into piles of money
How good?
/me falls from the chair.
Not sure if "SWE" stands for software engineer, or "Sweden" as in Stockholm Syndrome
Random synapse activation:

A few years ago there was a Swedish tourist at a hotel where I was on vacation. He had a blue-yellow hat with "SWE" written on it in Courier font. I felt an urge to steal his hat because it looked better than most of the Google-branded swag I got as a Google SWE :)

Why not both? :sob:
Oh definitely Sweden.
> Non-Googler: What do all those words mean?

Exactly. This wasn't too relatable, even though I have the GCP Certified Architect cert.

I can't tell if this comment is implying that my comment is unclear, or if you're agreeing with the first line of my comment.

In either case, though, it's an inside joke precisely because it's more relatable to those who are (or were) inside. In particular, I think it would be most funny to someone who was at Google about a decade ago; when I left Google in 2017 things had already changed enough that this didn't ring quite as true for new hires.

That said, GCP is not very representative of what the internal platform looked like circa 2010. (Or even of what the internal platform looks like now, as far as I know.)

I agree that as a non-Googler, I don't get the video, that is all. No negative connotation toward your comment.
Why would internal tooling mean anything to you? And why would GCP knowledge be useful in any way?

It's fairly simple to extract the gist of what these systems do from the script.

As an ex-Apple person, I'd say it means there's way too much hierarchy at Google? Not sure I'm reading it right though.
IMO/IME it's the clash between tooling, systems, and processes designed for running long-term highly scalable and reliable services maintained by teams in multiple geographical locations and used by billions of people; and greenfield projects that just want to get things done at an early stage.

Requiring multi-cluster/region, the quota/resource economy system, handling PCRs, code review, readability approval for complex configuration languages (and the existence of such complex languages in the first place) ... all of that makes sense in a vacuum, and all were built to handle real problems and are likely written in the blood of a near-miss outage. But it also all comes crashing down on you when you're doing things from scratch for a relatively simple use case that no one really designed for.

we still had our processes though. Radar was my least favorite, but they replaced the ant eater app with one that was at least partially usable right before i left.
we'd say, about spoken mad scientist style requests, if it's not in radar, it never existed. :)