Hacker News
by SilasX 1692 days ago
Yeah, that was my reaction. I get the need for all this reliability/failover, but it's a horrible failure of abstraction/separation of concerns.

There's no reason the serving team should have to learn how to do all of those things on the checklist, since it can be done by anyone who's already learned the infra. You're expecting them to learn all kinds of stuff outside of their specialty, when they should be able to kick the app over the wall and let infra ensure that the app is deployed in two separate PCR zones with the failover plan etc, which should itself be mostly automated.


Mega-Caps suffer from the following problem:

1. There are more engineers making more divergent architectural solutions such that there is never a single place where you can make changes across the group.

2. Failures keep happening, so process is instituted with many checkboxes for engineers to work through.

3. Engineers on the small-scale stuff get stack ranked against the engineers on the big-scale stuff. Everyone needs to show that they can do the work and are "fungible". This leads to small internal systems being held to the same operational standard as large public-facing systems.

I don't see what that's replying to. Nothing in that list would justify demanding that the app's team have knowledge or preferences about which PCR zones to pick, a choice that will just have to be corrected when they inevitably pick the wrong one.
The point is that every team gets to set their own failure modes. I know of multiple tier-1 services which diverge from at least one best practice.

Think of the scenario where a cloud provider needs to evacuate an AZ. There is no API which would allow the compute team to force-migrate tens of thousands of apps and guarantee that they are both unaffected and still meeting their redundancy guarantees.

Internal services at Google are in the same boat. However, Google knows about the hard edges and forces everyone to deal with all of that complexity - there is no API which the serving team could plug into to avoid this overhead.

That still at no point requires the application's team to make decisions about which two PCR zones to pick and which cells within it to pick, which [decision] can still be cleanly abstracted away, and would still be a mixing of unrelated concerns, and so your comments are still orthogonal to the point I was bringing up here.

Edit: It might help to check out my comment here, where I clarify what a dev should vs shouldn't have to worry about: https://news.ycombinator.com/item?id=29085638

While what you say is true, I think GP is ultimately correct. You can have a system define a convention and allow bypassing it, instead of forcing everyone to start from scratch. In fact, this is the approach that pretty much any modern service at Google will use.
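A minimal sketch of that "convention with an escape hatch" idea. The zone names, the `Placement` type, and `resolve_placement` are all hypothetical and only illustrate the shape; this is not how any real Google system is known to work.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Hypothetical zone names, for illustration only.
AVAILABLE_ZONES = ("zone-a", "zone-b", "zone-c")


@dataclass(frozen=True)
class Placement:
    primary: str
    secondary: str


def resolve_placement(
    app_name: str,
    override: Optional[Tuple[str, str]] = None,
    zones: Sequence[str] = AVAILABLE_ZONES,
) -> Placement:
    """Return a two-zone placement: the convention unless the team overrides it."""
    if override is not None:
        primary, secondary = override
        unknown = {primary, secondary} - set(zones)
        if unknown:
            raise ValueError(f"unknown zones: {sorted(unknown)}")
    else:
        # Convention: spread apps deterministically across the known zones.
        start = sum(app_name.encode()) % len(zones)
        primary = zones[start]
        secondary = zones[(start + 1) % len(zones)]
    # The redundancy rule is enforced either way, so nobody has to remember it.
    if primary == secondary:
        raise ValueError("primary and secondary must be in different zones")
    return Placement(primary, secondary)
```

Most teams would call `resolve_placement("my-app")` and never think about zones; a team with a real reason to diverge passes `override=(...)` and still gets the "two different zones" check for free.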
> when they should be able to kick the app over the wall and let infra ensure that the app is deployed in two separate PCR zones with the failover plan etc, which should itself be mostly automated

Not entirely - the developers should actively participate in designing the actual failover scenario and making sure the application can handle that (anything from being okay with some downtime due to the failover happening to designing an actual multi-region multi-master application). Making assumptions like 'infra will handle it' is a great way to not only get unexpected outages (because the developers assumed there would be no downtime because failover is magic, or that writes will never be lost) but to also introduce tensions between teams (because you now have an outside team having to wrangle an application into reliability when the original authors don't give a crap about it).

I get and agree with your point: the tooling and processes should definitely be simplified/automated when possible, and developers deserve a platform that just works. The whole point of a platform team is to abstract away the mundane to let people do their job. But reliability is everyone's job, not just the infra team's, and developers must understand the tradeoffs and technology involved in order to not design broken systems.

If that's the point:

A) It's doing a horrible job conveying it. A dev does need to be concerned with how to handle failover, but only at a certain abstraction level. They should be required to specify something in the form "given server A fails and has to pass to B, what do you do?" That does not require you to know the terminology about PCRs or how to decide which cells (or whatever) to pick on deployment, or to avoid the "gotcha" of making sure the two servers are in different PCR zones.

At that point, it's just following a checklist that needs no knowledge of the specifics of the app, and, to the extent that it's accurately representing how Google was, is indicative of bad processes.

B) Many things should be infra's job, as they're cleanly orthogonal to what devs are doing. For example, how to apply a security patch to a DB. That's unrelated to the operation of the app.
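To make the abstraction level in point A concrete, here is a rough sketch in which the app answers only "server A failed and B takes over, what do you do?" and the platform owns everything else. The `Infra` class, the handler signature, and the app and server names are hypothetical, invented for illustration.

```python
from typing import Callable, Dict, List

# The application's whole failover contract: given a failed server and its
# replacement, do whatever the app needs. No zone or cell names appear here.
FailoverHandler = Callable[[str, str], None]


class Infra:
    """Hypothetical platform layer that owns placement and triggers failover."""

    def __init__(self) -> None:
        self._handlers: Dict[str, FailoverHandler] = {}

    def register(self, app: str, handler: FailoverHandler) -> None:
        self._handlers[app] = handler

    def fail_over(self, app: str, failed: str, replacement: str) -> None:
        # Infra decides when and where; the app decides what happens next.
        self._handlers[app](failed, replacement)


events: List[str] = []


def drain_and_redirect(failed: str, replacement: str) -> None:
    # App-specific logic lives here (replaying writes, warming caches, etc.).
    events.append(f"redirect writes from {failed} to {replacement}")


infra = Infra()
infra.register("orders", drain_and_redirect)
infra.fail_over("orders", "server-A", "server-B")
```

The dev writes `drain_and_redirect`; which zones `server-A` and `server-B` live in never shows up in the app's code.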

I do get your point though, and I wouldn't say something like this about e.g. testing (which was the short, "reasonable" part of the video!) -- the devs have intimate knowledge of what counts as passing and failing and should be writing tests, and not 100% passing it over to QA. But that's precisely because such concerns are deeply tied in to the thing they are concerned with. "SQL 3.4.1 vs 3.4.2" is not.

Yeah, it seems like we agree :).
Because you have to get it working before you can make it better. Abstraction is quite secondary.
Yes, but the video is in the context of a mega-scale mega-corp that should have been able to set up clean abstraction boundaries by now.
They already have done that. This video is 11 years old; at that point Google was half the age it is now and a fraction of the size.
Google was still huge in 2010. Everyone seems to think that everything was a hundred percent different just <small number> of years ago...
> <small number> of years ago

Half a company's lifetime isn't a "<small number> of years ago" for that company. You can't compare tech ecosystems today to those in 2010; so many things have gotten standardized since then, and Google was at the forefront back then.

Unlike modern companies, Google had to build out everything themselves, since nobody had built those systems or even had experience building such systems. That takes time, but today all of the things Google learned are common knowledge in papers and similar.

If you disagree, name one company that, in 2010, had a one-button script that abstracted away things like where the data is stored for failure safety, data replication, etc. I don't think there were any. Google made it relatively easy to launch such services; having to manually configure the replication script and the zones your data should be stored in wasn't really a big deal.