by danpalmer 839 days ago
Joining Google a few years ago, one thing I was impressed with is the amount of effort that goes into graceful degradation. For user facing services it gets quite granular, and is deeply integrated into the stack – from application layer to networking.

Previously I worked on a big web app at a growing startup, and it's probably the sort of thing I'd start adding in small ways from the early days. Being able to turn off unnecessary writes, turn down the rate of more expensive computation, turn down rates of traffic amplification, these would all have been useful levers in some of our outages.


It's really great to have such capabilities, but adding them has a cost that only a few can afford: the investment in building them, which slows your feature velocity, plus the ongoing maintenance.
Can you be specific about the cost of building these?

I've run into many situations where something was deemed costly, the gap is found out later, and the team ultimately has to implement it all while hoping no one groks that it was predicted. "Nobody ever gets credit for fixing problems that never happened" (https://news.ycombinator.com/item?id=39472693) is related.

When I was in Search 15 or so years ago, there was actually a very direct cost: revenue.

The AdMixer was an "optional" response for the search page. If the ads didn't return before the search results did, the search would just not show ads, and Google wouldn't get any revenue for it. Showed the premium that Google of the day put on latency and user experience. I think we lost a few million per year to timeouts, but it was worth it for generating user loyalty, and it put a very big incentive on the ads team to keep the serving stack fast.

No idea if it's still architected like that, I kinda doubt it given recent search experiences, but I thought it was brilliant just for the sake of aligning incentives between different parts of the organization.
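
A minimal Python sketch of the "optional response" pattern described above: the required call is always served, and the optional one is dropped if it has not finished in time. Everything here is hypothetical (`fetch_results`, `fetch_ads`, `serve`, the `grace` parameter and its default); it illustrates the idea and says nothing about how Google's stack actually works. The comment describes dropping ads that are not back before the results, so `grace` is a small extra allowance added as an assumption.

```python
import concurrent.futures
import time


def fetch_results(query):
    # Hypothetical stand-in for the required backend (search results).
    return [f"result for {query}"]


def fetch_ads(query, delay):
    # Hypothetical stand-in for the optional backend; `delay` simulates latency.
    time.sleep(delay)
    return [f"ad for {query}"]


def serve(query, ads_delay, grace=0.1):
    """Serve results always; include ads only if they arrive in time."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        # Start the optional call first so it runs alongside the required one.
        ads_future = pool.submit(fetch_ads, query, ads_delay)
        results = fetch_results(query)
        try:
            # Wait at most `grace` seconds beyond the point results are ready.
            ads = ads_future.result(timeout=grace)
        except concurrent.futures.TimeoutError:
            ads = []  # degrade: respond without the optional part
        return {"results": results, "ads": ads}
    finally:
        # Do not hold the response while the slow call finishes.
        pool.shutdown(wait=False)
```

With a fast ads backend the response carries both parts; with a slow one the `ads` list is simply empty and the user still gets results.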

The developer, tester and devops time required to properly implement graceful degradation could easily accumulate to hundreds of hours.

Those hours are directly expensive when your developers cost hundreds of dollars a day, and they carry a material opportunity cost: committing developers to one project delays the delivery of other features.

Moreover, any new features would have to be made compatible with the graceful degradation pattern, creating an ongoing cost.

When you hire an engineer to build a dam, you expect them to consider piping and subsurface flows so that the foundation isn't swept out in a decade, no matter whether the engineer has already been paid, has retired, etc.

My point isn't that we all need to make dams that can hold up for a century. The point is that you hire an engineer because you want someone with the judgement and expertise to apply the correct amount of engineering to any given solution. Over-engineering is on the pathway to correct-sized engineering. It's the experience, discovery, and exploration required to arrive at choosing what things actually do not need to be done.

When your manager asks you, "Do we really need to do that?", it's the expert who can explain why it really is necessary, and the professional who accepts "we're not going to do that" as an answer. And if they still feel it would be harmful not to do it, then that's where professional duty kicks in.

There are a lot of levels to the approach.

Just spending a few moments to consider whether queues should grow, block, or spill when adding them makes a big difference, along with choices in error handling. You can get a lot of things to gracefully degrade for free if that's a part of your decision-making process.
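
To make the grow/block/spill choice concrete, here is a small Python sketch using the standard `queue` module. The `enqueue` helper, the policy names, and the in-memory `spill` list are all made up for illustration; a real spill target would more likely be disk or another service.

```python
import queue


def enqueue(q, item, policy, spill=None):
    """Add `item` to `q` using one of three overflow policies.

    "grow"  : meant for an unbounded queue (maxsize=0), which never fills up.
    "block" : wait for space, pushing backpressure onto the producer.
    "spill" : if the queue is full, divert the item to `spill` instead.
    """
    if policy == "block":
        q.put(item)  # blocks until a slot is free
        return "queued"
    try:
        q.put_nowait(item)
        return "queued"
    except queue.Full:
        if policy == "spill" and spill is not None:
            spill.append(item)
            return "spilled"
        raise  # "grow" on a bounded queue is a configuration mistake


# "grow": memory is the only limit, so an overload shows up as memory pressure.
growing = queue.Queue()
# "spill": at most 2 items wait in memory; the rest are diverted.
bounded = queue.Queue(maxsize=2)
overflow = []
outcomes = [enqueue(bounded, n, "spill", overflow) for n in range(4)]
```

Each policy fails differently under overload: growing trades latency and memory for completeness, blocking slows the producer down, and spilling keeps the hot path bounded at the cost of handling the overflow later.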

Could be as simple as just some feature flags with environment variables.
I also found that when building a feature iteratively, with feature flags for rollout, a simple feature degradation path often appears naturally.
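
A rough Python sketch of what environment-variable feature flags might look like. The `FEATURE_` prefix, the `flag_enabled` helper, and the product-page example are invented for illustration. The same flag that gated the rollout doubles as the off switch during an incident.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def flag_enabled(name, default=True, env=os.environ):
    """Read a flag such as FEATURE_RECOMMENDATIONS from the environment."""
    raw = env.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def expensive_recommendations(product):
    # Hypothetical stand-in for a costly, non-essential computation.
    return [f"similar to {product['title']}"]


def render_product_page(product, env=os.environ):
    # The core of the page is always rendered.
    page = {"title": product["title"], "price": product["price"]}
    # The nice-to-have part is skipped when the flag is turned off.
    if flag_enabled("recommendations", env=env):
        page["recommendations"] = expensive_recommendations(product)
    return page
```

Setting `FEATURE_RECOMMENDATIONS=0` and restarting (or re-reading the environment) would then serve the page without the expensive part.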
For one, it potentially multiplies the testing and regression testing requirements to hit all those additional configurations.
Effectively every piece of software written for at most a few thousand people to use concurrently (i.e. 99.99% of software).

Consumer apps that scale to hundreds of thousands of users with five 9s+ uptime requirements are very rare.

So at my previous place we had a monolith with roughly 700 different URL handlers. Most of the problem with things like this was understanding what they all did.

Applying rate limiting, selective dropping of traffic, even just monitoring things by how much they affect the user experience, all require knowing what each one is doing. Figuring that out for one takes very little time. Figuring it out for 700 made it a project we'd never do.

The way I'd start with this is just by tagging things as I go. I'd build a lightweight way to attach a small amount of metadata to URL handlers/RPC handlers/GraphQL resolvers/whatever, and I'd decide on a few facts to start with about each one – is it customer facing, is it authenticated, is it read or write, is it critical or nice to have, a few things like that. Then I'd do nothing else. That's probably a few hours of work, and would add almost no overhead.

Now when it comes to needing something like this, you've got a base of understanding of the system to start from. You can incrementally use these, you can incrementally enforce that they are correct through other analysis, but the point is that I think it's low effort as a starting point with a potentially very high payoff.
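
The tagging idea above could be sketched in Python roughly like this. The decorator, the `HandlerInfo` fields, the registry, and the example paths are all hypothetical, chosen to mirror the facts listed in the comment rather than any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandlerInfo:
    customer_facing: bool
    authenticated: bool
    writes: bool
    critical: bool


REGISTRY = {}  # path -> handler function carrying an `info` attribute


def handler(path, *, customer_facing, authenticated, writes, critical):
    """Attach metadata to a handler and register it; behaviour is unchanged."""
    info = HandlerInfo(customer_facing, authenticated, writes, critical)

    def decorate(func):
        func.info = info
        REGISTRY[path] = func
        return func

    return decorate


@handler("/checkout", customer_facing=True, authenticated=True,
         writes=True, critical=True)
def checkout(request):
    return "order placed"


@handler("/recommendations", customer_facing=True, authenticated=False,
         writes=False, critical=False)
def recommendations(request):
    return "you might also like..."


def sheddable_paths():
    # The kind of question the tags make cheap to answer later:
    # which endpoints could be dropped or rate-limited first under load?
    return sorted(p for p, f in REGISTRY.items() if not f.info.critical)
```

Making the keyword arguments mandatory means every new handler has to state these facts up front, which is where most of the value comes from.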