

	by danpalmer 839 days ago
	Joining Google a few years ago, one thing I was impressed with is the amount of effort that goes into graceful degradation. For user facing services it gets quite granular, and is deeply integrated into the stack – from application layer to networking. Previously I worked on a big web app at a growing startup, and it's probably the sort of thing I'd start adding in small ways from the early days. Being able to turn off unnecessary writes, turn down the rate of more expensive computation, turn down rates of traffic amplification, these would all have been useful levers in some of our outages.


tuyguntn 839 days ago

It's really great to have such capabilities, but adding them has a cost that only a few can afford: the investment in building them, which slows your feature velocity, plus the ongoing maintenance.


zerkten 839 days ago

Can you be specific about the cost of building these?

I've run into many situations where something was deemed costly, the gap was found out later, and the team ultimately had to implement it all while hoping no one grokked that it had been predicted. "Nobody ever gets credit for fixing problems that never happened" (https://news.ycombinator.com/item?id=39472693) is related.


nostrademons 839 days ago

When I was in Search 15 or so years ago, there was actually a very direct cost: revenue.

The AdMixer was an "optional" response for the search page. If the ads didn't return before the search results did, the search would just not show ads, and Google wouldn't get any revenue for it. Showed the premium that Google of the day put on latency and user experience. I think we lost a few million per year to timeouts, but it was worth it for generating user loyalty, and it put a very big incentive on the ads team to keep the serving stack fast.
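A minimal sketch of that "optional response" pattern, not the actual AdMixer design: the optional call runs concurrently with the required one, and its answer is used only if it is already available when the required results come back. The function names and the simulated latencies are assumptions made up for illustration.

```python
import asyncio

async def fetch_results(query):
    # Hypothetical stand-in for the required backend (search results).
    await asyncio.sleep(0.01)
    return [f"result for {query}"]

async def fetch_ads(query, delay):
    # Hypothetical stand-in for the optional backend; `delay` simulates latency.
    await asyncio.sleep(delay)
    return [f"ad for {query}"]

async def serve(query, ads_delay):
    # Start the optional call first so it runs alongside the required one.
    ads_task = asyncio.create_task(fetch_ads(query, ads_delay))
    results = await fetch_results(query)
    # The required results are ready: take the ads only if they finished too.
    if ads_task.done():
        ads = ads_task.result()
    else:
        ads_task.cancel()  # too slow, so the page ships without ads
        ads = []
    return {"results": results, "ads": ads}

def render(query, ads_delay):
    return asyncio.run(serve(query, ads_delay))
```

With a fast ads backend the page carries both parts; with a slow one it is served on time with an empty `ads` list, which is where the lost revenue described above would come from.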

No idea if it's still architected like that, I kinda doubt it given recent search experiences, but I thought it was brilliant just for the sake of aligning incentives between different parts of the organization.


spacebanana7 839 days ago

The developer, tester and devops time required to properly implement graceful degradation could easily accumulate to hundreds of hours.

Those hours are directly expensive when your developers cost hundreds of dollars a day, and they carry a material opportunity cost, because committing developers to one project delays the delivery of other features.

Moreover, any new features would have to be made compatible with the graceful degradation pattern, creating an ongoing cost.


IggleSniggle 839 days ago

When you hire an engineer to build a dam, you expect them to consider piping and subsurface flows such that the foundation isn't swept out in a decade. No matter if the engineer has already been paid, has retired, etc.

My point isn't that we all need to make dams that can hold up for a century. The point is that you hire an engineer because you want someone with the judgement and expertise to apply the correct amount of engineering to any given solution. Over-engineering is on the pathway to correct-sized engineering. It's the experience, discovery, and exploration required to arrive at choosing what things actually do not need to be done.

When your manager asks you, "Do we really need to do that?", it's the expert who can explain why it really is necessary, and the professional who accepts "we're not going to do that" as an answer. And if they still feel it would be harmful not to do it, then that's where professional duty kicks in.


mlyle 839 days ago

There are a lot of levels to the approach.

Just spending a few moments to consider whether queues should grow, block, or spill when adding them makes a big difference, along with choices in error handling. You can get a lot of things to gracefully degrade for free if that's a part of your decision-making process.
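As a rough illustration of those three choices, here is a sketch using Python's standard `queue` module. The helper names are made up, and "spill" is interpreted here as diverting overflow to a side list; in practice that could be a disk buffer or a plain drop.

```python
import queue

def offer_blocking(q, item, timeout):
    """Block: apply backpressure, giving up after `timeout` seconds."""
    try:
        q.put(item, timeout=timeout)
        return True
    except queue.Full:
        return False

def offer_spilling(q, item, spilled):
    """Spill: never wait; divert the item when the queue is full."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        spilled.append(item)
        return False

unbounded = queue.Queue()          # grow: never rejects, memory is the only limit
bounded = queue.Queue(maxsize=2)   # block or spill: capacity is an explicit choice

spilled = []
accepted = [offer_spilling(bounded, i, spilled) for i in range(4)]
# accepted is [True, True, False, False] and spilled is [2, 3]
```

The unbounded queue is the default most code ends up with, and it is the one that tends to turn a slow consumer into an out-of-memory outage.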


jacobcoro 839 days ago

Could be as simple as just some feature flags with environment variables
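Something like the following sketch, for example. The variable name `FEATURE_RECOMMENDATIONS`, the set of "off" values, and the page structure are all assumptions for illustration.

```python
import os

OFF_VALUES = {"0", "false", "off", "no", ""}

def flag_enabled(name, default=True, env=os.environ):
    """Return `default` if the variable is unset, else False for an "off" value."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in OFF_VALUES

def product_page(product, env=os.environ):
    page = {"title": product}
    # Recommendations are nice to have, so they sit behind a kill switch.
    if flag_enabled("FEATURE_RECOMMENDATIONS", env=env):
        page["recommendations"] = ["related-1", "related-2"]
    return page
```

Setting `FEATURE_RECOMMENDATIONS=off` and restarting the process would then serve the page without the optional section. Reading flags from the environment usually means a restart or redeploy to change them, which may or may not be fast enough during an outage.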


groestl 839 days ago

I also found that when building a feature iteratively, with feature flags for rollout, a simple feature degradation path often appears naturally.


rkagerer 839 days ago

For one, it potentially multiplies the testing and regression testing requirements to hit all those additional configurations.


kortilla 839 days ago

Effectively every piece of software written for at most a few thousand people to use concurrently (i.e. 99.99% of software).

Consumer apps that scale to hundreds of thousands of users with five 9s+ uptime requirements are very rare.


danpalmer 839 days ago

So at my previous place we had a monolith with roughly 700 different URL handlers. Most of the problem with things like this was understanding what they all did.

Applying rate limiting, selective dropping of traffic, even just monitoring things by how much they affect the user experience, all require knowing what each one is doing. Figuring that out for one takes very little time. Figuring it out for 700 made it a project we'd never do.

The way I'd start with this is just by tagging things as I go. I'd build a lightweight way to attach a small amount of metadata to URL handlers/RPC handlers/GraphQL resolvers/whatever, and I'd decide on a few facts to record about each one: is it customer facing, is it authenticated, is it read or write, is it critical or nice to have, a few things like that. Then I'd do nothing else. That's probably a few hours of work, and would add almost no overhead.
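One way that tagging could look, as a sketch rather than anything from a real codebase: a decorator that records a few facts per handler in a registry and otherwise leaves the handler untouched. The names `describe`, `HandlerInfo`, and `HANDLER_INFO` are hypothetical, as are the two example handlers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HandlerInfo:
    customer_facing: bool
    authenticated: bool
    writes: bool
    critical: bool

HANDLER_INFO = {}  # handler name -> HandlerInfo

def describe(**facts):
    """Attach metadata to a handler without changing its behaviour."""
    info = HandlerInfo(**facts)

    def decorator(func):
        HANDLER_INFO[func.__name__] = info
        func.info = info
        return func

    return decorator

@describe(customer_facing=True, authenticated=True, writes=True, critical=True)
def checkout(request):
    return "order placed"

@describe(customer_facing=True, authenticated=False, writes=False, critical=False)
def recommendations(request):
    return "you may also like"
```

Because `HandlerInfo` requires every field, forgetting one raises a `TypeError` at import time, which is a cheap way to keep the tags complete as handlers are added.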

Now when it comes to needing something like this, you've got a base of understanding of the system to start from. You can incrementally use these, you can incrementally enforce that they are correct through other analysis, but the point is that I think it's low effort as a starting point with a potentially very high payoff.
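To make the payoff concrete, here is a sketch of one incremental use: given per-handler facts like the ones described above, a single degradation level can decide what to keep serving. The handler names, the tags, and the three levels are assumptions chosen for illustration.

```python
# Hypothetical tags, one entry per handler: (is_write, is_critical).
TAGS = {
    "checkout": (True, True),
    "view_product": (False, True),
    "save_search_history": (True, False),
    "recommendations": (False, False),
}

def should_serve(handler, level):
    """Decide whether to serve `handler` at degradation level 0, 1 or 2.

    0: serve everything; 1: drop non-critical writes; 2: critical handlers only.
    """
    is_write, is_critical = TAGS[handler]
    if level >= 2:
        return is_critical
    if level == 1:
        return is_critical or not is_write
    return True

def served_at(level):
    # Names of the handlers that stay up at the given level, sorted for display.
    return sorted(h for h in TAGS if should_serve(h, level))
```

Here `served_at(1)` keeps everything except `save_search_history`, and `served_at(2)` keeps only `checkout` and `view_product`. None of this needs to exist on day one; the tags are what make it cheap to add later.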
