by dyanacek 1709 days ago

This is a great way to describe it! I gave a similar example with pagination, where it might be better to prioritize requests for later pages over requests for the first page, but your example is a nicer illustration. Thanks for that!

After I wrote the article, someone also told me that they can fall back to statically rendered versions of certain pages on Amazon.com during overload. The trick is to have a page that is still useful!

And for the “turning off features” idea - this happens today on Amazon.com. If a feature on the site fails to render successfully or on time, it’s left off of the page. Critical functionality can’t be left off, so it’s a judgement call on what’s allowed to fail the page render.
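
A rough sketch of that "leave it off the page" behavior, in Python. This is only an illustration and not how Amazon.com is built: `render_page`, the `(name, render_fn, critical)` tuples, and the 200 ms deadline are all made-up names and numbers. Optional features that fail or miss the deadline are dropped, while a critical feature that fails or times out fails the whole render.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def render_page(features, timeout_s=0.2):
    """Render features concurrently under one shared deadline.

    features: list of (name, render_fn, critical) tuples (hypothetical shape).
    Returns the rendered parts, in order, for the features that made it.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(features)))
    futures = [(critical, pool.submit(fn)) for _name, fn, critical in features]
    deadline = time.monotonic() + timeout_s
    parts = []
    try:
        for critical, future in futures:
            remaining = max(0.0, deadline - time.monotonic())
            try:
                parts.append(future.result(timeout=remaining))
            except Exception:
                # Covers both a render error and a timeout.
                if critical:
                    raise  # critical feature: fail the page render
                # optional feature: leave it off the page
    finally:
        # Don't block the response on stragglers still running.
        pool.shutdown(wait=False)
    return parts
```

The judgement call from the comment shows up as the `critical` flag: someone has to decide, per feature, which side of that line it falls on.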

tyingq 1709 days ago

Ah, yes, you're right...I missed the pagination example fitting that pattern.

"If a feature on the site fails to render successfully or on time, it’s left off of the page. Critical functionality can be left off, so it’s a judgement call on what’s allowed to fail the page render."

Oh, that's useful also, but I meant a step further, where the page doesn't ask for those widgets if (load > X), which avoids calling them at all.
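
Something like this minimal Python sketch, where `widgets_to_request`, the `(name, critical)` pairs, and the way `load` is measured are all placeholders I'm assuming for illustration:

```python
def widgets_to_request(widgets, load, threshold):
    """Decide which widgets the page should even ask for.

    widgets: list of (name, critical) pairs (hypothetical shape).
    load / threshold: whatever load signal and cutoff "X" the site uses.
    """
    if load > threshold:
        # Overloaded: skip optional widgets entirely, so no call is made.
        return [name for name, critical in widgets if critical]
    return [name for name, _critical in widgets]
```

The difference from dropping a widget after it fails is that the backend for the optional widget never sees the request in the first place.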

dyanacek 1709 days ago

Good point about avoiding the call in the first place. This is a very tricky topic, I’ve found. Things that try to guess the nuanced health of a dependency can lead to outages when they guess wrong. These circuit breakers are helpful if they’re right, but harmful if they’re wrong.

For example, say a service is backed by a partitioned cache cluster, where the data is hashed to a particular cache node. Now let’s say one node has a problem, causing requests for data that lives on that node to fail, but others to succeed. If a client is making requests for data that happens to be spread across all nodes (the client doesn’t know about these nodes by the way, it’s just an implementation detail of the service) and sees an increased error rate, should it start failing some requests? If it does, it could take a single partition outage and widen the scope of impact into a full outage.
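
Here is a toy Python simulation of that scenario, just to put numbers on it. Everything in it is assumed for illustration: 10 nodes, one bad node, keys mapped by `key % NODES`, and a deliberately naive `ErrorRateBreaker` that opens once the error rate in its first full window exceeds a threshold and never closes again (a real breaker would normally retry after a cool-down).

```python
NODES = 10     # assumed cluster size
BAD_NODE = 3   # the one unhealthy partition


def call_service(key):
    """Return True on success. Only keys hashed to the bad node fail."""
    return key % NODES != BAD_NODE


class ErrorRateBreaker:
    """Naive client-side breaker: trips on overall error rate, never resets."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.results = []
        self.open = False

    def call(self, key):
        if self.open:
            return False  # fail fast without calling the service
        ok = call_service(key)
        self.results.append(ok)
        recent = self.results[-self.window:]
        if len(recent) == self.window:
            error_rate = recent.count(False) / self.window
            if error_rate > self.threshold:
                self.open = True
        return ok


def count_failures(keys, breaker=None):
    failures = 0
    for key in keys:
        ok = breaker.call(key) if breaker else call_service(key)
        if not ok:
            failures += 1
    return failures


direct = count_failures(range(1000))
with_breaker = count_failures(range(1000), ErrorRateBreaker(threshold=0.05, window=50))
print(direct, with_breaker)
```

With these made-up numbers, the plain client fails 100 of 1,000 requests (the 10% that live on the bad node), while the client with the breaker fails 955: the 5 real failures it saw before tripping, plus the 950 requests it refused afterward, most of which would have succeeded.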

Anyway I’ve been meaning to write an Amazon Builders’ Library article on this topic, or to convince someone else to do it (looking at you, Marc Brooker!)