| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by efitz 1653 days ago

You don’t know what you’re talking about.

AWS spends a lot of time thinking about this problem in service to their customers.

How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

It would be dumb and useless to turn something red every single time anything had a problem. Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock on these problems. Most of the problems either don’t affect anyone due to redundancy or affect only a tiny number of customers- a failed memory module or top-of-rack switch or a random bit flip in one host for one service.

Would it help anyone to tell everyone about all these problems? People would quickly learn to ignore it as it had no bearing on their experience.

What you’re really arguing is that you don’t like the thresholds they’ve chosen. That’s fine, everyone has an opinion. The purpose of health dashboards like these are mostly so that customers can quickly get an answer to “is it them or me” when there’s a problem.

As others on this thread have pointed out, AWS has done a pretty good job of making the SHD align with the subjective experience of most customers. They also have personal health dashboards unique to each customer, but I assume thresholding is still involved.

2 comments

Karunamon 1653 days ago

>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

A good low-hanging fruit would be, when the outage is significant enough to have reached the media, you turn the dot red.

Dishonesty is what we're talking about here. Not the gradient when you change colors. This is hardly the first major outage where the AWS status board was a bald-faced lie. This deserves calling out and shaming the responsible parties, nothing less, certainly not defense of blatantly deceptive practices that most companies not named Amazon don't dip into.

efitz 1651 days ago

Human-in-the-loop != lying.

Broken dashboard != lying.

The specific charge of “lying” is what I dispute.

toss1 1652 days ago

>>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?

There's a limitless variety of options, and multiple books written about it. I can recommend the series "The Visual Display of Quantitative Information" by Edward Tufte, for starters.

>> Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock...

Of course there are, so a single R/Y/G indicator is obviously a bad choice.

Again, they could at any time easily choose a better way to display this information, graphs, heatmaps, whatever.

More importantly, the one thing that should NOT be chosen is A) to have a human in the loop of displaying status, as this inserts both delay and errors.

Worse yet, to make it so that it is a VP-level decision, as if it were a $1million+ purchase, and then to set the policy to keep it green when half a continent is down... ummm that is WAAAYYY past any question of "threshold" - it is a premeditated, designed-in, systemic lie.

>>You don’t know what you’re talking about. Look in the mirror, dude. While I haven't worked inside AWS, I have worked in complex network software systems and well understand the issues of thousands of HW/SW components in multiple states. More importantly, perhaps it's my philosophy degree, but I can sort out WHEN (e.g., here) the problem is at another level altogether. It is not the complexity of the system that is the problem, it is the MANAGEMENT decision to systematically lie about that complexity. Worse yet, it looks like those lies on an everyday basis are what goes into their claims of "99.99+% uptime!!" evidently false. The problem is at the forest level, and you don't even want to look at the trees because you're stuck in the underbrush telling everyone else they are clueless.