by alectroem 979 days ago
Wow, I literally did a full cluster version upgrade last night without knowing about this. I would have delayed the upgrade if I had known GKE was failing for "a small number of customers"

I wish cloud providers would just communicate outages to services I use like this to me!

9 comments

> communicate outages to services I use

In general, they seem bad at communicating relevant information. Just looking over at emails from Google, every single one of the last 10 emails they sent me was not relevant to me specifically.

> "[Important notice] Tax changes in Nepal" (yet I have never made a sale in Nepal)

> "Secure your Google Admin account with these best practices" (yet I already do all those things)

> "Preparing for the upcoming Google Ads change on October 31, 2023" (it's all about mediation waterfalls, and I'm 90% sure I don't use them, and 100% sure I don't know what they are)

AWS got this very right with the per-account incident dashboard.
Except when their systems don't correctly discern what services you may be using...
I haven't had that comms issue with Google but I have to say, even though I prefer GCP to AWS in terms of user friendliness, it is far too often the case that you find exactly the solution you need only to learn it's deprecated in favour of a less useful alternative.
They do publish an RSS feed for the status page, but there is no direct way to get notified AFAIK. I used to create a Slack notifier using IFTTT.
For infra folks, I always suggest having a Slack channel with RSS feeds from vendor incident sites.

Slack lets you subscribe directly to an RSS feed using /feed.
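If /feed is too noisy, a small script can filter the feed down to the services you actually use before posting. A rough sketch, not a tested integration: it assumes the vendor publishes an Atom feed, that you have a Slack incoming webhook URL, and it uses a made-up sample feed with placeholder URLs. The function names (`parse_atom`, `relevant`, `slack_payload`, `post_to_slack`) are hypothetical helpers, not part of any vendor or Slack library.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace

# Made-up sample feed for illustration; a real one would be fetched over HTTP.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:example,2023:1</id>
    <title>UPDATE: Global: Google Kubernetes Engine Nodepool Upgrade Failures</title>
    <link href="https://status.example.com/incidents/1"/>
  </entry>
  <entry>
    <id>tag:example,2023:2</id>
    <title>RESOLVED: Some other service latency</title>
    <link href="https://status.example.com/incidents/2"/>
  </entry>
</feed>
"""


def parse_atom(xml_text):
    """Return a list of {'id', 'title', 'link'} dicts from an Atom feed string."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall(f"{ATOM}entry"):
        link = entry.find(f"{ATOM}link")
        entries.append({
            "id": entry.findtext(f"{ATOM}id", default=""),
            "title": entry.findtext(f"{ATOM}title", default=""),
            "link": link.get("href", "") if link is not None else "",
        })
    return entries


def relevant(entries, keywords, seen_ids):
    """Keep unseen entries whose title mentions a service we care about."""
    hits = []
    for e in entries:
        title = e["title"].lower()
        if e["id"] not in seen_ids and any(k.lower() in title for k in keywords):
            hits.append(e)
    return hits


def slack_payload(entry):
    """Build the JSON body for a Slack incoming webhook (a plain 'text' message)."""
    return {"text": f"Vendor incident: {entry['title']} {entry['link']}"}


def post_to_slack(webhook_url, payload):
    """POST the payload to the webhook; webhook_url is your own secret URL."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


# Offline demo: filter the sample feed and print what would be posted.
for hit in relevant(parse_atom(SAMPLE_FEED), ["Kubernetes Engine"], seen_ids=set()):
    print(slack_payload(hit)["text"])
```

You'd run something like this on a timer, persist the seen IDs between runs, and call `post_to_slack` instead of `print`. The keyword match on titles is crude, so expect to tune it per vendor.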

Don't put this in a special channel just for vendor incidents. Hopefully you have a channel for each vendor tool where you have a vendor representative present. You put that bot in there. It's much more likely to be noticed and much less likely to be a channel that everyone ignores because its SNR is too low.
"Vendor representative"? Internal or external contact?

Sounds like a good approach nonetheless

External. Not every vendor will do this, but if you are big enough and they are big enough, it never hurts to ask.
Yeah, we did do this when we got large enough for a shared Slack channel with AWS. But for most orgs, just having this piped to your alerts channel is good enough.
Yes, I used to do that for my teams. Most infra vendors have RSS feeds for their public status pages.
That's a really good idea!
You can also use the (pre-GA) Service Health API to get alerts specifically for the regions and services you use. It's pretty nice!

https://cloud.google.com/service-health/docs/overview#how-pe...

There is a 'Google Cloud Service Health Updates' Slack app that has notifications about this incident. Here's what it looks like on our channel:

    8:18 PM
    APP UPDATE: Global: Google Kubernetes Engine Nodepool Upgrade Failures
    Incident began at 2023-10-02 11:29 (all times are US/Pacific). Summary: Global: Google Kubernetes Engine Nodepool Upgrade Failures
    Description: A mitigation has been rolling out and we are assessing its effectiveness. We will provide an update by Tuesday 2023-10-10 12:00 US/Pacific with current details.
    Diagnosis: A small number of customers are experiencing failed nodepool upgrades. Customers experiencing this,  may see "Internal error" in Google Cloud Console. Retrying is suggested but may...

There are quite a lot of alerts about various issues.
I will say CircleCI’s dashboard makes it impossible to NOT know there is an outage going on by putting it in the sidebar. Unless it’s collapsed you’ll be aware of everything breaking (that they report) to the point where it feels like everything is breaking all the time
If you've ever worked at a cloud provider, then you know everything is breaking all the time. The good ones are just able to hide it most of the time/for most customers.
I wish I had a better answer but:

When you suspect things are broken, check X (FKA Twitter).

Other devs will be talking about it before there's an official status page.

… if only the site formerly known as Twitter wasn't so hostile to being checked these days.

If we as an industry can't think of something better (cough honest status pages cough) … can we at least transition these tweets to Mastodon?

Honest status pages are a really, really hard problem.

If your connector between your status monitor and the service breaks you'll have some subset of users panicking and causing problems (or asking for refunds for outages) when the service was up the entire time.

3rd party services are the only ones that you'll get a "more honest" but not always correct view of what the actual status is.

Even if they just updated it manually for major incidents, it would still be useful.
Define major incident. Define update.
When an incident is declared, have someone tasked with determining customer impact. If the impact radius is greater than a handful of customers, declare a public incident. If customer communication is made a priority, then you can actually have a helpful status page.

Where I work, just about any non-false-alarm incident ends up on the status page in a timely manner. There's nothing stopping the likes of AWS from doing the same except for culture.

Why? It's just a judgment call
> … if only the site formerly known as Twitter wasn't so hostile to being checked these days.

You can easily check it manually. Search for "EBS" and see a bunch of people talking about EBS timeouts or whatever. That was what I was getting at.

But yeah scraping is harder now.

I'm not talking about scraping, I'm talking about manual usage.

> You can easily check it manually.

You cannot. Twitter's site is plagued by redirect loops. If you work around those, these days /search just redirs to the login page. You can view single tweets, but there won't be any replies. (I have no idea if the site formerly known as Twitter is still rate-limiting views, or if they canned that.)

It is unusable if you're not actively logged in, and some of us have no desire to give away a phone number just to see AWS's true status.

Oh. I use Twitter at least every waking hour and haven't seen that.
Looks like it's only affecting clusters on 1.24. If you upgraded, it was likely to 1.27.
The status page became political.
Sorry what do you mean by this?
Any metric that becomes a target ceases to be a good metric. https://health.aws.amazon.com/health/status

See all that green there? Once it started becoming "monitored" by VPs instead of the software engineers on call, they started to become political. I bet there are several sev2s happening for several of those services even as we speak but it still shows green to the outside observer. If one has access to the actual metrics for those services, I bet we would see a different story than what is shown on the "status" page.

There was something posted here, probably a few days ago, to the effect of 'Their 9s are not your 9s'. Like yes, their status is showing an error rate of less than .00001%. However, all of those errors are being generated by your 5 instances that are completely down.
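To put rough numbers on the "their 9s are not your 9s" point, here's a toy calculation. All the figures are invented for illustration (10 billion fleet-wide requests, one customer whose 1,000 requests all fail); they are not from any provider's real metrics.

```python
from fractions import Fraction

# Hypothetical numbers, chosen only to illustrate the arithmetic.
fleet_total = 10_000_000_000   # requests served across all customers
customer_total = 1_000         # requests sent by one unlucky customer
customer_failed = 1_000        # every one of that customer's requests fails

# Assume nobody else sees errors, so all failures belong to this customer.
fleet_rate = Fraction(customer_failed, fleet_total)
customer_rate = Fraction(customer_failed, customer_total)

print(f"fleet-wide error rate: {float(fleet_rate * 100):.5f}%")
print(f"that customer's error rate: {float(customer_rate * 100):.0f}%")
```

Same failures, two very different percentages: the fleet-wide number rounds to "all green" while the affected customer is completely down.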
I think they mean that companies rarely update their status pages to reflect reality (for instance, AWS outages are rarely shown on their own status pages). This is often by design, company policy, or a desire to save face.
And it's so incredibly dumb. Companies need to get it through their thick heads that this is so incredibly short-sighted.

Not once has a status page that's devoid of information or slow to update ever saved face. I am far more annoyed to have to continue to verify "no, it is indeed that your service is down, not mine" and then file a support ticket. I am triply annoyed if the response from support is "ah yes that's a known problem and we're working on it" — known, and you just didn't bother to communicate.

I miss the days when GitHub had graphs. Even if they simply hadn't had the time to put a message on the page, you could tell from the graphs that it was GitHub. But even with "more information" that some PM might not like being put out publicly, GitHub felt more reliable & stable in those days.

At the end of the day, no amount of political gamesmanship will save you from having to actually run a reliable service, and gamesmanship just makes it more likely I'll ascribe false positives to your service, further lowering my perception of its reliability.

It's so watered down that the "AWS" emoji in our Slack instance is literally a meme of the status page.

I bet as long as the status page is not updated, it is not taken into account when calculating quarterly or yearly uptime statistics.

I am sure that counts. Probably tied into someone's bonus as well.

> Probably tied into someone's bonus as well.

very likely not just one.

not sure how this could work, as some qa/sre would have to be paid even more in order to effectively work against this type of falsification. there would have to be a very strong incentive for the company to do that.