by romanhotsiy 1652 days ago
It's funny that the first place I go to learn about the outage is Hacker News and not https://status.aws.amazon.com/ (it still reports everything to be "operating normally"...)
5 comments

I made sure our incident response plan includes checking Hacker News and Twitter for actual updates and information.

As of right now, this thread and one update from a Twitter user, https://twitter.com/SiteRelEnby/status/1468253604876333059, are all we have. I went into disaster recovery mode when I saw our traffic suddenly drop to 0 at 10:30am ET. That turned out to be just SQS (or something else) preventing our ELB logs from being extracted to DataDog, though.

So as of the time you posted this comment, were other services actually down? The way the 500 shows up, and the AWS status page, make it sound like "only" the main landing page/management console is unavailable, not AWS services.
Yes, they are still publishing lies on their status page. In this thread people are reporting issues with many services. I'm seeing periodic S3 PUT failures for the last 1.5 hours.
AWS services are all built against each other so one failing will take down a bunch more which take down more like dominos. Internally there’s a list of >20 “public facing” AWS services impacted.
I always got the impression that downdetector worked by logging the number of times they get a hit for a particular service and using that as a heuristic to determine if something is down. If so, that's brilliant.
It's brilliant until the information is bad.

When Facebook's properties all went down in October, people were saying that AT&T and other cell phone carriers were also down - because they couldn't connect to FB/Insta/etc. There were even some media reports that cited Downdetector, seemingly without understanding that its reports are basically crowdsourced and sometimes the crowd is wrong.

I think it's a bit simpler for AWS: there's a big red "I have a problem with AWS" button on that page. You click it, tell it what your problem is, and it logs a report. Unless that's what you were driving at and I missed it. It's early. Too early for AWS to be down :(

Some 3600 people have hit that button in the last ~15 minutes.
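
For what it's worth, the "count the reports and treat a spike as an outage" idea is easy to sketch. This is only a guess at the general technique, not how Downdetector actually works: the class name, the 15-minute window, the baseline of 50 reports per window, and the 5x spike factor are all made-up assumptions for illustration.

```python
from collections import deque


class ReportSpikeDetector:
    """Hypothetical sketch: flag a likely outage when the number of user
    reports in a sliding time window reaches a multiple of a baseline."""

    def __init__(self, window_seconds=900, baseline_per_window=50, spike_factor=5.0):
        # All three defaults are illustrative assumptions, not real values.
        self.window_seconds = window_seconds
        self.baseline_per_window = baseline_per_window
        self.spike_factor = spike_factor
        self._timestamps = deque()

    def record(self, timestamp):
        """Log one report. Timestamps are seconds, assumed non-decreasing."""
        self._timestamps.append(timestamp)
        self._evict(timestamp)

    def _evict(self, now):
        # Drop reports that have fallen out of the sliding window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()

    def count(self, now):
        """Number of reports still inside the window ending at `now`."""
        self._evict(now)
        return len(self._timestamps)

    def looks_down(self, now):
        """True when the windowed count reaches the spike threshold."""
        return self.count(now) >= self.spike_factor * self.baseline_per_window


if __name__ == "__main__":
    detector = ReportSpikeDetector()
    # Roughly the situation above: ~3600 reports spread over ~15 minutes.
    for i in range(3600):
        detector.record(i * 0.25)
    print(detector.looks_down(900))
```

Note that this only measures how many people are complaining, not what is actually broken, so it inherits the "sometimes the crowd is wrong" problem mentioned above.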

Now 57 minutes later and it still reports everything as operating normally.
It shows errors now.
It doesn't show errors with Lambda, and we are clearly experiencing them.
Community reporting > internal operations
I usually go on Twitter first for outages.