by terom 693 days ago
https://azure.status.microsoft/en-us/status/history/ doesn't seem to have links to the individual incidents. Some reports claim the Azure / Microsoft 365 outages were related to crowdstrike, but this sounds like an entirely separate incident.

AFAIK the broken crowdstrike channel update happened at 2024-07-19 06:05 UTC and was "fixed" (rolled back) at 06:47 UTC, but I don't have a proper source for that timeline?

EDIT: https://azure.status.microsoft/en-gb/status claims 2024-07-18 19:00 UTC as the approximate start of impact for the crowdstrike update. It would be nice to find a proper source for the start and mitigation timelines...

EDIT: reddit threads reporting symptoms start at approx 2024-07-19 05:00 UTC. That would mean the crowdstrike impact started soon after the azure recovery.

---

What happened?

Between 21:56 UTC on 18 July 2024 and 12:15 UTC on 19 July 2024, customers may have experienced issues with multiple Azure services in the Central US region including failures with service management operations and connectivity or availability of services. A storage incident impacted the availability of Virtual Machines which may have also restarted unexpectedly. Services with dependencies on the impacted virtual machines and storage resources would have experienced impact.

What do we know so far?

We determined that a backend cluster management workflow deployed a configuration change causing backend access to be blocked between a subset of Azure Storage clusters and compute resources in the Central US region. This resulted in the compute resources automatically restarting when connectivity was lost to virtual disks hosted on impacted storage resources.

How did we respond?

21:56 UTC on 18 July 2024 – Customer impact began

22:13 UTC on 18 July 2024 – Storage team started investigating

22:41 UTC on 18 July 2024 – Additional Teams engaged to assist investigations

23:27 UTC on 18 July 2024 – All deployments in Central US stopped

23:35 UTC on 18 July 2024 – All deployments paused for all regions

00:45 UTC on 19 July 2024 – A configuration change as the underlying cause was confirmed

01:10 UTC on 19 July 2024 – Mitigation started

01:30 UTC on 19 July 2024 – Customers started seeing signs of recovery

02:51 UTC on 19 July 2024 – 99% of all impacted compute resources recovered

03:23 UTC on 19 July 2024 – All Azure Storage clusters confirmed recovery

03:41 UTC on 19 July 2024 – Mitigation confirmed for compute resources

Between 03:41 and 12:15 UTC on 19 July 2024 – Services which were impacted by this outage recovered progressively and engineers from the respective teams intervened where further manual recovery was needed. Following an extended monitoring period, we determined that impacted services had returned to their expected availability levels.

1 comment

The German BSI (Federal Office for Information Security) quotes the advisory from CrowdStrike (which is behind their customer login portal) as saying you need to roll back to snapshots prior to 04:09 UTC:

https://www.bsi.bund.de/SharedDocs/Cybersicherheitswarnungen...

So your Redditor saying 05:00 UTC seems to be close.

https://old.reddit.com/r/crowdstrike/comments/1e6vmkf/bsod_e... quotes the CrowdStrike advisory verbatim, which has timestamps of 04:09 UTC for the problematic version and 05:27 for the reverted (good) version.
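For what it's worth, here is a rough sketch of the arithmetic on the timestamps quoted in this thread. It assumes the 04:09 and 05:27 UTC times from the CrowdStrike advisory both fall on 2024-07-19 (the quotes above give only the time of day), and the variable names are made up for illustration:

```python
from datetime import datetime, timezone


def utc(text):
    """Parse 'YYYY-MM-DD HH:MM' as a timezone-aware UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


def minutes(delta):
    """Convert a timedelta to whole minutes."""
    return int(delta.total_seconds() // 60)


# Azure Central US timeline, as quoted from the status page above.
azure_impact_start = utc("2024-07-18 21:56")
azure_compute_mitigated = utc("2024-07-19 03:41")
azure_fully_recovered = utc("2024-07-19 12:15")

# CrowdStrike advisory timestamps; the 2024-07-19 date is an assumption.
crowdstrike_bad_version = utc("2024-07-19 04:09")
crowdstrike_reverted = utc("2024-07-19 05:27")

azure_to_mitigation = minutes(azure_compute_mitigated - azure_impact_start)
gap_after_azure = minutes(crowdstrike_bad_version - azure_compute_mitigated)
bad_version_window = minutes(crowdstrike_reverted - crowdstrike_bad_version)

# True if the bad CrowdStrike version was live while Azure services
# were still in their progressive recovery period.
overlaps_recovery = (
    azure_compute_mitigated <= crowdstrike_bad_version
    and crowdstrike_reverted <= azure_fully_recovered
)

print(f"Azure impact start to compute mitigation: {azure_to_mitigation} min")
print(f"Azure compute mitigation to bad CrowdStrike version: {gap_after_azure} min")
print(f"Bad CrowdStrike version live for: {bad_version_window} min")
print(f"Falls inside Azure's progressive recovery period: {overlaps_recovery}")
```

Under those assumptions the problematic version went out about half an hour after Azure confirmed mitigation for compute, while dependent services were still recovering, which would explain why the two incidents got conflated in some reports.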