|
DO killed two of our production servers some weeks ago, erroneously, due to an issue on their end that flagged us as part of a DDoS attack. It took us an entire week to recover properly... maybe this might have helped... We were also promised credits for the downtime but never received them; that's minor after the fact, as we're pretty happy with the service overall.

> The Incident: Beginning at 17:10 UTC, May 9th, multiple DigitalOcean customers experienced Droplet network outages due to an action on Droplets by an automated mechanism. This mechanism has been in place at DigitalOcean since 2019. It helps us ensure that any potentially compromised Droplet seen participating in an outbound Denial of Service attack is quickly taken offline. This is in place to assist in protecting all DO customers by ensuring we have a network focused on delivering legitimate traffic at speed and scale, unencumbered by illegitimate traffic. When triggered, this mechanism suspends networking capabilities on the Droplet or Droplet-based services temporarily to allow the owner to investigate the issue. Users are informed via a support ticket and email that details the paths to recovery.
>
> This incident was triggered by an unannounced data change made by a third-party, which DigitalOcean uses to assist in analyzing traffic flow and metrics, as well as detecting malicious traffic patterns. Due to this mechanism constantly running and no changes being made directly by DigitalOcean, our teams were delayed in beginning an incident response. After multiple reports from customers that they believed the notification of outgoing Denial of Service attacks from their Droplets were false positives, an internal incident was declared to investigate the issue and start remediation efforts. After a thorough investigation by the DigitalOcean Security and Networking teams, the root cause was discovered to be an erroneous change made by a third-party service that reports data on traffic.
>
> Contact was established with the third-party, and they confirmed a change had been made. Investigation began on their side, and they confirmed there was a bug causing bad data to be returned from their API. Remediation of this incident was done through multiple paths. Complete resolution was achieved once the third-party rolled back the change that was made, which was causing bad data to be reported to DigitalOcean systems. Before that rollback was able to be put in place, DigitalOcean took direct action to take the automated mechanism that disables Droplet networking offline, given the suspected bad data. The support teams also worked throughout this incident to directly address customer tickets and re-enable networking on impacted Droplets. |