Hacker News
by sitepodmatt 3205 days ago
Are there follow-up post-mortems for incidents like this one, which was over a month ago?

[Fastmail] Services have been restored. The problem was a network peering issue, leading to our services being unavailable to parts of the internet. We're working with our network provider to understand what happened and what improvements we can make in the future. Thank you for your patience.

1 hr 15 min of connectivity issues is quite significant imo.

I'm wary after the Imgix disaster: several hours of 500/504 errors and still no technical post-mortem.

1 comment

Hmm... this is the Aug 5th incident right? Here's the post-mortem from NYI:

"On Saturday, August 5, 2017, there was scheduled maintenance period on a core switch in our NYC Datacenter Facility that was scheduled from 10:30PM - 2AM. Customers that were directly connected to this switch were notified that there would be a service impacting maintenance but there was no expected impact beyond these specific connections.

For some time during this maintenance, traffic from some upstream peers, including Cogent and the NYIIX Peering Exchange, experienced increased latency and intermittent loss of connectivity, due to a misconfiguration that did not effectively re-converge this traffic to other upstream providers. This incident started at approximately 11:30PM and was resolved by 12:30AM.

We have resolved the root cause of the issue, added additional monitoring and updated our notification procedures to ensure that all customers are notified in the future for such maintenance windows, even when there is no expected service impact. We are also upgrading our notification systems and customers will be contacted in the near future to confirm the contacts listed on the account to ensure that all such notifications are properly received."

This is the first time I've seen them mess something like this up in a very long time, and they're really good about fixing their procedures afterwards.

We've been offline for similar lengths of time during the nasty run of DDoS attacks on providers a couple of years ago, but we have since upgraded to fibre to the rack, 10G drops to the external bladecentres, and DDoS protection services on our public IP range, plus hidden private ranges for our backhaul services, and we have successfully mitigated all the drive-by attacks that followed.

... that and we haven't been a target for a bit (touch wood) - though we will write up something about the DNS attacks at some point, they were a bit spectacular. Went from 200 req/sec to 100,000 req/sec for random hostnames on our servers. The engineering challenges to cope with that while still providing our powerful custom DNS options are quite interesting in retrospect. Not so much at the time!