HN Mirror



	by keypusher 4270 days ago
	So, you had a bug in your code. That happens to everyone and I think we all understand. However, there are a number of other issues here which seem systemic and much more troubling. First, that your "flighting" did not catch the problem. Why was that? If the bug caused an infinite loop on all the live storage systems, that seems like it should have been fairly obvious on the customer systems you tested on. Second, that the patch was rolled out to all servers at the same time. You have admitted this was a mistake, but honestly it looks like amateur hour. If you are running business critical distributed cloud infrastructure, you just don't ever do this. Third, that there was extended fallout from rolling the patch back. If there are still customers experiencing downtime from this problem a full day later, that speaks to some serious flaws in the ops architecture and process. If you guys want to compete with AWS and similar platforms, it seems like you have a long way to go still. This set of mistakes should haunt you for a long time, because it's going to come up whenever someone is trying to convince their boss/colleague/team that Azure is a solid solution.

2 comments

coreysa 4270 days ago

Thanks. We are continuing to investigate this and driving needed improvements in our process and technology to avoid similar issues in the future.


ohyesyodo 4270 days ago

The last two times there was a big issue, the same thing happened with the status dashboard (it became inaccessible). I remember the same issue when the certs expired 1.5 years ago. I really like Microsoft and was convinced "you" would somehow isolate the dashboard and host it separately, but it turns out I was wrong. Do you happen to know the reasons for hosting the status dashboard inside of Azure? It seems so counter-intuitive to me. Or is it actually hosted externally but died due to the load when the issue started to appear?

The OP mentions that Microsoft representatives gave info via public forums. When the issue appeared I looked in different places trying to find info, but all I found was a statement saying "We are aware of issues." I looked at the Azure Twitter/blog, ScottGu's Twitter/blog, Hanselman's, and the MSDN forums. I also tried this forum and reddit. Do you know where I should have gone to receive details?


coreysa 4269 days ago

Thanks. Communications and the service health dashboard are two areas where we are creating improvement plans based on what we learned from this event. For the dashboard, we do expect it to continue to run even through outages like this one, but we did encounter an issue with our fallback mechanism that we need to understand more deeply.

For general communications, we did most of our early communication on the event using Twitter, announcing the incident and giving updates. We need to build a more formal multi-pronged approach to communicating, including faster responses in the MSDN forums and here on HN, to make sure we are reaching as many of our customers and partners as possible. Thanks again for the feedback!


je42 4270 days ago

^This
