by keypusher
4223 days ago
So, you had a bug in your code. That happens to everyone, and I think we all understand. However, there are a number of other issues here that seem systemic and much more troubling. First, your "flighting" did not catch the problem. Why was that? If the bug caused an infinite loop on all the live storage systems, it should have been fairly obvious on the customer systems you tested on. Second, the patch was rolled out to all servers at the same time. You have admitted this was a mistake, but honestly it looks like amateur hour. If you are running business-critical distributed cloud infrastructure, you just don't ever do this. Third, there was extended fallout from rolling the patch back. If there are still customers experiencing downtime from this problem a full day later, that speaks to some serious flaws in the ops architecture and process. If you guys want to compete with AWS and similar platforms, it seems like you still have a long way to go. This set of mistakes should haunt you for a long time, because it's going to come up whenever someone is trying to convince their boss/colleague/team that Azure is a solid solution.