|
To clarify a few things:

1. Tim, our sysadmin, was the emergency contact and was on top of things ASAP, but didn't have the particular knowledge to fix it himself. That required a developer.

2. He called me first, since the problem appeared to be in Kiln. I'm a Kiln dev, and I can fix, and have fixed, things on a Sunday evening after a deploy. I missed his first call, but got back to him within 3 minutes (the 30 minutes mentioned in the article was Ben's guess). I started diagnosing the problem and realized the problem wasn't in Kiln specifically, but in the communication between Kiln and FogBugz. That meant we needed a FogBugz dev, which we got quickly, and possibly a deploy.

3. That led us to investigate rolling the specific account back to the previous version. 98 percent of our updates are reversible, but as Ben mentioned, this particular release included not one, but two irreversible database migrations, and since the upgrade step had run successfully, going back was not an option.

4. All tests passed (both automated and manual). Ben has updated the article to make it clear that all but one of the API calls between Kiln and FogBugz were working, and the one call that was broken, the one that led to this crash, is called very infrequently (on the order of months for some accounts). Yes, integration tests should have covered that one API call, and they will, but missing one corner case is very different from not doing integration testing.

5. Given the situation we were in, the problem will always be solved more quickly when we're in the office than when we're at home. We take every possible precaution to avoid outages, but they will still happen, and moving to mid-week deploys is just another precaution to decrease the impact of these outages if and when they occur in the future.

So in short, yes, this is the very definition of real software, and we take this very seriously. Your bullet list of armchair quarterback suggestions grossly oversimplifies the situation.
The goal is to have problems affect fewer users, which is directly affected by our response time. |