| I can't believe how misguided this post is... 1. They provide a service to people around the world, yet they don't ensure that someone is available as an emergency contact on Sunday evening, when the first post-deployment usage happens. 2. They don't have a universal list of "this breaks, contact that guy". 3. They don't have a known instant rollback procedure for a release. 4. They don't have cross-component integration tests and they don't do them manually either. 5. They decide that since they can't do a release that doesn't break stuff and can't organise themselves to resolve it quickly during the weekend when it affects only a small number of people, they'll do releases in the middle of the day now, so that they hear customers complaining right away. Is that for real? Is he serious? Here's what I would get out of that issue (even if it's basically reiterating the "wrong" things above): They need to do more integration testing before a release. They need to know who to contact and have to make sure the person is on call and ready for action. The person handling the issue needs to have a simple, quick way to reverse the release without manual intervention (tweaking the code). Again, this specific issue should get regression tests right away. And the most important thing - NEVER treat your customers as a test suite. Of course I'm aware not everyone can afford operating like that. But at least this could be their goal. "Let's make breakage affect more people, so we know about it earlier and when we're at work" is a really silly conclusion. |
While I'm sure there's a lot of stuff we could improve, the situation's not exactly as you describe.
Responding to a few points:
1 & 2. We do have a list of "if this breaks, contact this guy." What we don't have (in response to your first point) is a demand that those people be available Sunday night.
3. We have a known rollback procedure. It does not work if we do an irreversible schema change and the problem's not caught until 20 hours later. We couldn't just throw out 20 hours of data.
4. We actually do a lot of testing. Beginning on Wednesday, we deploy to our early leak accounts. We steadily increase that through the week. The problem with this particular bug is that you could use Kiln lightly (most of our test accounts are not large accounts) without hitting this problem at all. Even the full QA test suite did not trigger the problem. That happened because Kiln was designed to keep working in the case of a FogBugz communication failure until it couldn't, which was directly proportional to how much you used Kiln. The real problem here, which has been fixed, is that Kiln should not attempt to hide a problem communicating with FogBugz.
5. We don't do releases in the middle of the day. We do them at 10 PM. I have no idea where you got that.
There's a lot we can improve. We need to make sure that when Kiln can't talk to FogBugz, which can bring down Kiln, it hard-fails instead of trying to continue. We need to make sure that all hands are on deck when people are going to work, as you noted, which is vastly easier to do midweek than Sunday night. And we probably ought to add more automated testing to the integration points. But I think you're painting a somewhat unfair picture of the current situation.
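To make the hard-fail point concrete, here is a rough sketch of the difference, not our actual code. Everything in it is hypothetical: `notify_tracker` stands in for a cross-service call, and the bounded backlog is just one assumed way a service can "keep working until it can't." The lenient version survives light use and only falls over after enough traffic; the strict version fails on the first call, which is what a light test run would catch.

```python
class TrackerUnavailable(Exception):
    """Raised when the (hypothetical) issue-tracker call fails."""


def notify_tracker(event, tracker_up):
    # Stand-in for a cross-service call; not a real Kiln or FogBugz API.
    if not tracker_up:
        raise TrackerUnavailable(f"could not deliver {event!r}")
    return "ok"


class LenientService:
    """Hides the failure: buffers undelivered events until the buffer is full."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed limit, purely illustrative
        self.backlog = []

    def push(self, event, tracker_up):
        try:
            notify_tracker(event, tracker_up)
        except TrackerUnavailable:
            if len(self.backlog) >= self.capacity:
                raise RuntimeError("backlog full; service is down")
            self.backlog.append(event)  # the problem stays invisible for now
        return "accepted"


class StrictService:
    """Hard-fails: the first failed call surfaces immediately."""

    def push(self, event, tracker_up):
        notify_tracker(event, tracker_up)  # TrackerUnavailable propagates
        return "accepted"


def pushes_until_failure(service, tracker_up, limit=1000):
    """Count how many pushes succeed before the service raises."""
    for n in range(limit):
        try:
            service.push(f"event-{n}", tracker_up)
        except Exception:
            return n
    return limit
```

With the tracker down, `pushes_until_failure(LenientService(capacity=5), False)` lets five pushes through before breaking, so a lightly used account never sees the problem, while `pushes_until_failure(StrictService(), False)` fails on the very first push.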