by gecko 5578 days ago
EDIT: I posted a rundown of our deployment process, including where and how tests happen, and why they failed to catch this bug, at http://news.ycombinator.com/item?id=2301680 .

While I'm sure there's a lot of stuff we could improve, the situation's not exactly as you describe.

Responding to a few points:

1 & 2. We do have a list of "if this breaks, contact this guy." What we don't have (in response to your first point) is a demand that those people be available Sunday night.

3. We have a known rollback procedure. It does not work if we do an irreversible schema change and the problem's not caught until 20 hours later. We couldn't just throw out 20 hours of data.

4. We actually do a lot of testing. Beginning on Wednesday, we deploy to our early leak accounts. We steadily increase that through the week. The problem with this particular bug is that you could use Kiln lightly (most of our test accounts are not large accounts) without hitting this problem at all. Even the full QA test suite did not trigger the problem. That happened because Kiln was designed to keep working in the case of a FogBugz communication failure until it couldn't, which was directly proportional to how much you used Kiln. The real problem here, which has been fixed, is that Kiln should not attempt to hide a problem communicating with FogBugz.

5. We don't do releases in the middle of the day. We do them at 10 PM. I have no idea where you got that.

There's a lot we can improve. We need to make sure that when Kiln can't talk to FogBugz, which can bring down Kiln, it hard-fails instead of trying to continue. We need to make sure that all hands are on deck when people are going to work, as you noted, which is vastly easier to do midweek than Sunday night. And we probably ought to add more automated testing at the integration points. But I think you're painting a somewhat unfair picture of the current situation.
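
To make the hard-fail idea concrete, here is a minimal sketch of the general pattern, not Kiln's actual code. Every name in it (`DependencyUnavailable`, `call_dependency`, and the `fetch` and `alert` callables) is hypothetical and exists only for illustration. The point is that a communication failure with a required service gets reported and re-raised right away instead of being swallowed.

```python
class DependencyUnavailable(RuntimeError):
    """Raised when a required upstream service cannot be reached."""


def call_dependency(fetch, alert):
    """Call a required dependency; on a communication error, alert and hard-fail.

    `fetch` is any zero-argument callable that talks to the upstream service.
    `alert` is any one-argument callable that records or sends a message.
    Both are hypothetical stand-ins, not part of any real Kiln or FogBugz API.
    """
    try:
        return fetch()
    except (ConnectionError, TimeoutError) as exc:
        # Surface the problem immediately rather than continuing in a
        # degraded state that only breaks later under heavier use.
        alert(f"dependency unreachable: {exc}")
        raise DependencyUnavailable(str(exc)) from exc
```

With this shape, a caller that passes in a `fetch` raising `ConnectionError` gets exactly one alert and a `DependencyUnavailable` exception at the first failed call, so the failure shows up during light testing too, not only under heavy use.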

1 comment

> What we don't have (in response to your first point) is a demand that those people be available Sunday night.

Boring, un-agile places have concepts like 24/7 rosters of operations staff and the ability to rotate "on call" duty amongst developers.

However, I agree with your conclusion that irreversible rollouts are best performed during (your) daylight hours.

> The real problem here, which has been fixed, is that Kiln should not attempt to hide a problem communicating with FogBugz.

Question: why wasn't an alert raised immediately?