by viraptor 5578 days ago
I can't believe how misguided this post is...

1. They provide a service to people around the world, yet they don't ensure that someone is available as an emergency contact on Sunday evening, when the first post-deployment usage happens.

2. They don't have a universal list of "this breaks, contact that guy".

3. They don't have a known instant rollback procedure for a release.

4. They don't have cross-component integration tests and they don't do them manually either.

5. They decide that since they can't do a release that doesn't break stuff and can't organise themselves to resolve it quickly during the weekend when it affects only a small number of people, they'll do releases in the middle of the day now, so that they hear customers complaining right away.

Is that for real? Is he serious? Here's what I would get out of that issue (even if it's basically reiterating the "wrong" things above):

They need to do more integration testing before a release. They need to know who to contact and have to make sure the person is on call and ready for action. The person handling the issue needs to have a simple, quick way to reverse the release without manual intervention (tweaking the code). Again, this specific issue should get regression tests right away. And the most important thing - NEVER treat your customers as a test suite.

Of course I'm aware not everyone can afford to operate like that. But at least this could be their goal. "Let's make breakage affect more people, so we know about it earlier and when we're at work" is a really silly conclusion.


EDIT: I posted a rundown of our deployment process, including where and how tests happen, and why they failed to catch this bug, at http://news.ycombinator.com/item?id=2301680 .

While I'm sure there's a lot of stuff we could improve, the situation's not exactly as you describe.

Responding to a few points:

1 & 2. We do have a list of "if this breaks, contact this guy." What we don't have (in response to your first point) is a demand that those people be available Sunday night.

3. We have a known rollback procedure. It does not work if we do an irreversible schema change and the problem's not caught until 20 hours later. We couldn't just throw out 20 hours of data.

4. We actually do a lot of testing. Beginning on Wednesday, we deploy to our early leak accounts. We steadily increase that through the week. The problem with this particular bug is that you could use Kiln lightly (most of our test accounts are not large accounts) without hitting this problem at all. Even the full QA test suite did not trigger the problem. That happened because Kiln was designed to keep working in the case of a FogBugz communication failure until it couldn't, which was directly proportional to how much you used Kiln. The real problem here, which has been fixed, is that Kiln should not attempt to hide a problem communicating with FogBugz.

5. We don't do releases in the middle of the day. We do them at 10 PM. I have no idea where you got that.

There's a lot we can improve. We need to make sure that when Kiln can't talk to FogBugz, which can bring down Kiln, it hard-fails instead of trying to continue. We need to make sure that all hands are on deck when people are going to work, as you noted, which is vastly easier to do midweek than Sunday night. And we probably ought to add more automated testing to the integration points. But I think you're painting a somewhat unfair picture of the current situation.
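To make the hard-fail point concrete, here is a minimal Python sketch contrasting the two behaviours. It is not Kiln or FogBugz code: `IntegrationError`, `handle_push_hiding_failure`, `handle_push_fail_fast` and `broken_send` are all hypothetical names invented for illustration, and the "backlog" is only a stand-in for whatever state a service might accumulate while it masks a failed call.

```python
class IntegrationError(RuntimeError):
    """Hypothetical error raised when the dependent service is unreachable."""


def handle_push_hiding_failure(push, send, backlog):
    """Anti-pattern: swallow the failure and keep going.

    The caller sees success, and the problem only surfaces once the
    backlog has grown large enough to cause trouble.
    """
    try:
        send(push)
    except ConnectionError:
        backlog.append(push)  # failure is hidden from the caller
    return "ok"


def handle_push_fail_fast(push, send):
    """Surface the failure on the first broken call."""
    try:
        send(push)
    except ConnectionError as exc:
        # Chain the original exception so the root cause stays visible.
        raise IntegrationError(f"tracker unreachable for {push!r}") from exc
    return "ok"


def broken_send(push):
    """Stand-in for a remote call during an outage (assumption)."""
    raise ConnectionError("simulated outage")
```

With the first version, light usage produces a short backlog and no visible error, which matches the description of a bug that only shows up in proportion to usage. With the second, the very first call during an outage raises, so a test account or a monitoring check would notice it right away.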

> What we don't have (in response to your first point) is a demand that those people be available Sunday night.

Boring, un-agile places have concepts like 24/7 rosters of operations staff and the ability to rotate "on call" duty amongst developers.

However, I agree with your conclusion that irreversible rollouts are best performed during (your) daylight hours.

> The real problem here, which has been fixed, is that Kiln should not attempt to hide a problem communicating with FogBugz.

Question: why wasn't an alert raised immediately?

To clarify a few things:

1. Tim, our sysadmin, was the emergency contact and was on top of things ASAP, but didn't have the particular knowledge to fix it himself. That required a developer.

2. He called me first, since the problem appeared to be in Kiln. I'm a Kiln dev; I can and have fixed things on a Sunday evening after a deploy. I missed his first call, but got back to him within 3 minutes (the 30 minutes mentioned in the article was Ben's guess). I started diagnosing the problem and realized it wasn't Kiln specifically that was the problem, but something in the communication between the two. That meant we needed a FogBugz dev, which we got quickly, and possibly a deploy.

3. That led us to investigate rolling the specific account back to the previous version. 98 percent of our updates are reversible, but as Ben mentioned, this particular release included not one, but two irreversible database migrations, and since the upgrade step had run successfully, going back would not be an option.

4. All tests passed (both automated and manual). Ben has updated the article to make it clear that all but one API call between Kiln and FogBugz was working, and the one call that was broken, the one that led to this crash, is called very infrequently (on the order of months for some accounts). Yes, integration tests should have and will cover that one API call, but missing one corner case is very different than not doing integration testing.

5. Given the situation we were in, the problem will always be solved more quickly when we're in the office than when we're at home. We take every possible precaution to avoid outages, but they will still happen, and moving to mid-week deploys is just another precaution to decrease the impact of these outages if and when they occur in the future.
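As an illustration of point 3, a release can only be rolled back automatically if every migration in it can be undone. The Python sketch below shows one way to check that before deploying. It is an assumption-laden example, not the actual Kiln or FogBugz tooling: the `Migration` class, its `downgrade` field, and the migration names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Migration:
    """Hypothetical migration record; downgrade=None means irreversible."""
    name: str
    downgrade: Optional[Callable[[], None]] = None

    @property
    def reversible(self) -> bool:
        return self.downgrade is not None


def irreversible_names(migrations: List[Migration]) -> List[str]:
    """Names of the migrations that cannot be undone, in release order."""
    return [m.name for m in migrations if not m.reversible]


def can_roll_back(migrations: List[Migration]) -> bool:
    """A release is reversible only if every migration in it is."""
    return not irreversible_names(migrations)


# Example release with made-up migration names: one reversible step
# and two irreversible ones.
release = [
    Migration("add_index", downgrade=lambda: None),
    Migration("drop_legacy_column"),
    Migration("merge_tables"),
]
```

A check like `can_roll_back(release)` could be used to flag a release ahead of time, so that deploys containing irreversible steps get scheduled for a time when developers are at hand.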

So in short, yes, this is the very definition of real software, and we take this very seriously. Your bullet list of armchair quarterback suggestions grossly oversimplifies the situation. The goal is to have problems affect fewer users, which is directly affected by our response time.

It is so refreshing for any person or company to come out and explain, in detail, how they screwed up and what they are doing to ensure it doesn't happen again.

The tone of your post encourages people to cover up their mistakes for fear of ridicule, and I am against that. Some of your points are worthy of debate though.