HN Mirror

by swombat 5578 days ago

Wait, what?

You have large numbers of paying customers to whom you're delivering a mission-critical system (source control isn't exactly optional), and your releases involve neither automated production monitoring/continuous deployment nor formal release procedures?

I think your problem is more than just weekend deployments!

My full comments here: http://swombat.com/2011/3/8/fog-creek-dont-do-cowboy-deploym...

7 comments

gecko 5578 days ago

The releases are both automated (except for one component, as noted, which we are now automating), and are fully vetted.

Here is the old release process:

1. Monday morning, the version to be used for the next release is automatically built for the QA team, who begins running their test suites on it and doing soft checks.

2. By no later than Wednesday, the new version is leaked to testing and alpha accounts on Fog Creek On Demand. Tests are re-run at this point.

3. The leak is increased later in the week if the QA results look good, or the weekend release is canceled, depending on how testing goes.

4. Provided everything has been good, on Saturday night, the leak is increased to 100% of customers. This step does not have a full QA rundown, because the code has already been vetted several times by QA at this point. The sanity checks are truly sanity checks.

5. At the same time, we monitor that our monitoring system (Nagios) agrees that all accounts are online and that there are no major problems, such as massive CPU spikes.
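A staged leak like the one in steps 2 through 4 can be sketched roughly as follows. This is a minimal illustration, not Fog Creek's actual tooling: the function names `in_leak` and `next_leak`, the hash-bucket scheme, and the stage percentages are all assumptions made for the example.

```python
import hashlib


def in_leak(account_id: str, percent: int) -> bool:
    """Deterministically place an account in one of 100 buckets and
    report whether its bucket falls inside the current leak percentage."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def next_leak(percent: int, qa_passed: bool, stages=(1, 10, 50, 100)) -> int:
    """Advance to the next (hypothetical) stage if QA passed;
    return 0 to signal that the release is canceled otherwise."""
    if not qa_passed:
        return 0
    for stage in stages:
        if stage > percent:
            return stage
    return 100
```

Because the bucket depends only on the account ID, an account that received the new version at a small percentage keeps it as the leak widens.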

So far, so good. The issue with this release is we had a bug that did not manifest for a while, because Kiln had been deliberately designed to ignore the failure condition "as long as possible", which ended up just being too damn long. Once we started having failures, we noticed--that's why our sysadmin called us in--but those failures started happening 20 hours after the 100% release, and several days after testing and alpha accounts were upgraded.

I am not arguing our system is perfect, but I'm a bit nonplussed about where the your-deployment-system-totally-sucks stuff is coming from. I'll ask our build manager to post an even more detailed rundown.

bluesnowmonkey 5578 days ago

Sincere question: how do you leak irreversible schema changes to a subset of accounts? Isn't the point of the leak that you're not confident and might need to reverse it? Or are you willing to let those accounts get hosed?

tedunangst 5578 days ago

Fix it by hand. If it's ten accounts, that's pretty easy. If it's ten thousand, more of a problem.

When you read irreversible, think "very difficult to reverse and not worth the cost of writing and validating code we don't ever expect to run."

mnutt 5578 days ago

Perhaps have Kiln send notifications on the failure conditions even if it doesn't throw an error? Better a few false positives than no indication at all.
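One way to picture this: a retry wrapper that keeps swallowing the failure but reports every swallowed attempt past a threshold. This is only a sketch. The name `call_with_retries`, the `warn_after` threshold, and the logger name are hypothetical and say nothing about how Kiln is actually built.

```python
import logging

log = logging.getLogger("kiln.sketch")  # hypothetical logger name


def call_with_retries(operation, max_attempts=5, warn_after=2, notify=log.warning):
    """Retry `operation`, tolerating failures, but notify on every
    failed attempt from `warn_after` onward instead of staying silent."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            if attempt >= warn_after:
                notify("attempt %d/%d failed: %r", attempt, max_attempts, exc)
    # Every attempt failed: surface the last error rather than hiding it.
    raise last_error
```

The caller still gets the tolerant behavior, but monitoring sees the failures long before the retries run out.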

dpritchett 5578 days ago

Maybe in the future we can all be IMVU:

Back to the deploy process, nine minutes have elapsed and a commit has been greenlit for the website. The programmer runs the imvu_push script. The code is rsync’d out to the hundreds of machines in our cluster. Load average, cpu usage, php errors and dies and more are sampled by the push script, as a baseline. A symlink is switched on a small subset of the machines throwing the code live to its first few customers. A minute later the push script again samples data across the cluster and if there has been a statistically significant regression then the revision is automatically rolled back. If not, then it gets pushed to 100% of the cluster and monitored in the same way for another five minutes. The code is now live and fully pushed. This whole process is simple enough that it’s implemented by a handful of shell scripts.

http://timothyfitz.wordpress.com/2009/02/10/continuous-deplo...
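The baseline, canary, compare, rollback loop in that quote can be sketched as below. This is not IMVU's script: `regressed`, `push`, and the callbacks are hypothetical names, and the comparison is a crude "mean plus a few standard errors" threshold standing in for whatever statistical test they actually use. The five-minute monitoring window after the full push is left out.

```python
from statistics import mean, stdev


def regressed(baseline, canary, sigmas=3.0):
    """True if the canary mean exceeds the baseline mean by more than
    `sigmas` standard errors (a simple stand-in for a significance test)."""
    spread = stdev(baseline) if len(baseline) > 1 else 0.0
    stderr = spread / (len(canary) ** 0.5)
    return mean(canary) > mean(baseline) + sigmas * stderr


def push(sample, go_live, rollback, canary_hosts, all_hosts):
    """Sample a baseline, go live on the canaries, then either roll
    back or continue to the whole cluster."""
    baseline = sample(all_hosts)       # metrics before any host is switched
    go_live(canary_hosts)              # e.g. flip the symlink on a few hosts
    if regressed(baseline, sample(canary_hosts)):
        rollback(canary_hosts)
        return "rolled back"
    go_live(all_hosts)
    return "live"
```

Here `sample` is assumed to return a list of numbers for one metric (say, errors per minute) where higher is worse.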

aprrrr 5578 days ago

In fairness, that's a description of how a routine and successful build "should" go. I bet if IMVU were to post a blow-by-blow account of their hairiest deployment screwup ever, it would be a good bit more colorful than that.

There are some headscratchers in the description of the Fogbugz problem, but kudos to them for explaining how and why things broke.

tjarratt 5578 days ago

Tim Fitz' blog is a great source of continuous deployment done right and finding useful information in a sea of chaos.

I remember talking to him before and after he wrote some of these blog posts and it was fascinating seeing how his attitude regarding failure changed.

sghael 5578 days ago

I agree. My other thought was 'isn't there a staging server in there somewhere?' Something that is near identical to production, with fake production data, etc, that could surface the problem before a customer sees it.

btw, props to Fog Creek and OP for airing their dirty laundry. They take some heat, but in the end we all learn from it.

gecko 5578 days ago

We have more than staging servers: we have staging accounts. I documented our full release process at http://news.ycombinator.com/item?id=2301680.

kamens 5578 days ago

It's stunning how easy it is to spot a specific lack of "automated production monitoring" after something fails. Hey idiot, you should've been testing that thing!

I've seen all of Fog Creek's automated production monitoring courtesy of their sysadmins and devs as it was months ago, and it was very solid. I'm sure it's only gotten better.

This is a case of a specific deployment failure slipping through the cracks and being honestly explained, apologized for, and rectified. I'm obviously biased due to my history (and probably-justified guilt for this particular failure), but shotgun criticism about formal release procedures is very misguided.

brown9-2 5578 days ago

Two better approaches come to mind to resolve this:

2. Full-on, properly managed releases like they do in large IT corporations, such as banks, where a "release" is not something you kick off from home via SSH on a Saturday night, but a properly planned effort that involves critical members of the dev team as well as the QA team being present and ready to both test the production system thoroughly and fix any issues that may occur.

What you describe in #2 here sounds like a complete anti-pattern when compared with the idea of continuous deployment and automated verification. This 2nd approach sounds like a huge manual effort.

swombat 5578 days ago

It absolutely is, and I'd be surprised to see this kind of effort from any but the most paranoid corporations (like, as I mentioned, banks). Automation and continuous deployment are definitely the way forward.

But even this gargantuan effort is a better option than just "let's deploy and wait for our users to tell us if anything has gone wrong".

brown9-2 5578 days ago

But even this gargantuan effort is a better option than just "let's deploy and wait for our users to tell us if anything has gone wrong".

To be fair, it sounds from the original article like they did do some verification that things were working after the deployment. However, for some reason their verification tests didn't reveal the presence of a real bug.

Even in a more gargantuan system, it's possible to have tests that give false positive results.

Everyone will screw up releases at some point; the key is to be able to learn from them and get better.

peterwwillis 5578 days ago

If you're making a big change you first cut a CR and get approval of any teams involved. At change time, everyone knows they need to be on-call if something breaks, preferably in a live chatroom.

The rest of the time devs should just deploy when they think the code is ready and have tested it on a like-production box. They then manually verify the change worked. You use automated monitoring to ensure when something does break you are notified immediately.

rst 5578 days ago

Their release procedures didn't cover the case, and they're fixing it ("modifying the communication ... [to] fail early and loudly during our initial tests", according to the "with details" post on their status blog[1]).

But I still find their lack of monitors... disturbing.

[1] http://status.fogcreek.com/2011/03/sunday-night-kiln-outage-...

peterwwillis 5578 days ago

I agree except I don't think continuous deployment necessarily means automatic deployment. Every deploy should be done by a person and tested right after; none of this "push out all commits at X time" or "push as soon as it's committed" as both are risky.

Deploying during the day is usually preferred, and never at 4:59PM on a Friday or right before everyone goes to lunch (ever had to clean up a downed cluster when some jerk pushed bad code and the whole team went to Sweet Tomatoes? yeah).

To help troubleshoot breaks, have a mailing list with changelogs showing who made a change, time/date, files touched. Also have your deploy tools mail it when there's a code push, rollback, server restart, etc. Have a simple tool someone can run to revert changes back to a time of day so if something breaks just "revert back to 6 hours ago" and debug while your old app is running (nice to take one broken box offline first to test on).
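The core of such a "revert back to 6 hours ago" tool is just a lookup over the deploy log. A rough sketch, assuming the log is available as `(deployed_at, revision)` pairs; the function name `release_to_restore` and that log format are made up for illustration:

```python
from datetime import datetime, timedelta


def release_to_restore(deploys, now, hours_ago):
    """Return the revision that was live `hours_ago` hours before `now`.

    `deploys` is a list of (deployed_at, revision) tuples in any order.
    """
    cutoff = now - timedelta(hours=hours_ago)
    earlier = [d for d in deploys if d[0] <= cutoff]
    if not earlier:
        raise LookupError("no deploy recorded before the cutoff")
    # The most recent deploy at or before the cutoff was the live one.
    return max(earlier, key=lambda d: d[0])[1]
```

The actual rollback (switching code, restarting servers, mailing the list) would then be handed this revision.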