HN Mirror

by mechanical_fish 6453 days ago

If you have the agility to make rapid production changes, you also have the ability to rapidly rollback.

This is just not true. Rollbacks are always more expensive than changes, because you can't rewind time to undo the consequences of having your software be broken for minutes, hours, or days. Worse, in the absence of "checks", the cost of making a production change tends to be roughly constant as the company grows -- it takes the Amazon sysadmin no more time to type "make deploy" than it does me -- but the cost of a rollback scales directly with the size of your company's customer base.

Within a few seconds after Amazon.com breaks S3, thousands of companies begin to lose money, and they lose money second by second until the rollback happens. Even if Amazon is only down for a minute, that's one minute of downtime multiplied by its number of customers. The larger the customer base, the larger the stakes.

And, unfortunately, the cost of downtime is nonlinear. If Amazon goes down for a mere two minutes, hundreds of peacefully sleeping system administrators will get emergency pages from their uptime-monitoring systems. They will get out of bed. They will check their logs and their failover mechanisms. They will lose a lot of sleep, and soak up a bunch of overtime pay, and a lot of their good will towards Amazon will dissipate like the morning dew. Once you lose your reputation for quality it takes a lot of work to get it back.

This is why larger companies have more controls. The controls are in place to try and pass the ever-increasing cost of a rollback back to the team that causes the rollbacks. The reason it seems so gosh-darned expensive to add a trivial feature to your flagship app is that it is expensive: If the average rollback costs $1m in revenue and every new feature is only 95% reliable, every new feature costs the company $50k to deploy.
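The $50k figure is just an expected value: the chance that a feature forces a rollback times what a rollback costs. A minimal sketch of that arithmetic, assuming a single failure mode and a fixed rollback cost (the function name and parameters are made up for illustration):

```python
def expected_deploy_cost(rollback_cost: float, reliability: float) -> float:
    """Expected rollback cost carried by each deployed feature.

    Assumes one failure mode per feature and a fixed cost per rollback;
    both are simplifications for illustration.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    failure_probability = 1.0 - reliability
    return rollback_cost * failure_probability


# Numbers from the comment above: a $1m rollback, features 95% reliable.
cost = expected_deploy_cost(1_000_000, 0.95)
print(f"${cost:,.0f}")  # prints "$50,000"
```

Raising reliability from 95% to 99% in this model cuts the per-feature cost to $10k, which is roughly the trade that extra controls are trying to buy.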

The secret here is: If you want to deploy changes rapidly, don't work on a product that has a lot of uptime-sensitive customers! Start a different product line, or start a beta program, or found a smaller company.

dcurtis 6453 days ago

S3 is a really bad example because they provide infrastructure. Their customers actually see their entire site go down. Those kinds of companies are the exception. I hope Heroku has rigorous testing and scrutinizes every change, even though they are a startup.

Let's say I own a video site and I want to add threaded comments. If I have 5 users and the site goes down for 5 minutes, those 5 users will get 5 minutes each of annoyance. If I have a million users, each of those users will get 5 minutes of annoyance each also. There is no difference to the user there. So, by adding more checks to make sure the site doesn't go down for 5 minutes when you have more users, you're saying the more users you have, the more important each user becomes. I think that's a strange way of thinking.

(The same is true here of an infrastructure service -- if S3 had 5 users and were more cavalier about their release schedule and broke something, those 5 users would experience the exact same net effects of downtime as if S3 had 5 million users.)

The awesome benefit of getting threaded comments developed, tested briefly, and pushed in one evening is worth the risk of 5 minutes of downtime compared to the 2 weeks of rigorous testing and approval-by-committee. No matter how many users you have.

mechanical_fish 6453 days ago

I used an infrastructure site as an example because the value proposition is easy to understand when you use a site that has a clear and simple monetization strategy. Video sharing sites are arguably an even worse example than S3, because the value of uptime is so hard to perceive or compute. It's likely that even Twitter doesn't understand the true value of a customer-hour of Twitter uptime, because the site isn't monetized and so much of the value is concentrated in the brand. Measuring that is like voodoo, only less empirical. ;)

If I have 5 users and the site goes down for 5 minutes, those 5 users will get 5 minutes each of annoyance. If I have a million users, each of those users will get 5 minutes of annoyance each also. There is no difference to the user there.

No, but there is a big difference for you! If a user is worth a dollar per year, the five-user site is worth five bucks per year, but the million-user site is worth a million bucks. If each patch to your code causes 0.1% of users to abandon your product (a number which depends on the odds that a patch will cause a rollback, and on the odds that a rollback will annoy a user enough to make them leave), patching a 5-user site costs you half a cent per year on average (most likely it has no perceptible cost, since odds are no users will leave) but each patch to a million-user site costs you $1000 per year in revenue. And that's just the linear cost. There are nonlinear consequences: one or zero annoyed users is nothing to worry about -- unless that user is Michael Arrington -- but a clique of 1000 annoyed users is potentially a movement: a critical mass of people who will all start complaining about your company on Twitter on the same day, potentially costing you your next 10,000 or 100,000 or 1 million users while simultaneously empowering your competitors, who may begin building the site that will take you down by poaching those dissatisfied users.
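The linear part of that cost can be sketched in a few lines. This assumes each user is worth the same amount and leaves independently with the same probability per patch; those assumptions, and the function names, are illustrative only, and the nonlinear "movement" effect is not modeled at all.

```python
def expected_patch_cost(users: int, revenue_per_user: float, churn_rate: float) -> float:
    """Expected yearly revenue lost per patch (linear model only)."""
    return users * revenue_per_user * churn_rate


def probability_no_one_leaves(users: int, churn_rate: float) -> float:
    """Chance that a patch costs nothing, assuming users churn independently."""
    return (1.0 - churn_rate) ** users


CHURN_PER_PATCH = 0.001  # the 0.1% figure from the comment above
REVENUE_PER_USER = 1.0   # one dollar per user per year

for users in (5, 1_000_000):
    cost = expected_patch_cost(users, REVENUE_PER_USER, CHURN_PER_PATCH)
    p_free = probability_no_one_leaves(users, CHURN_PER_PATCH)
    print(f"{users:>9,} users: expected cost ${cost:,.3f}/yr, "
          f"P(no one leaves) = {p_free:.3f}")
```

For the 5-user site the expected cost is half a cent and the chance that nobody leaves is about 99.5%; for the million-user site the expected cost is $1000 and the chance that nobody leaves is effectively zero.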

This is just the flip side of scalability. As a programmer you enjoy mighty economies of scale: Running a site with a million users is more expensive than running a single-user site, but it is much less than a million times as expensive. But this leverage also applies to your mistakes: a mistake that costs you a dollar when your site is small might cost you $1,000,000 when your site is big. And it's the same mistake! Typos are just as easy to make on big sites as on small ones.

Obviously, this doesn't mean that you shouldn't ever change the site. Presumably each and every one of your patches is valuable, and will bring in revenue to pay for its own insurance premiums. Right? :) But you do need to think about that calculation, because you do occasionally make mistakes. As your userbase grows, you may wish to test each patch on a subset of users to be sure they will really like it, and that the additional revenue is really going to be there. You may wish to institute tests and internal audits that lower the risk of rollbacks, or failover mechanisms to lower the cost of rollbacks. And before long, lo, you will be that which you deplore: A company with a bunch of annoying internal controls! But at least you'll have revenue to console yourself with.

fallentimes 6453 days ago

But I think what Dustin is saying (correct me if I'm wrong) is that the multiplier applies both ways. And that the total cost of making a 5 minute downtime mistake, even to a million users, could easily be outweighed by the benefits of releasing a product/feature/site 2 weeks early. In most cases, I think large companies are risk averse instead of risk neutral to situations like this.

I agree with both of you that it varies considerably based on what the site does (infrastructure, videos, games, etc).

mechanical_fish 6453 days ago

In most cases, I think large companies are risk averse instead of risk neutral to situations like this.

I'm not going to argue with that. Just because a certain increase of caution is rational doesn't mean that caution isn't being overapplied in many cases, just as PG suggests in his original post.
