| If you have the agility to make rapid production changes, you also have the ability to rapidly rollback. This is just not true. Rollbacks are always more expensive than changes, because you can't rewind time to undo the consequences of having your software be broken for minutes, hours, or days. Worse, in the absence of "checks", the cost of making a production change tends to be roughly constant as the company grows -- it takes the Amazon sysadmin no more time to type "make deploy" than it does me -- but the cost of a rollback scales directly with the size of your company's customer base. Within a few seconds after Amazon.com breaks S3, thousands of companies begin to lose money, and they lose money second by second until the rollback happens. Even if Amazon is only down for a minute, that's one minute of downtime multiplied by its number of customers. The larger the customer base, the larger the stakes. And, unfortunately, the cost of downtime is nonlinear. If Amazon goes down for a mere two minutes, hundreds of peacefully sleeping system administrators will get emergency pages from their uptime-monitoring systems. They will get out of bed. They will check their logs and their failover mechanisms. They will lose a lot of sleep, and soak up a bunch of overtime pay, and a lot of their good will towards Amazon will dissipate like the morning dew. Once you lose your reputation for quality it takes a lot of work to get it back. This is why larger companies have more controls. The controls are in place to try and pass the ever-increasing cost of a rollback back to the team that causes the rollbacks. The reason it seems so gosh-darned expensive to add a trivial feature to your flagship app is that it is expensive: If the average rollback costs $1m in revenue and every new feature is only 95% reliable, every new feature costs the company $50k to deploy. 
The secret here is: If you want to deploy changes rapidly, don't work on a product that has a lot of uptime-sensitive customers! Start a different product line, or start a beta program, or found a smaller company. |
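The $50k figure in the quote is an expected-value calculation: the cost of a rollback multiplied by the chance that a feature needs one. A minimal sketch of that arithmetic, where the function and parameter names are hypothetical and the only inputs are the quote's own numbers ($1m per rollback, 95% reliability, so a 5% failure rate):

```python
def expected_deploy_cost(rollback_cost_usd: int, failure_rate_percent: int) -> float:
    """Expected cost of shipping one feature, counting only rollback losses.

    Both parameters are illustrative assumptions: the quote supplies
    $1,000,000 per rollback and a 5% chance that a feature fails.
    """
    return rollback_cost_usd * failure_rate_percent / 100


# The quote's numbers: $1m per rollback, features that are 95% reliable.
cost = expected_deploy_cost(rollback_cost_usd=1_000_000, failure_rate_percent=5)
print(f"Expected cost per feature: ${cost:,.0f}")
```

This only models the rollback side of the ledger; it says nothing about what the feature is worth once it ships.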
Let's say I own a video site and I want to add threaded comments. If I have 5 users and the site goes down for 5 minutes, each of those 5 users gets 5 minutes of annoyance. If I have a million users, each of them also gets 5 minutes of annoyance. From the individual user's point of view there is no difference. So by adding more checks to keep the site from going down for 5 minutes as you get more users, you're saying that the more users you have, the more important each user becomes. I think that's a strange way of thinking.
(The same is true of an infrastructure service: if S3 had 5 users, were more cavalier about its release schedule, and broke something, each of those 5 users would experience the same effects of downtime as each user would if S3 had 5 million users.)
The awesome benefit of getting threaded comments developed, briefly tested, and pushed in one evening is worth the risk of 5 minutes of downtime, compared with 2 weeks of rigorous testing and approval-by-committee, no matter how many users you have.
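The two views in this thread can be put side by side with a rough sketch. It assumes a hypothetical `downtime_impact` helper and measures "annoyance" simply as minutes of outage, which is an illustrative simplification. The per-user figure stays the same at any scale, while the total across all users grows with the user count:

```python
def downtime_impact(users: int, outage_minutes: int) -> dict:
    """Compare what one user feels with the total across the user base.

    'Annoyance' is modeled as minutes of outage, an assumption made
    purely for illustration.
    """
    return {
        "per_user_minutes": outage_minutes,            # same for everyone
        "total_user_minutes": users * outage_minutes,  # scales with users
    }


# The two cases from the comment: 5 users vs. a million, 5 minutes down.
for users in (5, 1_000_000):
    impact = downtime_impact(users, outage_minutes=5)
    print(users, impact["per_user_minutes"], impact["total_user_minutes"])
```

Which of the two numbers should drive the release process is exactly the disagreement here; the sketch only shows that they diverge.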