by m12k 2185 days ago
What I don't get is how they can be down for hours on end. If I make a deploy that turns out to be buggy, I'll roll my site back to a stable previous version within minutes. Sure, that's not always possible if I've made incompatible database schema changes, but in my experience those are very, very rare (i.e. I almost always add db columns, and only rarely delete or rename columns - and when I do, I do so after those columns haven't been in use for a while).
3 comments

I'm sure these downtimes are more about infrastructure problems than just database updates, and I'm also sure that this infrastructure is a little more complex than your site.
I'm sure their deployment process is way more complex than mine. I'm also sure that unlike me, they have engineers dedicated to managing the complexity. GitHub is a company that has blogged about how easy they have made deployment and how they deploy many times a day. I don't believe there are any compelling reasons why that ease of deployment shouldn't also extend to re-deploying previous versions of their service, so they don't need to leave a bad version up for hours.
I think you're missing the point a bit. Not all changes are simple code deploys. They take time to diagnose, mitigate and fix. This is especially true when you have a lot of services that talk to each other, working on infrastructure that can support so many services and users.
It’s often not as easy as deploy -> immediate problem -> rollback. Problems can take a while to diagnose, or may cause some kind of poisoning that needs to be fixed (eg rebuild a lost or corrupt cache), or be in some part of the system that nobody knew was related (eg maybe someone deployed code that talks to a hitherto-unqueried accounting system and that worked fine at 4pm on Thursday but come 9am Monday it melts).

My point is that in big complex systems sometimes there is not a straight line between cause and effect. Sometimes there’s just effect and you need to work out the cause.

> If I make a deploy that turns out to be buggy

Unless your deploy reconfigures some networking component that makes a large part of your network inaccessible. Then you need to fix the network issue before you can roll back to a previous version. That may require someone driving up to a datacenter and logging into a racked server.

And then you may need to restore data if the network misconfiguration caused data to be corrupted somehow (I admit this is getting a bit worst-possible-case-scenario) and, if the data got crossed, so that one client could see data from another, you'll need to prevent access until you are sure everything is where it should be.

Finally, depending on your scale, the deploy of a new version can take a long time by itself. People often deploy new features deactivated, then, when the whole fleet is updated, activate features to different groups and monitor for breaking behavior change.

> People often deploy new features deactivated, then, when the whole fleet is updated, activate features to different groups and monitor for breaking behavior change.

Even better - then they shouldn't even need to make another deploy, just flip the feature flag back off. And if you need to make network changes, then test those out behind a load balancer in parallel to the existing topology, so you can start routing more traffic to the new setup, but can stop doing so if any problems arise. I'm not saying any of this is trivial, but the point is, best practices exist to start deploying pretty much any kind of change in a way that can be undone in minutes or even seconds. When you have access to the resources and talent that GitHub has, then there are zero acceptable reasons why your site would ever be down or degraded for hours on end - zero.
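
To make the "flip the flag back off" idea concrete, here is a minimal Python sketch of a percentage-based feature flag. Everything in it is hypothetical (the `FeatureFlag` class, the flag name `new_diff_view`, the in-memory storage); a real system would presumably keep the rollout percentage in a shared config store so that every server sees the change at once.

```python
import hashlib


class FeatureFlag:
    """Hypothetical in-memory flag, used only to illustrate the idea."""

    def __init__(self, name, rollout_percent=0):
        self.name = name
        self.rollout_percent = rollout_percent  # 0 means off for everyone

    def is_enabled(self, user_id):
        # Hash the flag name plus user id so each user lands in a stable
        # bucket from 0 to 99, independent of other flags.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


# The code ships "dark": deployed everywhere, enabled for nobody.
flag = FeatureFlag("new_diff_view")

# Widen the rollout to roughly 10% of users and watch the metrics.
flag.rollout_percent = 10
enabled_at_10 = sum(flag.is_enabled(user_id) for user_id in range(1000))

# Something looks wrong: turn it off without deploying anything.
flag.rollout_percent = 0
enabled_after_kill = sum(flag.is_enabled(user_id) for user_id in range(1000))
```

Because the bucket is derived from a hash rather than a random draw, a given user stays in or out of the rollout consistently as the percentage grows. Note that this only undoes the code path; it does nothing for data that the feature already wrote while it was on.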

There are countless ways for infrastructure to break on its own, without being tied to a specific deploy or feature flag. A few common examples in the db tier alone:

Have you ever encountered a write rate that exceeds your db replica's ability to keep up with async replication? There's nothing to "roll back" in this case, and it takes time to determine whether the increase in write rate is from legitimate usage growth vs some recent feature (possibly deployed hours/days ago) writing more than expected during peak periods vs DDoS/bot activity.

Have you worked on multi-region infrastructure, where traffic is actively served from multiple geographic regions, with fully automated failover during regional outages? It is impossible to fully automate every possible situation -- even Google and Facebook have outages sometimes! Even just as a first step, it's hard to figure out conclusively which situations should be automated vs which ones need to alert humans.

Have you ever implemented read-after-write consistency for multi-region infrastructure, where multiple async DB replicas, caches, and backend file stores are not automatically in sync, but need to appear in sync to users making writes from non-master regions? The network latency between regions is sufficient to make this a complicated problem even when things are stable, let alone when there are other sources of replication lag to consider. There's no "out of the box" solution for this; every company needs to handle it in a way specific to their infrastructure and product.
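
As an illustration only, one commonly used approach to read-after-write consistency is to pin a user's reads to the primary for a short window after that user writes. The Python sketch below shows the idea under loud assumptions: the `ReadRouter` class, the 5-second window, and the `"primary"`/`"replica"` labels are all made up for the example, and the write timestamps live in process memory, whereas a multi-region setup would need to carry them somewhere shared, such as the user's session.

```python
import time


class ReadRouter:
    """Hypothetical router: send reads to the primary shortly after a write."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # assumed upper bound on replica lag
        self.clock = clock              # injectable so the example is deterministic
        self._last_write = {}           # user_id -> time of that user's last write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def choose_backend(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            # The replica may not have this user's write yet.
            return "primary"
        return "replica"


# Deterministic walk-through using a fake clock.
now = [0.0]
router = ReadRouter(pin_seconds=5.0, clock=lambda: now[0])

router.record_write("alice")
right_after_write = router.choose_backend("alice")  # inside the pin window
other_user = router.choose_backend("bob")           # never wrote, no pin

now[0] = 6.0
after_window = router.choose_backend("alice")       # window has expired
```

The weakness is exactly the one described above: the fixed window only holds while replication lag stays below it, so a lagging replica quietly breaks the guarantee, and caches and file stores each need their own equivalent.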

Have you ever implemented a realistic dev/test environment for a massive infrastructure involving dozens to hundreds of services, and many different data stores, some of which are sharded? Again, no "out of the box" solution exists. You need to do something custom, and there will be plenty of cases where it doesn't accurately mirror production.

Or for a non-technical one: have you ever worked for a medium-to-large size company whose exit was via acquisition, rather than IPO? In my experience this always results in a major increase in attrition of the acquired company's top engineers. With an IPO, early folks are more incentivized to stay on; there's a better feeling of ownership, and the efforts of good talent can directly impact the stock price. But when it's an acquisition by some corporate behemoth, the opposite dynamic is at play: there's very little that the acquired company can do to impact the parent company's stock price, leading to a feeling of helplessness. Couple that with different policies and values mindset (say, a contract with a government agency that puts children in cages) and you can guess what happens.

Which commit would you roll back to if the integer id of your comments table overflows? There are many bugs that are more complex than a buggy commit.
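
For a sense of scale, a signed 32-bit integer id tops out at 2^31 - 1 = 2,147,483,647. The short Python sketch below estimates how much headroom is left; the function name `id_headroom` and the numbers fed into it are invented for the example, not figures from any real incident.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer column can hold


def id_headroom(current_max_id, ids_per_day, limit=INT32_MAX):
    """Return (ids remaining, days until the limit at the current rate)."""
    remaining = limit - current_max_id
    days_left = remaining / ids_per_day if ids_per_day > 0 else float("inf")
    return remaining, days_left


# Hypothetical table: highest id is 2 billion, growing by 1 million ids per day.
remaining, days_left = id_headroom(2_000_000_000, 1_000_000)
```

With those made-up inputs there are 147,483,647 ids left, or about 147 days. Nothing in the deploy history helps when that runs out: the usual fix is to widen the column to a 64-bit type, which is itself a slow schema change on a large table.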