Hacker News
by rbanffy 2181 days ago
> If I make a deploy that turns out to be buggy

Unless your deploy reconfigures some networking component in a way that makes a large part of your network inaccessible. Then you need to fix the network issue before you can roll back to a previous version. That may require someone driving up to a datacenter and logging into a racked server.

And then you may need to restore data if the network misconfiguration caused data to be corrupted somehow (I admit this is getting a bit worst-possible-case-scenario). And if the data got crossed, so that one client could see data from another, you'll need to prevent access until you are sure everything is where it should be.

Finally, depending on your scale, the deploy of a new version can take a long time by itself. People often deploy new features deactivated, then, when the whole fleet is updated, activate features to different groups and monitor for breaking behavior change.
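To make the "deploy deactivated, then activate per group" pattern concrete, here is a minimal sketch in Python. Everything in it (the `FeatureFlag` class, the `new_checkout` flag, the `staff` group) is hypothetical and in-memory; real setups usually keep flag state in a shared config service.

```python
import hashlib


class FeatureFlag:
    """Hypothetical in-memory flag, for illustration only."""

    def __init__(self, name):
        self.name = name
        self.enabled_groups = set()  # groups the feature is active for
        self.rollout_percent = 0     # share of all other users, 0-100

    def _bucket(self, user_id):
        # Stable hash so a user lands in the same bucket on every request.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_active(self, user_id, group):
        if group in self.enabled_groups:
            return True
        return self._bucket(user_id) < self.rollout_percent

    def deactivate(self):
        # Turn the feature off everywhere without shipping a new build.
        self.enabled_groups.clear()
        self.rollout_percent = 0


flag = FeatureFlag("new_checkout")
# At this point the code is on every host but nobody sees it.
flag.enabled_groups.add("staff")  # first activate for an internal group
flag.rollout_percent = 10         # then for a slice of everyone else
```

The monitoring step is the part this sketch leaves out: something has to watch error rates per group and decide when to widen the rollout or call `deactivate()`.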

1 comments

> People often deploy new features deactivated, then, when the whole fleet is updated, activate features to different groups and monitor for breaking behavior change.

Even better - then they shouldn't even need to make another deploy, just flip the feature flag back off. And if you need to make network changes, then test those out behind a load balancer in parallel to the existing topology, so you can start routing more traffic to the new setup, but can stop doing so if any problems arise. I'm not saying any of this is trivial, but the point is, best practices exist to start deploying pretty much any kind of change in a way that can be undone in minutes or even seconds. When you have access to the resources and talent that GitHub has, then there are zero acceptable reasons why your site would ever be down or degraded for hours on end - zero.
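A rough sketch of the "route more traffic to the new setup, stop if problems arise" idea, in Python. The `WeightedRouter` class, the 10% step and the 1% error threshold are all made up for illustration; a real load balancer would expose its own weighting controls.

```python
import random


class WeightedRouter:
    """Hypothetical traffic splitter between an existing and a new backend pool."""

    def __init__(self, step=10, max_error_rate=0.01):
        self.new_weight = 0  # percent of traffic sent to the new setup
        self.step = step
        self.max_error_rate = max_error_rate

    def choose(self, rng=random):
        # Pick a pool for one request according to the current weight.
        return "new" if rng.random() * 100 < self.new_weight else "old"

    def adjust(self, new_pool_error_rate):
        # Ramp up while the new pool looks healthy; otherwise send
        # everything back to the old pool in one step.
        if new_pool_error_rate > self.max_error_rate:
            self.new_weight = 0
        else:
            self.new_weight = min(100, self.new_weight + self.step)
        return self.new_weight


router = WeightedRouter()
router.adjust(0.001)  # healthy: weight goes to 10
router.adjust(0.002)  # healthy: weight goes to 20
router.adjust(0.05)   # errors spike: weight drops back to 0
```

The rollback here is a single assignment, which is the whole appeal. It only covers failures that show up as an error rate on the new pool, though.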

There are countless ways for infrastructure to break on its own, without being tied to a specific deploy or feature flag. A few common examples in the db tier alone:

Have you ever encountered a write rate that exceeds your db replica's ability to keep up with async replication? There's nothing to "roll back" in this case, and it takes time to determine whether the increase in write rate is from legitimate usage growth vs some recent feature (possibly deployed hours/days ago) writing more than expected during peak periods vs DDoS/bot activity.
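The arithmetic behind that is simple, which is part of why it hurts: the replica falls behind at the difference between the two rates, and only catches up once the write rate drops below what it can apply. A back-of-the-envelope sketch in Python, with made-up numbers and hypothetical function names:

```python
def backlog_after(write_rate, apply_rate, seconds, backlog=0.0):
    """Unapplied writes after `seconds`, given rates in writes per second."""
    return max(0.0, backlog + (write_rate - apply_rate) * seconds)


def seconds_to_catch_up(backlog, write_rate, apply_rate):
    """Time for the replica to drain `backlog`; None if it never will."""
    if backlog <= 0:
        return 0.0
    if apply_rate <= write_rate:
        return None
    return backlog / (apply_rate - write_rate)


# Illustrative numbers only: 10 minutes of 1200 writes/s against a
# replica that can apply 1000 writes/s.
behind = backlog_after(write_rate=1200, apply_rate=1000, seconds=600)
# Writes then drop to 800/s; the replica drains the backlog at 200/s.
recovery = seconds_to_catch_up(behind, write_rate=800, apply_rate=1000)
```

With these numbers the replica ends up 120,000 writes behind and needs another 10 minutes at the lower rate to recover, and none of that tells you why the write rate went up.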

Have you worked on multi-region infrastructure, where traffic is actively served from multiple geographic regions, with fully automated failover during regional outages? It is impossible to fully automate every possible situation -- even Google and Facebook have outages sometimes! Even just as a first step, it's hard to figure out conclusively which situations should be automated vs which ones need to alert humans.

Have you ever implemented read-after-write consistency for multi-region infrastructure, where multiple async DB replicas, caches, and backend file stores are not automatically in sync, but need to appear in sync to users making writes from non-master regions? The network latency between regions is sufficient to make this a complicated problem even when things are stable, let alone when there are other sources of replication lag to consider. There's no "out of the box" solution for this; every company needs to handle it in a way specific to their infrastructure and product.
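For anyone who hasn't hit this, one common starting point is to pin a user's reads to the primary for a short window after they write. A minimal Python sketch, where the `ReadRouter` class, the 5-second window and the `"primary"` / `"local_replica"` labels are all assumptions for illustration:

```python
import time


class ReadRouter:
    """Hypothetical router: pins a user's reads to the primary after a write."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds  # assumed upper bound on replication lag
        self.clock = clock
        self.last_write = {}            # user_id -> time of most recent write

    def record_write(self, user_id):
        self.last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and self.clock() - wrote_at < self.pin_seconds:
            return "primary"  # the replica may not have the write yet
        return "local_replica"
```

Even this toy version shows where the trouble starts: the fixed window is wrong the moment replication lag exceeds it, the per-process dict doesn't exist in the other region, and it says nothing about caches or file stores.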

Have you ever implemented a realistic dev/test environment for a massive infrastructure involving dozens to hundreds of services, and many different data stores, some of which are sharded? Again, no "out of the box" solution exists. You need to do something custom, and there will be plenty of cases where it doesn't accurately mirror production.

Or for a non-technical one: have you ever worked for a medium-to-large size company whose exit was via acquisition, rather than IPO? In my experience this always results in a major increase in attrition of the acquired company's top engineers. With an IPO, early folks are more incentivized to stay on; there's a better feeling of ownership, and the efforts of good talent can directly impact the stock price. But when it's an acquisition by some corporate behemoth, the opposite dynamic is at play: there's very little that the acquired company can do to impact the parent company's stock price, leading to a feeling of helplessness. Couple that with a different set of policies and values (say, a contract with a government agency that puts children in cages) and you can guess what happens.