Hacker News
by m12k 2181 days ago
> People often deploy new features deactivated, then, when the whole fleet is updated, activate features to different groups and monitor for breaking behavior change.

Even better - then they shouldn't even need to make another deploy, just flip the feature flag back off. And if you need to make network changes, then test those out behind a load balancer in parallel to the existing topology, so you can start routing more traffic to the new setup, but can stop doing so if any problems arise. I'm not saying any of this is trivial, but the point is, best practices exist to start deploying pretty much any kind of change in a way that can be undone in minutes or even seconds. When you have access to the resources and talent that GitHub has, then there are zero acceptable reasons why your site would ever be down or degraded for hours on end - zero.
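
To make the "flip the flag back off" idea concrete, here's a minimal sketch. It assumes a hypothetical in-process `FeatureFlag` class (not any particular flag library's API) and buckets users by hashing, so a partial rollout is stable per user and can be undone without a deploy:

```python
import hashlib


class FeatureFlag:
    """Hypothetical percentage-based feature flag (illustration only)."""

    def __init__(self, name, rollout_percent=0):
        self.name = name
        self.rollout_percent = rollout_percent  # 0 means off for everyone

    def _bucket(self, user_id):
        # Hash flag name + user id so each user lands in a stable bucket 0..99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % 100

    def is_enabled(self, user_id):
        # A user is in the rollout when their bucket is below the percentage.
        return self._bucket(user_id) < self.rollout_percent

    def kill(self):
        # The "undo in seconds" path: no deploy, just turn the flag off.
        self.rollout_percent = 0
```

Because the bucket depends only on the flag name and user id, raising `rollout_percent` from 10 to 50 keeps the original 10% enabled, and `kill()` turns the feature off for everyone at once. In a real system the percentage would live in some shared config store rather than in process memory.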

1 comment

There are countless ways for infrastructure to break on its own, without being tied to a specific deploy or feature flag. A few common examples in the db tier alone:

Have you ever encountered a write rate that exceeds your db replica's ability to keep up with async replication? There's nothing to "roll back" in this case, and it takes time to determine whether the increase in write rate is from legitimate usage growth vs some recent feature (possibly deployed hours/days ago) writing more than expected during peak periods vs DDoS/bot activity.
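
A rough back-of-the-envelope sketch of why this takes a while to recover from even once the cause is found. The function names and the numbers are made up for illustration, and it assumes constant rates, which real workloads don't have:

```python
def backlog_after(write_rate, apply_rate, seconds):
    """Events the replica is behind after `seconds` of sustained writes.

    Rates are in events per second. Assumes the replica starts caught up.
    """
    return max(0.0, (write_rate - apply_rate) * seconds)


def catch_up_seconds(backlog, write_rate, apply_rate):
    """Seconds to drain `backlog` once writes drop to `write_rate`."""
    if apply_rate <= write_rate:
        # The replica never catches up while writes meet or exceed apply rate.
        return float("inf")
    return backlog / (apply_rate - write_rate)
```

With a replica that applies 1000 events/s, ten minutes of writes at 1200 events/s leaves it 120000 events behind (`backlog_after(1200, 1000, 600)`). Even if writes then drop to 800 events/s, draining that takes another ten minutes (`catch_up_seconds(120000, 800, 1000)`), and nothing about that is a "rollback".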

Have you worked on multi-region infrastructure, where traffic is actively served from multiple geographic regions, with fully automated failover during regional outages? It is impossible to fully automate every possible situation -- even Google and Facebook have outages sometimes! Even just as a first step, it's hard to figure out conclusively which situations should be automated vs which ones need to alert humans.

Have you ever implemented read-after-write consistency for multi-region infrastructure, where multiple async DB replicas, caches, and backend file stores are not automatically in sync, but need to appear in sync to users making writes from non-master regions? The network latency between regions is sufficient to make this a complicated problem even when things are stable, let alone when there are other sources of replication lag to consider. There's no "out of the box" solution for this; every company needs to handle it in a way specific to their infrastructure and product.
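
For a sense of what even the simplest version involves, here's a toy sketch of one common approach: the session remembers the log position of its last write, and a read is served from the local replica only if the replica has applied at least that position. All classes and functions here (`Primary`, `Replica`, `replicate`, `read`) are hypothetical in-memory stand-ins, not any real database's API, and the sketch ignores caches and file stores entirely:

```python
class Primary:
    """Toy primary: keeps data plus an ordered log of writes."""

    def __init__(self):
        self.position = 0
        self.data = {}
        self.log = []

    def write(self, key, value):
        self.position += 1
        self.data[key] = value
        self.log.append((self.position, key, value))
        return self.position  # token the session keeps for later reads


class Replica:
    """Toy async replica: may lag behind the primary's log."""

    def __init__(self):
        self.applied_position = 0
        self.data = {}


def replicate(primary, replica, up_to):
    # Apply log entries the replica has not seen yet, up to position `up_to`.
    for pos, key, value in primary.log:
        if replica.applied_position < pos <= up_to:
            replica.data[key] = value
            replica.applied_position = pos


def read(key, session_position, primary, replica):
    # Serve locally only if the replica has caught up to the session's write.
    if replica.applied_position >= session_position:
        return replica.data.get(key), "replica"
    # Otherwise pay the cross-region latency and read from the primary.
    return primary.data.get(key), "primary"
```

Right after a write, `read` falls back to the primary; once `replicate` has applied that position, the same read is served locally. The hard parts in production are everything this leaves out: multiple replicas, caches, token propagation across services, and what to do when the primary's region is the one that's down.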

Have you ever implemented a realistic dev/test environment for a massive infrastructure involving dozens to hundreds of services, and many different data stores, some of which are sharded? Again, no "out of the box" solution exists. You need to do something custom, and there will be plenty of cases where it doesn't accurately mirror production.

Or for a non-technical one: have you ever worked for a medium-to-large size company whose exit was via acquisition, rather than IPO? In my experience this always results in a major increase in attrition of the acquired company's top engineers. With an IPO, early folks are more incentivized to stay on; there's a better feeling of ownership, and the efforts of good talent can directly impact the stock price. But when it's an acquisition by some corporate behemoth, the opposite dynamic is at play: there's very little that the acquired company can do to impact the parent company's stock price, leading to a feeling of helplessness. Couple that with differing policies and values (say, a contract with a government agency that puts children in cages) and you can guess what happens.