by th-th-throwaway 2421 days ago
People are saying there isn't enough info but I think the author definitely made the right analysis here.

The purpose of a partial rollout is to see whether the new version has bugs. What's unusual here is that version B's bug is a break in backward compatibility, which caused it to rapidly take down version A too. This means you can't simply roll back B as usual: after the rollback you still need to fix the broken A instances. It's a lot of work, but it would be the right thing to do.
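
To make that concrete, here's a rough sketch of the decision I have in mind. Everything in it is hypothetical (the `FleetHealth` fields, the `plan_recovery` name, the 1% threshold); it isn't how the company in the article does it, just an illustration that "roll back B" and "repair A" are two separate steps in this scenario.

```python
from dataclasses import dataclass


@dataclass
class FleetHealth:
    # Hypothetical counters gathered while the partial rollout is running.
    a_broken_since_rollout: int  # version A instances that broke after B appeared
    b_total: int                 # version B instances deployed so far
    b_broken: int                # version B instances that are failing


def plan_recovery(h: FleetHealth, max_b_error_rate: float = 0.01) -> list[str]:
    """Return the ordered recovery steps for a partial rollout of B."""
    b_rate = h.b_broken / h.b_total if h.b_total else 0.0
    # Assumption: any A breakage since the rollout began is blamed on B.
    b_is_bad = b_rate > max_b_error_rate or h.a_broken_since_rollout > 0
    if not b_is_bad:
        return ["continue rollout of B"]
    steps = ["roll back B"]
    if h.a_broken_since_rollout > 0:
        # Rolling back B does not heal A instances that B already broke.
        steps.append("repair broken A instances")
    return steps


if __name__ == "__main__":
    print(plan_recovery(FleetHealth(a_broken_since_rollout=30, b_total=10, b_broken=0)))
```

The point is that "push B everywhere" never shows up as a recovery step; the only way forward with B is through the normal "continue rollout" path.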

Instead, they went all in on version B to get around the bug they had just introduced. This is completely reckless. You're skipping your usual process, so you never get a chance to watch for other bugs in B. If anything, you should expect B to have even more bugs, given that you already know it has one major production-breaking bug.

Going all in on a version that you're not confident in just to fix the one bug you know about is stupid.

2 comments

With the info given, it's entirely possible that the bug was actually in version A. Perhaps B sent it a payload that is perfectly valid under the API's spec, and that payload caused A to crash.

In that world, "fixing" B could involve sending invalid or unintended data to work around the problem in A, or patching A before rolling out B (though once you're at the point of rolling out an A.2, you may as well just roll out B).

If version B is doing what you just said it's doing, it's not a partial rollout, and I'm not sure I know what it is...