I agree. Fun read and cool hack, but it definitely feels like they are stretching to justify the more fun of the two options (spend time on this or spend time fixing the root cause).
Yeah, my initial reaction was "I can see how these design decisions might make sense, but the blog post is horrible."
These kinds of designs typically emerge over a long and winding history and, for someone who was part of that process, it's difficult to coherently describe the final state to an outsider. Good textbook authors have this skill. Most tech blog authors do not. (I think that part of the problem is that people don't respect just how difficult that kind of explanation actually is.)
My guess: restarting a large fleet of processes is a pain. The rollout will typically be throttled to avoid connection churn, among other things. For risky code changes, you probably want a slow rollout anyway, but if you're just tweaking abuse detection rules (almost just a config change), it's nice to have your changes take effect more quickly. Dynamic loading seems like one reasonable way to achieve that goal.
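To make the "almost just a config change" idea concrete, here's a rough sketch of what reloading rules without a restart could look like. This is not what the blog post's authors do (they load native code); it's a Python stand-in for the same idea, and `RuleLoader`, `is_abusive`, and the rules-file layout are all names I made up for illustration.

```python
import os


class RuleLoader:
    """Hypothetical helper: re-executes a rules file whenever its mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._is_abusive = None

    def _reload_if_changed(self):
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            namespace = {}
            with open(self.path) as f:
                # Run the rules file in a fresh namespace; assumes the file is trusted.
                exec(compile(f.read(), self.path, "exec"), namespace)
            # Assumption: the rules file defines a function named is_abusive(request).
            self._is_abusive = namespace["is_abusive"]
            self._mtime = mtime

    def check(self, request):
        """Apply whichever version of the rules is currently on disk."""
        self._reload_if_changed()
        return self._is_abusive(request)
```

A long-running process would call `loader.check(request)` on each request, so pushing out a new rules file changes behavior with no restart and no connection churn. A real system would also need to think about atomically swapping the file and about what happens when the new rules fail to load.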
Tangent: people, please stop making analogies to mechanical engineering feats that are WAY more difficult than what you did [1]. People have been loading shared libraries forever; it's like adding an AUX port, not swapping out the engine. It's not even in the same league as Ksplice or as the JVM's dynamic loading/deoptimization.
You're right, I don't know all the intricacies of their system. That's why I said "I'm surprised" rather than "this is a bad design decision". It doesn't mean I can't point out potential pitfalls that I think the blog post glosses over.
They explained their justification; if they don't want random people on random forums disagreeing with their justification because it wasn't complete enough, they are free to make it more complete.
You're reading a blog post; you don't know everything they have tried, or the various intricacies they're dealing with.