by joshuamorton
2719 days ago
> Canary failures at other times trigger steady state alarms. Thus, our canaries serve both to validate new versions and continuously monitor the existing versions.
This is bad. Canaries are not alerts. Alerting should be separate from canarying. If you're getting steady state alerts from your canaries, your steady-state alerting is bad. A system that isn't undergoing change should be able to entirely disable its canary, and you should still remain confident that any changes in traffic or whatnot will be handled by other alerting tools. If not, then your canaries aren't correctly balanced, and you're getting canary failures due to traffic imbalances or something else that shouldn't be affecting a good canary. In other words, if your canary is alerting in steady state, your experimental setup is invalid and you don't have a good test/control pair. If you did, the only steady state alerts you'd get would be noise. You can't have the same experimental setup control for production differences to isolate changes due to binary version bumps, while also detecting changes in production traffic independent of binary version bumps. At least one of those is broken.

I would agree that continuous canaries are not appropriate for all applications/systems, and in fact I have pushed against initiatives to run canaries where they're not necessary.
> A system that isn't undergoing change
This assumption may not hold for sufficiently complex systems/services. A system may be spread across multiple microservices that are versioned and deployed independently, with the interdependencies between them being very difficult to capture and validate at a per-microservice level.
For example, imagine microservice A receives events from microservice B, which it then transforms and passes to microservice C. A deployment is made to service A which intentionally ignores/filters certain events from B, but it contains a bug which causes A to filter additional events that it should not. C's low traffic alarms begin firing due to a drop in received events, but from A's perspective all is well (assuming the bug is overlooked by both its unit tests and its integration tests). Likewise, B sees no issues since it is still able to hand off events to A as normal. Thus, C experiences a steady state failure without having undergone any changes itself. Without an end-to-end canary, the system is stuck in a broken state until the oncall figures out that the alerts generated by C were triggered by a change in A, and then manually rolls A back. With an end-to-end canary, A's deployment is automatically rolled back and the oncall is alerted to a pipeline blockage, which can be investigated while the system as a whole continues to function as normal.
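To make the A/B/C scenario concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the `Event` type, the two versions of service A, the event kinds, and the `deploy_with_canary` helper. It assumes the canary works by injecting synthetic events where B would emit them and checking which ones come out the other side toward C.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass(frozen=True)
class Event:
    kind: str
    synthetic: bool = False  # True for canary probes, False for real traffic

ServiceA = Callable[[Iterable[Event]], List[Event]]

def service_a_v1(events: Iterable[Event]) -> List[Event]:
    """Known-good version of A: forwards every event from B to C."""
    return list(events)

def service_a_v2(events: Iterable[Event]) -> List[Event]:
    """New version of A: meant to drop only 'debug' events,
    but a bug makes it drop 'order' events as well."""
    return [e for e in events if e.kind not in {"debug", "order"}]

def canary_passes(service_a: ServiceA, probe_kinds: List[str]) -> bool:
    """Send one synthetic event per kind through A and check that
    every one of them would reach C."""
    probes = [Event(kind, synthetic=True) for kind in probe_kinds]
    reached_c = {e.kind for e in service_a(probes) if e.synthetic}
    return reached_c == set(probe_kinds)

def deploy_with_canary(current: ServiceA, candidate: ServiceA,
                       probe_kinds: List[str]) -> ServiceA:
    """Return the version of A that should keep serving:
    the candidate if the canary passes, otherwise the current one."""
    return candidate if canary_passes(candidate, probe_kinds) else current

# The probes cover only events that C still expects after the intended change.
serving = deploy_with_canary(service_a_v1, service_a_v2, ["order", "payment"])
```

In this sketch A's own tests could pass and B would see nothing wrong, yet the end-to-end probe notices that `order` events no longer reach C, so `serving` stays on `service_a_v1`.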
Of course, it's certainly a possibility that such a scenario is indicative of poor architecture decisions, whether due to initial oversight or organic growth in complexity over time, but the reality is that such systems commonly exist and that it's very difficult to completely avoid such undesirable emergent behavior on sufficiently large timescales.
Even if your service is 'perfectly' well-architected on its own, it may have external dependencies outside its control, along with SLAs that require reporting when overall functionality is impacted by issues with those dependencies. A steady state failure may not be directly actionable but may still need to be reported to customers regardless of the underlying cause, and canaries are an excellent way to isolate and quantify such impacts independently of variable customer usage.
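As a rough illustration of quantifying impact independently of customer usage: because canary probes run at a fixed rate, their success ratio over a window does not move with customer traffic. The function names and the 99% target below are assumptions made up for this sketch, not anything from a real SLA.

```python
from typing import Sequence

def canary_availability(probe_results: Sequence[bool]) -> float:
    """Fraction of canary probes in the window that succeeded end-to-end."""
    if not probe_results:
        raise ValueError("no probe results in this window")
    return sum(probe_results) / len(probe_results)

def breaches_sla(probe_results: Sequence[bool], target: float = 0.99) -> bool:
    """True if measured availability falls below the (hypothetical) SLA target."""
    return canary_availability(probe_results) < target

# Hypothetical window: 100 probes, 2 failed during a dependency outage.
window = [True] * 98 + [False] * 2
availability = canary_availability(window)
needs_report = breaches_sla(window)
```

Since the probe rate is constant, a quiet night and a traffic spike yield the same denominator, so the number reflects the health of the system rather than how busy customers happened to be.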
As you state, ideally your experimental setup would isolate variables to the extent that alerting can be completely self-contained/decoupled without external input & measurement, and one should strive to meet this goal regardless of whether canaries are implemented or not. However, at the end of the day, your system needs to be working end-to-end at all times, and canaries can be a powerful and elegant mechanism to validate this without creating a Gordian knot of granular, entwined inter-microservice operational dependencies. Continuous canaries should not be your first line of defense against bad deployments, but they can serve well as complementary guardrails in many circumstances.