Hacker News
by joshuamorton 2719 days ago
> Canary failures at other times trigger steady state alarms. Thus, our canaries serve both to validate new versions and continuously monitor the existing versions.

This is bad.

Canaries are not alerts. Alerting should be separate from canarying. If you're getting steady-state alerts from your canaries, your steady-state alerting is bad.

A system that isn't undergoing change should be able to entirely disable its canary, and you should still remain confident that any changes in traffic or whatnot will be handled by other alerting tools. If not, then your canaries aren't correctly balanced, and you're getting canary failures due to traffic imbalances or something else that shouldn't be affecting a good canary. In other words, if your canary is alerting in steady state, your experimental setup is invalid and you don't have a good test/control pair. If you did, the only steady-state alerts you'd get would be noise.

You can't have the same experimental setup control for production differences to isolate changes due to binary version bumps, while also detecting changes in production traffic independent of binary version bumps. At least one of those is broken.

1 comments

> This is bad.

I would agree that it is not appropriate for all applications/systems, and in fact I have pushed against initiatives to run canaries where they're not necessary.

> A system that isn't undergoing change

This is an assumption which may not always be valid for sufficiently complex systems/services. Systems may be distributed amongst multiple independent microservices that are versioned and deployed independently, with the interdependencies between microservices being very difficult to capture and validate at a per-microservice level.

For example, imagine microservice A receives events from microservice B which it then transforms and passes to microservice C. A deployment is made to service A which intentionally ignores/filters certain events from B, but contains a bug which causes A to filter additional events that it should not. C's low traffic alarms begin firing due to a drop in received events but from A's perspective, all is well (assuming the bug is overlooked both by its unit tests and integration tests). Likewise, B sees no issues since it is still able to successfully hand off events to A as normal. Thus, C experiences a steady state failure without having undergone any changes itself. Without an end-to-end canary, the system is stuck in a broken state until the oncall is able to figure out that the alerts generated by C have been triggered by a change in A (requiring them to subsequently manually roll A back). Alternatively, with an end-to-end canary, A's deployment is automatically rolled back and the oncall is alerted to a pipeline blockage which can be investigated while the system as a whole continues to function properly as normal.
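A rough sketch of that scenario, for concreteness. The service functions, event kinds, and the 60% alarm threshold are all hypothetical and chosen only for illustration; none of them come from a real system.

```python
# Hypothetical model of the B -> A -> C pipeline described above.

def service_b(n_events):
    """B emits events, cycling through four made-up event kinds."""
    kinds = ["click", "view", "debug", "purchase"]
    return [{"id": i, "kind": kinds[i % len(kinds)]} for i in range(n_events)]

def service_a(events, drop_kinds):
    """A filters out the given kinds and forwards everything else to C."""
    return [e for e in events if e["kind"] not in drop_kinds]

def c_low_traffic_alarm(received, expected, min_ratio=0.6):
    """C's steady-state alarm: fires when volume falls below min_ratio of expected."""
    return received < min_ratio * expected

events = service_b(1000)

# Intended behaviour of the new A: drop only "debug" events.
intended = service_a(events, drop_kinds={"debug"})

# Buggy deployment of A: the filter also swallows "view" and "click" events.
buggy = service_a(events, drop_kinds={"debug", "view", "click"})

expected_at_c = len(intended)
print(c_low_traffic_alarm(len(intended), expected_at_c))  # alarm state with the intended filter
print(c_low_traffic_alarm(len(buggy), expected_at_c))     # alarm state with the buggy filter
```

Note that nothing in B or in A itself raises an error here; only C's volume changes, which is why the failure surfaces at C even though C was never redeployed.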

Of course, it's certainly a possibility that such a scenario is indicative of poor architecture decisions, whether due to initial oversight or organic growth in complexity over time, but the reality is that such systems commonly exist and that it's very difficult to completely avoid such undesirable emergent behavior on sufficiently large timescales.

Even if your service is 'perfectly' well-architected on its own, it may have external dependencies out of its control and SLAs that require reporting when overall functionality is impacted by issues with those dependencies. A steady state failure may not be directly actionable but still require reporting to customers regardless of the underlying cause, and canaries are an excellent way to isolate and quantify such impacts independently of variable customer usage.

As you state, ideally your experimental setup would completely isolate variables to the extent that alerting can be completely self-contained/decoupled without external input & measurement, and one should strive to meet this goal regardless of whether canaries are implemented or not. However, at the end of the day, your system needs to be working end-to-end at all times and canaries can be a powerful and elegant mechanism to validate this without creating a gordian knot of granular entwined inter-microservice operational dependencies. Continuous canaries should not be your first line of defense against bad deployments, but they can serve well as complementary guardrails in many circumstances.

So, to preface this, so we're on the same page, a canary, in the context I'm speaking, is a specific type of comparison between a known good version X, and an unknown version X'. A canary is made up of a test and control pair, which should be as similar as possible (a common pattern is you sit both behind the same load balancer). This is done to isolate outside impact. We'll get to that more in a minute.

>I would agree that it is not appropriate for all applications/systems, and in fact I have pushed against initiatives to run canaries where they're not necessary.

You misunderstand me then. I think that canarying is almost always a net positive. I am saying that canarying is not a replacement for steady-state alerting. Again, a canary should not be capable of detecting steady-state issues. If it were, it would by definition not be a controlled experiment, which is your aim.

>This is an assumption which may not always be valid for sufficiently complex systems/services.

If you are not currently running X and X' on your canary, then any attempt to detect differences between your test and control should result only in statistical noise. If you get any result that isn't statistical noise, your canary is flawed.
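To illustrate the "only statistical noise" point, here is a minimal simulation, assuming a 50/50 random split behind one load balancer and a simple error-rate comparison. The function names, error rates, and the 0.005 tolerance are made up for the example.

```python
import random

def run_canary(control_error_rate, test_error_rate, n_requests=100_000, seed=0):
    """Randomly route requests to test/control and return the error-rate difference."""
    rng = random.Random(seed)
    counts = {"control": [0, 0], "test": [0, 0]}  # [requests, errors] per arm
    for _ in range(n_requests):
        arm = "test" if rng.random() < 0.5 else "control"
        p_error = test_error_rate if arm == "test" else control_error_rate
        counts[arm][0] += 1
        if rng.random() < p_error:
            counts[arm][1] += 1
    control_requests, control_errors = counts["control"]
    test_requests, test_errors = counts["test"]
    return test_errors / test_requests - control_errors / control_requests

def canary_fails(diff, tolerance=0.005):
    """Flag the canary only when the arms differ by more than the tolerance."""
    return abs(diff) > tolerance

# Same version on both arms (X vs X): the difference should be noise.
same_version = run_canary(control_error_rate=0.01, test_error_rate=0.01)

# New version on the test arm (X vs X') with a real regression.
new_version = run_canary(control_error_rate=0.01, test_error_rate=0.03)

print(canary_fails(same_version), canary_fails(new_version))
```

If the same-version run ever fails the comparison, the split is what needs fixing, not the alert threshold.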

>Systems may be distributed amongst multiple independent microservices that are versioned and deployed independently, with the interdependencies between microservices being very difficult to capture and validate at a per-microservice level.

This is true. There are a few strategies to deal with this. One is to not worry about it, and one is to fully isolate a test stack and a control stack. I'll discuss why neither of those works the way you describe in the rest of your post.

So let's look at your microservice example. Imagine A, B, and C. We have a dataflow of B -> A -> C. Above I described two possible layouts: one where you have a fully isolated canary stack, and one where you don't. Let's look at the isolated one first. One aside before that, though: I really hope you mean "a feature flag is updated on A'", since your features really shouldn't be tied to deployments.

B -> A -> C exists, as does B' -> A' -> C'. Note that B, B' and C, C' may be identical here. A request to B goes to B or B', and then through the test or control stack entirely. Your canarying tooling detects a regression between B and B'. Note that you don't need to "constantly" canary here. When you change A', you can check the differences between B and B', A and A', and C and C'. You need only monitor those in response to a change. The downside to this is that you've artificially cut your traffic, so it's unclear if A and A' will really be getting equivalent traffic now. If you push a change to B and to A at the same time, you can't isolate it to either specific component. There are also a number of other issues that come with this plan that just make it difficult.

The other (and much, much more common) option is to not fully isolate the entire canary stack. Your layout then looks something like [B, B'] -> [A, A'] -> [C, C']. At each arrow, a request may be routed to either the control or the test instance of the next service (S or S'), and importantly you don't know which (and it shouldn't matter!). Now suppose A' has the bug and drops events. Then traffic drops to both C and C'! In expectation, C and C' will each receive half the traffic A produces and half the traffic A' produces. So if A' is producing half as much traffic as before, both C and C' will notice QPS drop to 75%. But since all a canary does is compare the traffic to C with the traffic to C', a continual canary will notice no change. You still need alerting checking globally that traffic to [C, C'] hasn't dropped.
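The 75% figure can be checked with a few lines of arithmetic. This is only a sketch of the expected-value calculation, assuming an even 50/50 split at the A -> C hop; the QPS numbers and the function name are made up.

```python
def downstream_qps(a_out, a_prime_out):
    """Expected QPS at C and C' when each of A's and A''s outputs is split evenly."""
    c = 0.5 * a_out + 0.5 * a_prime_out
    c_prime = 0.5 * a_out + 0.5 * a_prime_out
    return c, c_prime

# Before the bad change: A and A' each emit 100 QPS.
before_c, before_c_prime = downstream_qps(100.0, 100.0)

# After: A' emits half as much as before, A is unchanged.
after_c, after_c_prime = downstream_qps(100.0, 50.0)

# What each side sees relative to its own history.
drop_ratio = after_c / before_c

# What the canary sees: test traffic relative to control traffic.
canary_ratio = after_c_prime / after_c

# What a global steady-state check on [C, C'] sees.
global_ratio = (after_c + after_c_prime) / (before_c + before_c_prime)

print(drop_ratio, canary_ratio, global_ratio)
```

The canary ratio stays at 1.0 while the global ratio falls, which is the whole argument in two numbers.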

So, if you have a proper un-isolated continual canary, it won't notice any change. If you do have an isolated continual canary, you're able to run the entire canary evaluation across the stack in response to a change anywhere in the stack, so you still don't need to do it continually.

>Even if your service is 'perfectly' well-architected on its own, it may have external dependencies out of its control and SLAs that require reporting when overall functionality is impacted by issues with those dependencies. A steady state failure may not be directly actionable but still require reporting to customers regardless of the underlying cause, and canaries are an excellent way to isolate and quantify such impacts independently of variable customer usage.

No! These are all things that should be handled by non-canary alerting. Again, a well balanced canary will not detect traffic in an upstream service dropping to 0, because it will drop to zero in both the test and control environments, and comparisons between 0 and 0 are almost certainly within whatever acceptable bounds you have set.
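A small sketch of why the 0-versus-0 case slips past a canary but not past an absolute alert. Both checks are hypothetical: the 10% relative bound and the 50 QPS floor are placeholder values, not recommendations.

```python
def canary_within_bounds(control_qps, test_qps, max_relative_diff=0.1):
    """Relative comparison between control and test, as a canary would do."""
    baseline = max(control_qps, test_qps)
    if baseline == 0:
        # 0 vs 0: the two arms are indistinguishable, so the canary passes.
        return True
    return abs(control_qps - test_qps) / baseline <= max_relative_diff

def steady_state_alert(total_qps, min_expected_qps=50.0):
    """Absolute check on overall traffic, independent of any test/control pair."""
    return total_qps < min_expected_qps

# Healthy traffic: canary passes, no alert.
print(canary_within_bounds(100.0, 98.0), steady_state_alert(198.0))

# Upstream outage: both arms see zero traffic. Canary still passes, alert fires.
print(canary_within_bounds(0.0, 0.0), steady_state_alert(0.0))

# Real regression in the test arm only: canary fails.
print(canary_within_bounds(100.0, 50.0))
```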

> So, to preface this, so we're on the same page, a canary, in the context I'm speaking, is a specific type of comparison between a known good version X, and an unknown version X'.

Ah, that's the issue. You are using what I consider an 'alternative' definition of "canary". The concepts you are referring to are what I would refer to, broadly, as applying to "pre prod" (a term which I think is somewhat misleading and inappropriate itself, but that's what people call it in my circles, so meh). API Gateway, and I presume other entities, uses "canary" the same way, so you are definitely not alone: https://docs.aws.amazon.com/apigateway/latest/developerguide.... I have nits on a few of your points but overall what you're saying is sensible. Glad I understand now! Cheers.