To preface this so we're on the same page: a canary, in the sense I'm using it, is a specific type of comparison between a known-good version X and an unknown version X'. A canary is made up of a test and control pair, which should be as similar as possible (a common pattern is to put both behind the same load balancer). This is done to isolate outside impact. We'll get to that more in a minute.

>I would agree that it is not appropriate for all applications/systems, and in fact I have pushed against initiatives to run canaries where they're not necessary.

You misunderstand me, then. I think that canarying is almost always a net positive. I am saying that canarying is not a replacement for steady-state alerting. Again, a canary should not be capable of detecting steady-state alert issues. If it could, it would by definition not be a controlled experiment, which is your aim.

>This is an assumption which may not always be valid for sufficiently complex systems/services.

If you are not currently running X and X' on your canary (that is, test and control are on the same version), then any attempt to detect differences between your test and control should result only in statistical noise. If you get any result that isn't statistical noise, your canary is flawed.

>Systems may be distributed amongst multiple independent microservices that are versioned and deployed independently, with the interdependencies between microservices being very difficult to capture and validate at a per-microservice level.

This is true. There are a few strategies to deal with this. One is to not worry about it, and one is to fully isolate a test stack and a control stack. I'll discuss how both of those don't work the way you describe in the rest of your post.

So let's look at your microservice example. Imagine A, B, and C. We have a dataflow of B -> A -> C. Above I described two possible layouts: one where you have a fully isolated canary stack, and one where you don't. Let's look at the first one first.
Although, an aside first: I really hope you mean "a feature flag is updated on A'". Your features really shouldn't be tied to deployments.

B -> A -> C exists, as does B' -> A' -> C'. Note that B, B' and C, C' may be identical here. A request goes to B or B', and then through the test or control stack entirely. Your canarying tooling detects a regression between B and B'. Note that you don't need to "constantly" canary here. When you change A', you can check the differences between B and B', A and A', and C and C'. You need only monitor those in response to a change. The downside is that you've artificially cut your traffic, so it's unclear whether A and A' will really be getting equivalent traffic now. And if you push a change to B and to A at the same time, you can't isolate a regression to either specific component. There are also a number of other issues with this plan that just make it difficult.

The other (and much, much more common) option is to not fully isolate the entire canary stack. Your layout then looks something like [B, B'] -> [A, A'] -> [C, C']. At each arrow, a request may be routed to either S or S' (the control or test instance of the next service), and importantly you don't know which (and it shouldn't matter!). Then traffic drops to C and C' both! In expectation, C and C' will each receive half the traffic A produces and half the traffic A' produces. So if A' is producing half as much traffic as before, both C and C' will see QPS drop to 75%. But since all a canary does is compare the traffic to C with the traffic to C', a continual canary will notice no change. You still need alerting that checks globally that traffic to [C, C'] hasn't dropped.

So, if you have a proper un-isolated continual canary, it won't notice any change. If you do have an isolated continual canary, you're able to run the entire canary evaluation across the stack in response to a change anywhere in the stack, so you still don't need to do it continually.
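To put numbers on the un-isolated case, here's a rough back-of-the-envelope sketch in Python. The function name and the 100 QPS baseline are made up for illustration, and it assumes an ideal 50/50 split at each arrow; it only reproduces the arithmetic above, not any real canary tooling.

```python
def downstream_qps(qps_from_a: float, qps_from_a_prime: float) -> tuple[float, float]:
    """Expected QPS at C and C' when every request is routed 50/50.

    Hypothetical helper: assumes each request from A or A' is equally
    likely to land on C or C', as in the un-isolated layout.
    """
    total = qps_from_a + qps_from_a_prime
    return total / 2, total / 2


# Baseline: A and A' each emit 100 QPS (illustrative number).
before_c, before_c_prime = downstream_qps(100.0, 100.0)

# After the change: A' emits half as much traffic as before.
after_c, after_c_prime = downstream_qps(100.0, 50.0)

# What the canary compares: C' against C at the same moment.
canary_ratio = after_c_prime / after_c

# What steady-state alerting compares: total traffic now against baseline.
global_ratio = (after_c + after_c_prime) / (before_c + before_c_prime)

print(f"canary ratio C'/C: {canary_ratio}")
print(f"global traffic vs baseline: {global_ratio}")
```

The canary ratio stays at 1.0 while the global ratio falls to 0.75, which is why the test/control comparison stays silent and only a check against the baseline catches the drop.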
>Even if your service is 'perfectly' well-architected on its own, it may have external dependencies out of its control and SLAs that require reporting when overall functionality is impacted by issues with those dependencies. A steady state failure may not be directly actionable but still require reporting to customers regardless of the underlying cause, and canaries are an excellent way to isolate and quantify such impacts independently of variable customer usage.

No! These are all things that should be handled by non-canary alerting. Again, a well-balanced canary will not detect traffic from an upstream service dropping to 0, because it will drop to zero in both the test and control environments, and a comparison between 0 and 0 is almost certainly within whatever acceptable bounds you have set.
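As a minimal sketch of that last point: the two checks below are hypothetical (the names, the 10% tolerance, and the 50 QPS floor are assumptions, not anyone's real tooling), but they show why a relative test-vs-control comparison passes at 0 vs 0 while a steady-state threshold fires.

```python
def canary_passes(control_qps: float, test_qps: float, tolerance: float = 0.1) -> bool:
    """Hypothetical canary check: relative difference between test and control."""
    baseline = max(control_qps, test_qps)
    if baseline == 0:
        # 0 vs 0: there is no difference between the two sides to detect.
        return True
    return abs(control_qps - test_qps) / baseline <= tolerance


def steady_state_alert(total_qps: float, min_expected_qps: float) -> bool:
    """Hypothetical steady-state alert: fires when total traffic is below a floor."""
    return total_qps < min_expected_qps


# Upstream traffic drops to zero for both test and control.
control_qps, test_qps = 0.0, 0.0

canary_ok = canary_passes(control_qps, test_qps)
alert_fired = steady_state_alert(control_qps + test_qps, min_expected_qps=50.0)

print(f"canary passes: {canary_ok}")
print(f"steady-state alert fired: {alert_fired}")
```

The canary reports everything is fine, and by design it should: nothing differs between test and control. Only the steady-state check knows what "normal" traffic looks like.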
Ah, that's the issue. You are using what I consider an 'alternative' definition of "canary". The concepts you are referring to are what I would refer to, broadly, as applying to "pre prod" (a term which I think is somewhat misleading and inappropriate itself, but that's what people call it in my circles, so meh). API Gateway, and I presume other entities, uses "canary" the same way, so you are definitely not alone: https://docs.aws.amazon.com/apigateway/latest/developerguide.... I have nits on a few of your points but overall what you're saying is sensible. Glad I understand now! Cheers.