by notyourday 2334 days ago
The blast radius of CD covers the company's bank accounts.

The only way to safely do CD in a customer-facing non-toy project (i.e. there's revenue attached to the project) is to have a stack built from the ground up to support request pinning, request tracing and request shifting, in addition to the regular instrumentation, monitoring and 100% integration test coverage.

One can probably do it if that one is Google or Facebook. You aren't Google or Facebook. Don't be a hero. Don't do CD.

5 comments

There seems to be a misunderstanding regarding what Continuous Delivery (CD) is, as well as what it requires.

CD doesn't necessarily refer to Deployment of Delivered artifacts. Instead, once code is merged back into your release branch (sometimes master branch, sometimes tag, sometimes an actual release branch), it should be made available to be launched to production. This doesn't mean more testing isn't required and it doesn't mean that humans are not involved. Continuous Deployment is the process of releasing software to production automatically.

If you did in fact mean Continuous Deployment then you do not need 100% integration test coverage or anything of that nature. Typically, the process might work by releasing a canary deploy which receives a fraction of the traffic going to production. In the unlikely event a breaking change has silently made it through your other tests, it should likely be caught when it is deployed as a canary. If in the extremely unlikely scenario it makes it through both then the existing pipeline you have can be used to "roll-forward" a fix automatically instead of via the human process that is much more likely to fail.
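To make the canary step concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration, not a description of any particular pipeline: the names `route` and `canary_verdict`, the 5% traffic fraction, the 2x error-ratio limit, the 1% error-rate floor and the 100-request minimum are all hypothetical.

```python
import hashlib

CANARY_FRACTION = 0.05  # hypothetical: 5% of traffic goes to the canary


def route(request_id: str, canary_fraction: float = CANARY_FRACTION) -> str:
    """Deterministically send a fixed fraction of requests to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "canary" if bucket < canary_fraction else "stable"


def canary_verdict(stable_errors: int, stable_total: int,
                   canary_errors: int, canary_total: int,
                   max_ratio: float = 2.0, min_requests: int = 100) -> str:
    """Compare the canary error rate with the stable error rate.

    All thresholds are illustrative assumptions.
    """
    if canary_total < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    stable_rate = stable_errors / stable_total if stable_total else 0.0
    canary_rate = canary_errors / canary_total
    # Roll back if the canary is clearly worse than stable, with a 1% floor
    # so that a near-zero stable error rate does not trigger on noise.
    if canary_rate > max(stable_rate * max_ratio, 0.01):
        return "rollback"
    return "promote"
```

Hashing the request ID keeps routing stable for a given request, and the verdict function is the part a pipeline would poll before promoting or removing the canary.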

This may seem like a lot to you, but you are front-loading quite a bit of work and a mature pipeline is much less error prone than manual deploys.

Note that "canary deployment" (I hate the usage of 'deploy' as a noun.) is only possible under specific circumstances. You would not want to do that with your database, for instance (aka, "0.5% of your data might now be corrupt or lost") or your development tooling (aka, "0.5% of your deployed code might not do what it should do according to its source code").
You also need a large enough user base/level of traffic where releasing to a small percentage of your users would even uncover a problem.

I've had similar problems with A/B testing where a low amount of traffic means I would have to wait a long time to get meaningful data during which confounding variables naturally get introduced.
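A rough way to put numbers on the traffic point is sketched below, under strong simplifying assumptions: each request fails independently with the same probability, and "uncovering a problem" means seeing at least one failure. The function name and parameters are hypothetical.

```python
def detection_probability(requests_per_hour: float, canary_fraction: float,
                          failure_rate: float, hours: float) -> float:
    """Probability that the canary sees at least one failing request.

    Assumes independent requests that each fail with probability
    failure_rate, which is a simplification.
    """
    canary_requests = requests_per_hour * canary_fraction * hours
    return 1 - (1 - failure_rate) ** canary_requests
```

For example, with 1,000 requests per hour, a 1% canary and a bug that breaks 1% of requests, the canary only handles about 10 requests in an hour, so the chance of seeing even one failure in that hour is below 10%. With 1,000,000 requests per hour the same canary would almost certainly hit it.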

> If you did in fact mean Continuous Deployment then you do not need 100% integration test coverage or anything of that nature. Typically, the process might work by releasing a canary deploy which receives a fraction of the traffic going to production. In the unlikely event a breaking change has silently made it through your other tests, it should likely be caught when it is deployed as a canary

You have just described 100% integration test coverage.

No? They just described QA testers or actual users finding the bugs.

i.e. the change made it through the tests, and now is deployed live, where a human may encounter and report it.

That's applicable to a toy project.

If it is not a toy project then QA testing does not scale, because there are thousands if not millions of paths that a request can take through the infrastructure based on a user's actions.

That's why it is imperative to augment and nearly replace QA with end-to-end integration tests. Not only that, but the integration tests must be funneled through the production gates to have reasonable assurance that the code works in the production configuration -- such as production request/task budgets, production latency and production tracking/tracing.

That in turn requires the test code to be gated from production traffic (which requires infrastructure that supports request pinning), to be monitored/instrumented (which requires request tracing), and to have automatic actions attached (an error rate of 77 req/sec when the test generates 4 req/sec means that non-test traffic is landing on the test deploy: remove the test deploy immediately).
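The automatic action in that example can be sketched as a single check. This is a minimal illustration in Python; the function name `gate_action` and the returned action strings are hypothetical.

```python
def gate_action(observed_error_rps: float, test_rps: float) -> str:
    """Decide what to do with a gated test deploy.

    A test that sends test_rps requests per second cannot by itself cause
    more than test_rps errors per second, so a higher error rate implies
    that non-test traffic is reaching the test deploy.
    """
    if observed_error_rps > test_rps:
        return "remove_test_deploy"
    return "keep"
```

With the numbers above, `gate_action(77, 4)` returns `"remove_test_deploy"`.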

So we are back to integration tests with nearly 100% coverage to allow for unattended deploys.

I've done a lot of continuous deployment to staging and dev, with regular 'releases' to prod. It doesn't need to go straight out to prod every time you push some code.
If you are deploying to your company-wide dev, then you need to apply the same principles as you apply to prod, because blowing up dev would make the other developers using the environment unproductive, which costs money, i.e. you have a company bank account in the blast radius.

If you are deploying to your own dev, then you do not need CD because you are only doing deploys when you sit in front of your computer actively working, which means you can type a command/trigger a deploy when you need it.

The key insight here is that all code deployments have a risk of going wrong in a way that hurts the company’s bank account, and human beings following instructions to be careful do very little to mitigate that risk. This doesn’t mean CD is the right tradeoff for everyone, but it doesn’t change the problem space in some fundamental way - all of the tools you mention mitigate approximately the same risks regardless of how you develop your code.
CD makes developers careless because they start blindly relying on "it always works".
I would claim the opposite: CD makes each developer responsible for their own mess.

You make a mistake, it will end up blowing up in production. Without CD you have a batch of changes, and the responsibility becomes diffused.

The main gain I see from not doing continuous deployment is to manage the expectations of customers on reliability during a deployment. (We are deploying every X, at Y, expect some turbulence).

> I would claim the opposite: CD makes each developer responsible for their own mess.

> You make a mistake, it will end up blowing up in production. Without CD you have a batch of changes, and the responsibility becomes diffused.

If that's the case, then the project is too small and CD is overhead.

If the teams are large and CD is used for deploys then you need to have the supporting infrastructure at every step, i.e. end-to-end integration tests, instrumentation, monitoring and automatic actions based on the instrumentation and monitoring. That's the unsexy part that's missing.

> The main gain I see from not doing continuous deployment is to manage the expectations of customers on reliability during a deployment. (We are deploying every X, at Y, expect some turbulence).

Making deploys hitless should be a step taken before going into CD.

I don't think that's true. It's just that less automated pipelines have less of an expectation that new code releases are going to work in the first place. In fully traditional processes, where the release is manually built from scratch every N weeks, there's often no expectation at all - your customers just insist on a scheduled release day, and plan to spend it running around solving the problems the new release caused.
It's more insidious than that, IMO. We've created a culture where severe bugs impacting your customers are OK, even expected. Developers don't try very hard to preemptively stomp bugs, because there's no pressure for doing so.

I'm a bit old fashioned in this regard, but I'd prefer that we didn't accept that "bugs in production happen" as normal and instead treat the bugs as a failure of process.

IME that culture, where it exists, varies from organisation to organisation and is driven from the top.

I think a lot of companies don't care about bugs as much as their developers do though, and that isn't necessarily irrational behaviour.

Some kinds of customer facing bugs really are ok to just let happen.

> Some kinds of customer facing bugs really are ok to just let happen.

That's the exact mindset of the culture I mentioned. You can justify any kind of bug hitting production with it. It does the exact opposite of pushing things towards a better state; it aims for mediocrity.

Even if striving for zero bugs in production is functionally impossible, it encourages developers (and the entire company) to move towards quality, not away from it. Striving for 100% is what drives change.

Correct, it aims for mediocrity. If the customer requested 50 McDonald's hamburgers, is it a cultural problem that you're not delivering one steak?
What is 'request pinning, request tracing, request shifting'?
Being able to pin a set of requests to a specific path inside the infrastructure so they touch only specific instances running specific versions, without adding "if blah-blah-blah" custom code into every deploy.

Being able to trace every request across the entire infrastructure including all sub-requests the original request triggered.

Being able to observe issues pinned requests caused, automatically trigger an action when an error rate exceeds the threshold, typically removing the pinning.
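As a toy illustration of how the three pieces fit together, here is a minimal in-memory sketch in Python. It is not how any real load balancer or service mesh implements this: the header names `x-pin-key` and `x-trace-id`, the `PinRouter` class, and the 5% error threshold with a 20-request minimum are all assumptions made up for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PinRouter:
    """Routes pinned requests to a specific version, unpins on errors."""
    default_version: str
    error_threshold: float = 0.05   # hypothetical: unpin above 5% errors
    min_requests: int = 20          # hypothetical: wait for 20 requests
    pins: dict = field(default_factory=dict)   # pin key -> version
    stats: dict = field(default_factory=dict)  # pin key -> [errors, total]

    def pin(self, key: str, version: str) -> None:
        """Request pinning: requests carrying this key go to `version`."""
        self.pins[key] = version
        self.stats[key] = [0, 0]

    def route(self, headers: dict) -> str:
        """Pick a version from the pin header, no per-deploy custom code."""
        return self.pins.get(headers.get("x-pin-key"), self.default_version)

    def record(self, headers: dict, ok: bool) -> None:
        """Request shifting: remove the pin once the error rate is too high."""
        key = headers.get("x-pin-key")
        if key not in self.pins:
            return
        counts = self.stats[key]
        counts[0] += 0 if ok else 1
        counts[1] += 1
        errors, total = counts
        if total >= self.min_requests and errors / total > self.error_threshold:
            del self.pins[key]  # traffic falls back to the default version


def sub_request_headers(parent: dict) -> dict:
    """Request tracing: sub-requests inherit the trace ID and the pin."""
    child = {"x-trace-id": parent.get("x-trace-id") or uuid.uuid4().hex}
    if "x-pin-key" in parent:
        child["x-pin-key"] = parent["x-pin-key"]
    return child
```

In this sketch the pin travels with the request and its sub-requests as a header, so individual services never need version-specific branches, and removing the pin is a single operation on the router.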

I'm asking myself the same thing... but I think he is referring to Load Balancing and Logging and maybe Public Key Pinning? Not sure... I'm however sure that Google and Facebook are not the only ones who do Load Balancing and Logging and therefore aren't the only ones who can benefit from CD :)
The opposite is true.