by sourceless 537 days ago
I think unfortunately the conclusion here is a bit backwards; de-risking deployments by improving testing and organisational properties is important, but is not the only approach that works.

The author notes that there appears to be a fixed number of changes per deployment and that it is hard to increase - I think the 'Reversie Thinkie' here (as the author puts it) is actually to decrease the number of changes per deployment.

The reason those meetings exist is because of risk! The more changes in a deployment, the higher the risk that one of them is going to introduce a bug or operational issue. By deploying small changes often, you get to deliver value much sooner and fail smaller.
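
As a rough illustration of that risk argument: assuming, purely for the sake of the sketch, that each change independently carries the same small chance of causing an incident, the chance that a deployment contains at least one bad change climbs quickly with batch size. The 1% figure and the function name are made up.

```python
def deployment_failure_probability(changes: int, p_per_change: float = 0.01) -> float:
    """Chance that at least one of `changes` independent changes causes an incident."""
    return 1 - (1 - p_per_change) ** changes


# Compare a few batch sizes under the assumed 1% per-change risk.
for batch_size in (1, 5, 20, 50):
    print(batch_size, round(deployment_failure_probability(batch_size), 3))
```

Real changes are not independent, so treat this as intuition rather than a model. Smaller batches also make it easier to work out which change caused a failure.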

Combine this with techniques such as canarying and gradual rollout, and you enter a world where deployments are no longer flipping a switch and either breaking or not breaking - you get to turn outages into degradations.
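
A minimal sketch of what a gradual rollout loop could look like. The two callables stand in for whatever your load balancer and monitoring actually expose, and all names here are hypothetical.

```python
from typing import Callable, Sequence


def gradual_rollout(
    stages: Sequence[int],
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> int:
    """Shift traffic to the new version stage by stage.

    Returns the share of traffic left on the new version: the final stage
    if every health check passed, or 0 if a check failed and we backed out.
    """
    for percent in stages:
        set_traffic_percent(percent)
        if not is_healthy():
            # Only `percent` of traffic ever saw the bad version.
            set_traffic_percent(0)
            return 0
    return stages[-1] if stages else 0


# Example with fake infrastructure: the new version falls over at 50% traffic.
history = []
final = gradual_rollout([1, 10, 50, 100], history.append, lambda: history[-1] < 50)
print(final, history)
```

The point of the stages is that a bad release is stopped while it is serving a fraction of users, which is the "degradation instead of outage" effect.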

This approach is corroborated by the DORA research[0], and covered well in Accelerate[1]. It also features centrally in The Phoenix Project[2] and its spiritual ancestor, The Goal[3].

[0] https://dora.dev/

[1] https://www.amazon.co.uk/Accelerate-Software-Performing-Tech...

[2] https://www.amazon.co.uk/Phoenix-Project-Helping-Business-An...

[3] https://www.amazon.co.uk/Goal-Process-Ongoing-Improvement/dp...


> The reason those meetings exist is because of risk! The more changes in a deployment, the higher the risk that one of them is going to introduce a bug or operational issue.

Having worked on projects that were perfectly full CD and also projects that had biweekly releases with meetings with release engineers, I can state with full confidence that risk management is correlated with these meetings but is an indirect, secondary factor.

The main factor is quite clearly how much time and resources an organization invests in automated testing. If an organization has the misfortune of having test engineers who lack the technical background to do automation, they risk never breaking free of these meetings.

The reason why organizations need release meetings is that they lack the infrastructure to test deployments before and after rollouts, and they lack the infrastructure to roll back changes that fail once deployed. So they make up for this lack of investment by adding ad-hoc manual checks in place of automated ones. If QA teams lack any technical skills, they will push for manual processes as self-preservation.

To make matters worse, there is also the propensity to pretend that having to go through these meetings is a sign of excellence and best practices, because if you're paid to mitigate a problem obviously you have absolutely no incentive to fix it. If a bug leaks into production, that's a problem introduced by the developer that wasn't caught by QAs because reasons. If the organization has automated tests, it's even hard to not catch it at the PR level.

Meetings exist not because of risk, but because organizations employ a subset of roles that require risk to justify their existence and lack skills to mitigate it. If a team organizes its efforts to add the bare minimum checks to verify a change runs and works once deployed, and can automatically roll back if it doesn't, you do not need meetings anymore.
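
To make "bare minimum checks plus automatic rollback" concrete, here is a small sketch. `deploy` and `smoke_test` are placeholders for your own tooling, not any real library's API.

```python
from typing import Callable


def deploy_with_rollback(
    deploy: Callable[[str], None],
    smoke_test: Callable[[], bool],
    new_version: str,
    previous_version: str,
) -> str:
    """Deploy `new_version`, verify it, and restore `previous_version` on failure.

    Returns the version that is live when the function finishes.
    """
    deploy(new_version)
    try:
        healthy = smoke_test()
    except Exception:
        # A crashing check is treated the same as a failing one.
        healthy = False
    if healthy:
        return new_version
    deploy(previous_version)
    return previous_version


# Example: the smoke test fails, so the previous version is redeployed.
deployed = []
print(deploy_with_rollback(deployed.append, lambda: False, "v2", "v1"), deployed)
```

Even a check this crude replaces the manual "did it come up?" step that release meetings tend to exist for.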

This is very well said and succinctly summarizes my frustrations with QA. My experience has been that non-technical staff in technical organizations create meetings to justify their existence. I’m curious if you have advice on how to shift non-technical QA towards adopting automated testing and fewer meetings.
Hi, senior SRE here who was a QA, then QA lead, then lead automation / devops engineer.

QA engineers with little coding experience should be given simple automation tasks, along with similar existing tests, documentation, and people to ask questions. I.e. set up a pytest framework that has a few automated test examples, and then have them write similar tests. The automated tests are just TAC (tests as code) versions of the manual test cases they should already be writing, so they should have some idea of what they need to do, and then Google, ChatGPT, or automation engineers should be able to help them start translating that to code.
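
For example, a starter file in that kind of framework might look like the sketch below. `shipping_cost` is a made-up stand-in for whatever the team actually tests. Each test function mirrors one manual test case, and pytest picks up plain `test_*` functions without extra boilerplate.

```python
# test_shipping.py -- run with `pytest test_shipping.py`


def shipping_cost(order_total: float) -> float:
    """Hypothetical function under test: free shipping from 50.00, else 4.99."""
    return 0.0 if order_total >= 50.0 else 4.99


# Manual case: "Add items worth 49.99 to the cart; shipping should be 4.99."
def test_shipping_is_charged_below_threshold():
    assert shipping_cost(49.99) == 4.99


# Manual case: "Add items worth exactly 50.00; shipping should be free."
def test_shipping_is_free_at_threshold():
    assert shipping_cost(50.0) == 0.0
```

A newcomer's first task can then be as small as copying one of these and changing the inputs to cover another manual case.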

People with growth mindsets and ambitions will grow from the support and being given the chance to do the things, while some small number will balk and not want anything to do with it. You can lead a horse to water and all that.

We are in the early stages of something like this in my org. QA has been writing tests in some form for a while, and it’s mostly been at a self-led level. We have a senior engineer per-application responsible for tooling and guidance, and the QA testers have been learning Java/script (depending on the application, teams we don’t interface with are writing theirs in C# iirc). With the new year, we are starting a phased initiative to ramp up all of QA to be Software Engineers in Testing - each phase will teach and guide and impart the skills needed to be fully sufficient to write automation tests in tandem with engineers writing features.

It’s an interesting and bold initiative imo, as I’ve often worked at places that let QA do whatever felt best. That is good from the standpoint of letting them work within their comfort zone, but it also means that testing will largely plateau. I haven’t seen a real push for automation _not_ come out of the engineering department personally (because I’m the one pushing it every time), though I know this place has at least done some work with various automation systems in the past.

> The main factor is quite clearly how much time and resources an organization invests in automated testing.

For context, I think it's worth reflecting on Beck's background, eg as the author of XP Explained. I suspect he's taking even TDD for granted, and optimizing what's left. I think even the name of his new blog—"Tidy First"—is in reaction to a saturation, in his milieu, of the imperative to "Test First".

I think we may be violently agreeing - I certainly agree with everything you have said here.
I tend to agree. Whenever I've removed artificial technical friction, or made a fundamental change to an approach, the processes that grew around them tend to evaporate, and not be replaced. I think many of these processes are a rational albeit non-technical response to making the best of a bad situation in the absence of a more fundamental solution.

But that doesn't mean they are entirely harmless. I've come across some scenarios where the people driving decisions continued to reach for human processes as the solution rather than a workaround, for both new projects and projects designated specifically to remove existing inefficiencies. They either lacked the technical imagination, or were too stuck in the existing framing of the problem, and this is where people who do have that imagination need to speak up and point out that human processes need to be minimised with technical changes where possible. Not all human processes can be obviated through technical changes, but we don't want to spread ourselves thin on unnecessary ones.

So this seems quantifiable as well - there must be a number of processes / components that a business is made up of, and those presumably are also weighted (payment processing has weight 100, HR holiday requests weight 5 etc).

I would conjecture that changing more than 2% of processes in any given period is “too much” - but one can certainly adjust that.
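
A quick sketch of how that could be computed. The weights for payment processing and HR holiday requests come from the example above. The other processes, their weights, and the function names are invented, and the 2% threshold is just the conjecture.

```python
from typing import Mapping, Set


def weighted_change_fraction(weights: Mapping[str, float], changed: Set[str]) -> float:
    """Share of total process weight touched in one period."""
    total = sum(weights.values())
    touched = sum(w for name, w in weights.items() if name in changed)
    return touched / total if total else 0.0


def within_change_budget(
    weights: Mapping[str, float], changed: Set[str], budget: float = 0.02
) -> bool:
    """True if the weighted share of changed processes stays within `budget`."""
    return weighted_change_fraction(weights, changed) <= budget


weights = {
    "payment_processing": 100,
    "hr_holiday_requests": 5,
    "invoicing": 150,   # invented
    "reporting": 120,   # invented
    "onboarding": 125,  # invented
}
print(within_change_budget(weights, {"hr_holiday_requests"}))
print(within_change_budget(weights, {"payment_processing"}))
```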

And I suspect that this varies by area (i.e. the payment processing code has a different team than the HR code), so it would be sensible to rotate releases (or possibly teams): this period one team works on the hard stuff, but once that goes live the team is rotated back out to tackle easier stuff, whether in payment processing or HR.

The same principle applies to attacking a trench, moving battalions forward and combined arms operations.

Now that is of course a “management” problem - but one can easily see how to automate a lot of it - and how other “sensory” inputs are useful (i.e. which teams have committed code to these sensitive modules recently).

One last point is it makes nonsense of “sprints” in Agile/Scrum - we know you cannot sprint a whole marathon, so how do you prepare the sprints for rotation?

There are no sprints in agile. ;)

On the contrary, per the Manifesto:

> Agile processes promote sustainable development.

> The sponsors, developers, and users should be able to maintain a constant pace indefinitely.

I am really interested in organizations' capacity to absorb changes.

I live in the B2B SaaS space, and as far as development goes we could release daily. But on the receiving side we get pushback. Of course there can be feature flags, but then that would create a “not enabled feature” backlog.

In the end features are mostly consumed by people and people need training on the changes.

I think that really depends on the product. I worked on an on-prem data product for years and it was crucial to document all changes well and give customers time to prepare. OTOH I also worked on a home inspection app and there users gave us pushback on training because the app was seen as intuitive.
> ...there users gave us pushback on training because the app was seen as intuitive

I would weep with joy to receive such feedback! Too often the services I work on have long histories with accidental UIs, built to address immediate needs over and over.

This was a greenfield app. For all I know by now accommodating edge cases that almost never matter has made the thing unusable.
I agree entirely - I use the same references, I just think it's bordering on sacrilege what you did to Mr. Goldratt. He has been writing about flow and translating the Toyota Production System principles and applying physics to business processes way before someone decided to write The Phoenix Project.

I loved The Phoenix Project, don't get me wrong, but compared to The Goal it's like a cheaply produced adaptation of a "real" book, made so that people in the IT industry don't get scared when they read about production lines and run away saying "but I'm a PrOgrAmmEr, and creATIVE woRK can't be OPtiMizEd like a FactOry".

So The Phoenix Project if anything is the spiritual successor to The Goal, not the other way around.

That’s exactly what the GP wrote: The Goal is the spiritual ancestor of The Phoenix Project.
Well now I can't tell if it was edited or if I just misread and decided to correct my own mistake. I'll leave it be so I remember next time, thanks.
That's indeed how I wrote it, but I could have worded it better. Very much agree that the insights in The Goal go far beyond the scope of The Phoenix Project.
I totally read it as successor as well. Interesting how the brain fills in what we expect to see :)
> By deploying small changes often, you get deliver value much sooner and fail smaller.

Which increases the number of changes per deployment, feeding the overhead cycle.

He is describing an emergent pattern here, not something that requires intentional culture change (like writing smaller changes). You’re not disagreeing but paraphrasing the article’s conclusion:

> or the harder way, by increasing the number of changes per deployment (better tests, better monitoring, better isolation between elements, better social relationships on the team)

I am disagreeing with the conclusion of the article, and asserting that more and smaller deployments are the better way to go.
You are not. The conclusion of the article is the same, you "need to expand the far end of the hose" by increasing deployment rate or making more, smaller changes. What was your interpretation?
My reading was that there were two paths the author highlights:

1) Increase deployment capacity (which I'm reading as frequency, and I fully agree with)

2) Increase change capacity per deployment by making it less likely that a set of changes will fail through tests, monitoring, structural, and team changes

#2 is very much geared to "ship more changes in one deployment" which is where my disagreement lies. I think you should still do all those things, but that increasing the size of the bundle is explicitly an anti-goal.

I think you're better off, as a rule of thumb, making fewer changes per deployment if you want to reduce risk.

But -- that is my particular reading of it.

My reading is that the author posits there is a fixed amount of change that can be safely made in a single deployment. The solution is to make it possible to deploy more frequently. This is hard, so organizations will often instead introduce overhead that slows down changes. Engineers might be tempted to blame the overhead and try to eliminate it, but that won't be successful and may even backfire. They need to tackle the underlying issue of deployment capacity instead.
This isn't even a software thing. It's any production process. The more work-in-progress items there are, and the longer they stay in progress, the greater the risk and the greater the amount of work. Shrink the batch, shorten the release window.

It infuriates me that software engineering has had to rediscover these facts when the Toyota Production System was developed between 1948 and 1975 and knew all these things 50 years ago.
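
The batch-size point can be put in numbers with Little's Law (average work in progress = throughput × average lead time), a standard queueing result that is not named above. The figures below are made up.

```python
def average_lead_time(work_in_progress: float, throughput_per_day: float) -> float:
    """Little's Law rearranged: lead time = WIP / throughput."""
    return work_in_progress / throughput_per_day


# Same team throughput (5 items/day), two different amounts of work in progress.
for wip in (40, 10):
    print(wip, "items in progress ->", average_lead_time(wip, 5), "days average lead time")
```

With throughput held constant, cutting work in progress is the only lever that shortens how long each item waits.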