Hacker News
by ChazDazzle 1042 days ago
> I learnt a few things from this. The first is, some companies or organisations within them need failure in order to progress.

Encountered a similar situation a few years back while working for a government client. My advice to the team was that we needed to “let the train wreck happen.”

3 comments

I agree with this, but I think you also have to be ready to document the course of events and (in the most diplomatic way possible) say “I told you so”.

Otherwise the people who are accountable for the failure will say “oh who could have seen this coming” and refuse to learn anything for the next instance.

Yeah this stood out to me as well.

I think the reason for this is that observed production failure is _certain_. Hard fails are undeniable. They obviously need to be addressed and obviously deserve resources. The amount of deserved resources can be clearly calculated by projecting the concrete, observed costs of the failure forward in time.

Before prod goes down, there is much more uncertainty:

- even an expert engineering assessment has some level of uncertainty

- engineering may not have a full appreciation for the business context of the work, and might over-weight technical issues relative to other concerns

- if the engineers are contractors, or otherwise organizationally distant from the experience owners, that inserts a trust gap which further increases uncertainty

- the business owner’s projections are themselves uncertain. Is the expected launch volume really that high, or is it aspirational?

- the costs of failure are uncertain too… if the system goes down, how hard will it go down? What will that actually cost in lost revenue? Fuzzier stuff like brand reputation is even harder to quantify.

Meanwhile the costs of paying the contracted development team another 2 months on the same project are quite concrete. The team already spent significant political capital to force a change on an incumbent team. Now they’re saying they want more money because it still doesn’t work??

The big open question is - what was the cost of the failed launch? How long did it take to get the system back up and running at scale? What did it cost in terms of user loyalty? How does that compare to the concrete cost of holding launch until the auth system was upgraded?

Different people will answer those questions in different ways. What matters is how the customer answers those questions, whether their bosses believe that answer, and their bosses’ judgment of the overall situation.

Yea this is a good take, but it’s not as pessimistic as it sounds. “I told you so” never works, even if you explicitly reserve the right to use it, and even as a joke.

What does work though is having a prepared refactoring plan ready to go when it all blows up. “Do that risky refactor we’ve been avoiding because of downtime? Well, it’s all down now, isn’t it?”