HN Mirror



	by TehShrike 622 days ago
	If you work a job where having one of these incidents once a year or more is "normal" then the dev team needs to devote most of its time to fixing that, or you need to change employers.

4 comments

TehShrike 622 days ago

Another way to say it: for most software jobs, customer-facing downtime is downstream of a development skill issue that can and should be fixed.

Some (many?) employers make this difficult, and you should try to leave them.


erikaww 622 days ago

I think you mean organizational issue, not development skill issue. If shit is constantly hitting the fan, that is the org's fault, not the engineers'.


TehShrike 622 days ago

Yeah, probably.

What I mean to imply is that it is an issue that is naturally fixed by improved development, and that the fix does require development skill, but the organization can hamstring its developers and prevent them from fixing the issue even when they could.


rsanek 622 days ago

An incident only once a year is an absurd bar. I'm no fan of on-call, but ensuring that level of incident avoidance would force the company to move at glacial speeds, which is even worse over the long term than getting paged.

I think my sweet spot is somewhere between once a week and once a month, spread across the whole team.


TehShrike 620 days ago

An incident that requires immediate developer intervention, rather than waiting until tomorrow? It seems like you would have to go out of your way to create a system so fragile that this happened once a month.


grecy 622 days ago

I worked at a telco that served a few tens of thousands of customers in a huge remote region.

There were so many systems held together with baling wire that it was rare to go a day without a significant outage, usually multiple. Everyone who was remotely knowledgeable about tech was basically a firefighter.


aprilthird2021 622 days ago

I don't think this takes into account the reality of huge megacorps with tons of development teams situated globally who are constantly changing the codebase.

Incidents happen as code changes. Even once you fix one, the changing nature of the code can introduce more issues.


TehShrike 622 days ago

I've never worked at a megacorp, but if megacorp employees believe that it is more acceptable for them to cause issues for customers than a 3-dev company, that really seems like a skill issue for the megacorp.

If it is unacceptable to cause that downtime, you write code that makes the downtime much less likely.


collingreen 622 days ago

I expect the scale here is not apples to apples. A three-person team is often on a small product, and downtime is often a catastrophe, as in truly broken for customers. Meanwhile, a megacorp often has many, many large products, and downtime usually means a piece of one of them is degraded.

My random guess is that the "downtime" is fairly proportional to the scale difference, with megas probably taking the edge.


beacon294 622 days ago

Often it's from slippage between two teams' systems where a contract never existed. Often even the relationship causing the incident is unclear.
