by Arainach 51 days ago

This is a good post, but once again incentives rear their ugly head.

> Second, preventing or mitigating an incident early (even by just knowing the right feature flag to turn off) can save huge amounts of money: both immediate lost revenue during the incident and future lost revenue from customers who would have pulled their business or refused to sign pending contracts.

Time and time again at many companies, including well-reputed ones, I have seen that preventing issues gets you no recognition, but building a giant pile of kindling and then putting out the inevitable fire will get you recognition twice. Even in "good" orgs.

I've never been able to commit to the game theory politics enough to intentionally ship garbage fast and take that credit; I take too much pride in my work. But I have spent 5+ years managing and growing a framework designed to eliminate classes of issues that plagued the last version of our product, and I have watched partner teams who ship garbage code and cause outages get public credit for fixing those outages, while my team, despite attempting to advocate for itself, get no credit for not having such outages because you can't measure that.

9 comments

netcoyote 51 days ago

One of the tricks that we can use as good managers is code ownership. The folks who wrote the code are the ones who get to fix the bugs in the code.

While they’re busy fixing their own problems, the teams that wrote outage-free code get first dibs on writing new systems.

On the (online game) teams I worked on there are an infinite number of new & exciting systems needed, so this approach means that the best developers are the ones building them.

FromTheFirstIn 51 days ago

Great as long as no one ever leaves, but the second someone does, suddenly I’m being punished for owning their idiocy. And people are always leaving.

DiskoHexyl 50 days ago

Especially the kinds of people who tend to create such monstrosities: they either move up or move on (to the next victim).

nitwit005 50 days ago

The concept of the responsible party bearing the costs is a good one, but if we're honest about who that is, it's often going to be company leadership.

The person who made the breaking change is often diligently following instructions to get it done as soon as possible.

_doctor_love 50 days ago

That’s an extremely game-centric point of view. Game devs, more than just about anyone else, strongly identify with their code and have an artist's attitude about it. In a non-game environment the psychology is different.

netcoyote 50 days ago

That's a great point, and a fair criticism!

jongjong 51 days ago

Yes, this is unfortunately completely true, but it's not the only way.

A good honest approach is just to build a few complex but essential tools so that other engineers have to keep coming back to you. It's a good way to stay relevant. You become really good at identifying misuses of that particular tool and it makes you look way smarter than you are when you can identify bugs in other people's code in mere seconds. This tends to happen naturally as you become more familiar with all the common gotchas that people tend to run into when using your tool.

Ideally you want your tools to be reliable and useful but complex... That way, whenever other devs run into bugs while using your tool, they keep coming back to you and you can point out their mistakes. The mistakes must almost always be on their side for the strategy to work; this is key. Your code has to be rock solid.

If they find a genuine bug in your code, hopefully a small edge case, you have to be very humble and apologetic about it and you should praise the developer in the team meeting for identifying this complex bug.

This approach is better than getting credit for fixing your own buggy code; that only works with management and junior devs but other senior engineers will hate you.

The approach of building complex but reliable tools gets you credit over and over (often much more than twice) and the approval you get from other devs eventually finds its way to managers' ears. Smart leaders know that this is a better signal than flashy demos.

The leaders who just dish out praise onto specific devs for producing prototypes quickly tend to learn their mistake sooner or later. Many young founders go through this phase, though, when they praise superficialities.

derangedHorse 51 days ago

The game theory should be that teams that recurringly lose customers due to issues will be punished accordingly. If they aren’t, then maybe the problems that result from shipping fast don’t impact customer retention as much as one might think.

JohnMakin 51 days ago

Not necessarily. Not every job is shipping features that are visible to customers, or even to management.

You see this pattern of "make fire, put it out, get rewarded" a lot on devops type teams, almost always by the lead (IME). Often it is very difficult to determine customer impact of these types of events, especially if monitoring/alerting is lacking (very common), and even if it isn't, often these same teams have the ability to turn those knobs any way they want anyway.

Arainach 51 days ago

This works across small entities like companies with distinct customers and budgets.

It does not work for large corporations with pools of billions of dollars and various incentives to stay within the ecosystem. It's impossible to measure the contribution of one feature team to perception and retention of something like "Microsoft Intune" or "Google Chrome", and without the ability to measure that, there is no effective check on those teams.

jiggawatts 51 days ago

The most spectacular instance of this I've seen is Jeffrey Snover getting demoted for "forcing" PowerShell onto Microsoft. Meanwhile, from a customer perspective it's the only good thing about Windows Server and the only reason I haven't pushed for 100% Linux adoption everywhere I work!

See: https://corecursive.com/building-powershell-with-jeffrey-sno...

Noumenon72 49 days ago

What a story. Thanks for the link.

PrimalPower 51 days ago

Exactly. The game theory is that everyone makes more as a "Senior" or "Mid-Level" in a wealthy/successful org than as a "Staff" or "Senior" at a poorer one with fewer customers.

Of course, game theory implies "infinite games" and of course the real world doesn't operate like that.

And large bureaucratic orgs collapse under their own weight, and the enshittification is the norm despite the number of paying customers.

rgavuliak 46 days ago

Ultimately the management should care about what brings in the most revenue. A feature that brings low revenue but has no fire is something no-one cares about. If you have a ton of revenue and no bugs, you likely can get recognition.

dj_axl 51 days ago

Thread.sleep(100000) everywhere until it breaks. Then, lo and behold, you have fought fires long and bravely until midnight on Friday after the release. Don't ask me why it's rewarded, and, of course, every now and then they switch to rewarding different things.

jiggawatts 51 days ago

Correct multi-threaded code is... sss... hard.

Much easier to liberally sprinkle mutex locks and "Thread.Sleep(1000); // Quick fix" everywhere until the problems almost always go away!

Meanwhile the guy screaming that this is eldritch madness and can't ever work is "not a team player" because the guy that wrote the code was a hero for applying yet another layer of band aids to the gaping wounds.
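To make the contrast concrete, here is a minimal Python sketch (an analogue, not the code anyone in this thread is describing) of replacing a "sleep and hope" wait with an explicit completion signal. The names `compute` and `run_with_signal` are made up for illustration.

```python
import threading


def compute(n, result, done):
    """Worker: store the sum of 0..n-1, then signal completion explicitly."""
    result["value"] = sum(range(n))
    done.set()


def run_with_signal(n):
    result = {}
    done = threading.Event()
    worker = threading.Thread(target=compute, args=(n, result, done))
    worker.start()

    # The "quick fix" would be time.sleep(1) here, hoping the worker is done.
    # Waiting on the event stays correct however long the work takes.
    done.wait()
    worker.join()
    return result["value"]


print(run_with_signal(1_000_000))  # 499999500000
```

The sleep version only works while the worker happens to finish inside the guessed delay; the event version has no such timing assumption.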

yurishimo 51 days ago

That might work until someone else in the org gets curious and checks the git history… I suppose some clever obfuscation might be enough to get around that but then at that point you’re basically writing malware for your own product…

rightbyte 51 days ago

> get no credit for not having such outages because you can't measure that.

Well from a philosophical point I would argue that you can measure the weight of nothing too.

johnbcoughlin 50 days ago

The quoted section is about stepping in early with the right knowledge to fix an incident that's in progress. That is, putting out the fire.

Arainach 50 days ago

It begins with `preventing or` before talking about mitigating/fixing. There's no such thing as "preventing early" in firefighting.

prolly97 51 days ago

With the way you're framing your opposition, I agree with you. But I'd like to add some nuance. Part of building a product or a set of features is about search, rather than great engineering. Sometimes it's better to build two good-enough features to figure out which one is valuable to the user, rather than building one solid* one. I've always been in the "let's fuck around and find out" camp. I appreciate that someone with a different attitude built git! Just saying that there's a balance here, which will depend on the degree to which you're in the middle of a search problem.

*solid in a pure engineering sense - availability, maintainability, chance of leaking the users' nudes etc.

tayo42 51 days ago

Not totally true. You can measure page amounts per team or how heavy oncall is.
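As a rough sketch of what "measuring pages per team" could look like, assuming a paging log is available as (team, date) pairs: the data, team names, and the function `pages_before_after` below are all invented for illustration. It only shows a before/after trend; it does not by itself say what caused the change.

```python
from collections import Counter
from datetime import date

# Hypothetical paging log, invented for illustration: (team, date of page).
PAGES = [
    ("framework", date(2024, 1, 5)),
    ("framework", date(2024, 1, 19)),
    ("framework", date(2024, 3, 2)),
    ("partner", date(2024, 1, 7)),
    ("partner", date(2024, 2, 20)),
    ("partner", date(2024, 3, 11)),
]


def pages_before_after(pages, cutoff):
    """Count pages per team before the cutoff and on/after it."""
    before, after = Counter(), Counter()
    for team, day in pages:
        (before if day < cutoff else after)[team] += 1
    teams = sorted(set(before) | set(after))
    return {team: (before[team], after[team]) for team in teams}


print(pages_before_after(PAGES, cutoff=date(2024, 2, 1)))
# {'framework': (2, 1), 'partner': (1, 2)}
```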

Arainach 51 days ago

You can't prove that "this work caused us to not get paged" versus "that work is unnecessary and you wouldn't have been paged regardless".

Even when you can, you can't prove the impact. As a real example, our team has extensive presubmit infrastructure to catch and block some classes of configuration error that have caused customer data corruption in the past. There have been CLs which were caught by those presubmits and meant that we didn't have outages, but there's no dollar amount tied to an outage that didn't exist.

Meanwhile, team X did something similar that caused data corruption, had N customers affected for such a period of time, scrambled to root cause, roll back, and restore from backups, getting customers back up and online. Look how responsive and great they are!
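For readers unfamiliar with the idea, a presubmit check of the kind mentioned above might look roughly like this Python sketch. The rules, config keys (`storage`, `backup_enabled`, `replicas`), and function names are hypothetical, not the actual infrastructure described in this comment.

```python
def check_config(config):
    """Return a list of problems; an empty list means the change may be submitted."""
    problems = []

    # Hypothetical rule: persistent storage must have backups enabled.
    if config.get("storage") == "persistent" and not config.get("backup_enabled", False):
        problems.append("persistent storage requires backup_enabled: true")

    # Hypothetical rule: replicas must be a positive integer (bools are rejected).
    replicas = config.get("replicas")
    if isinstance(replicas, bool) or not isinstance(replicas, int) or replicas < 1:
        problems.append("replicas must be an integer >= 1")

    return problems


def presubmit(config):
    """Print each problem and return True only if the change is allowed through."""
    problems = check_config(config)
    for problem in problems:
        print(f"BLOCKED: {problem}")
    return not problems


presubmit({"storage": "persistent", "replicas": 0})
```

Counting how often such a check blocks a change gives a number, but as the comment says, not a dollar amount.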

tayo42 51 days ago

You can have before and after data and track trends. How did you know the issue was widespread in the first place? You must have some proof somewhere.

The impact is how many outages overall. If you only prevent one outage then maybe it's not that meaningful.

On your last paragraph, you're right that it happens in the short term. In the long term those teams get reputations for being a shit show, there will be high turnover, good engineers won't transfer in, people's competencies start to get questioned, other teams will avoid working with that team and develop their own solutions, and higher up people will start to look at what's going on.

Arainach 51 days ago

> those teams get reputations for being a shit show,

Reputations with who? The VPs who rotate in and out every few years (if you're lucky enough to go a few years between reorgs) for a new title and salary bump?

> there will be high turnover, good engineers won't transfer in,

On the contrary, many people want to work on the team that gets visibility, where people can actually get promoted rather than having to justify their existence constantly.