Hacker News
by Arainach 5 hours ago
This is a good post, but once again incentives rear their ugly head.

> Second, preventing or mitigating an incident early (even by just knowing the right feature flag to turn off) can save huge amounts of money: both immediate lost revenue during the incident and future lost revenue from customers who would have pulled their business or refused to sign pending contracts.

Time and time again at many companies, including well-reputed ones, I have seen that preventing issues gets you no recognition, but building a giant pile of kindling and then putting out the inevitable fire will get you recognition twice. Even in "good" orgs.

I've never been able to commit to the game-theory politics enough to intentionally ship garbage fast and take that credit; I take too much pride in my work. But I have spent 5+ years managing and growing a framework designed to eliminate classes of issues that plagued the last version of our product. I have watched partner teams who ship garbage code and cause outages get public credit for fixing those outages, while my team, despite attempting to advocate for itself, gets no credit for not having such outages, because you can't measure that.

6 comments

The game theory should be that teams that repeatedly lose customers due to issues will be punished accordingly. If they aren’t, then maybe the problems that result from shipping fast don’t impact customer retention as much as you might think.
Not necessarily. Not every job is shipping features that are visible to customers, or even to management.

You see this pattern of "make fire, put it out, get rewarded" a lot on devops type teams, almost always by the lead (IME). Often it is very difficult to determine customer impact of these types of events, especially if monitoring/alerting is lacking (very common), and even if it isn't, often these same teams have the ability to turn those knobs any way they want anyway.

This works across small entities like companies with distinct customers and budgets.

It does not work for large corporations with pools of billions of dollars and various incentives to stay within the ecosystem. It's impossible to measure the contribution of one feature team to perception and retention of something like "Microsoft Intune" or "Google Chrome", and without the ability to measure that, there is no effective check on those teams.

The most spectacular instance of this I've seen is Jeffrey Snover getting demoted for "forcing" PowerShell onto Microsoft. Meanwhile, from a customer perspective, it's the only good thing about Windows Server and the only reason I haven't pushed for 100% Linux adoption everywhere I work!

See: https://corecursive.com/building-powershell-with-jeffrey-sno...

Exactly. The game theory is that everyone makes more as a "Senior" or "Mid-Level" in a wealthy/successful org than as a "Staff" or "Senior" at a poorer one with fewer customers.

Of course, game theory implies "infinite games" and of course the real world doesn't operate like that.

And large bureaucratic orgs collapse under their own weight, and the enshittification is the norm despite the number of paying customers.

Thread.sleep(100000) everywhere until it breaks. Then lo and behold, you have fought fires long and bravely until midnight on Friday after the release. Don't ask me why it's rewarded, and, of course, every now and then they switch to rewarding different things.
That might work until someone else in the org gets curious and checks the git history… I suppose some clever obfuscation might be enough to get around that but then at that point you’re basically writing malware for your own product…
Correct multi-threaded code is... sss... hard.

Much easier to liberally sprinkle mutex locks and "Thread.Sleep(1000); // Quick fix" everywhere until the problems almost always go away!

Meanwhile the guy screaming that this is eldritch madness and can't ever work is "not a team player" because the guy that wrote the code was a hero for applying yet another layer of band aids to the gaping wounds.
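
To make the contrast with sleep-based "fixes" concrete, here is a minimal Python sketch of waiting on an explicit signal instead of sleeping and hoping the other thread is done. It is not from the thread; the names `run_with_event`, `worker`, `ready`, and `result` are made up for illustration.

```python
import threading

def run_with_event():
    """Hand a value from a worker thread to the caller using an explicit signal."""
    result = {}
    ready = threading.Event()

    def worker():
        result["value"] = 42   # stand-in for the real work
        ready.set()            # tell the waiter the value is now available

    t = threading.Thread(target=worker)
    t.start()
    # Block until the worker signals; the timeout is a safety net, not the mechanism.
    signalled = ready.wait(timeout=5.0)
    t.join()
    return signalled, result.get("value")
```

The waiter proceeds as soon as the worker signals, so correctness does not depend on guessing how long the work takes.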

> get no credit for not having such outages because you can't measure that.

Well, from a philosophical point of view, I would argue that you can measure the weight of nothing too.

Yes, this is completely true, unfortunately, but it's not the only way.

A good honest approach is just to build a few complex but essential tools so that other engineers have to keep coming back to you. It's a good way to stay relevant. You become really good at identifying misuses of that particular tool and it makes you look way smarter than you are when you can identify bugs in other people's code in mere seconds. This tends to happen naturally as you become more familiar with all the common gotchas that people tend to run into when using your tool.

Ideally you want your tools to be reliable and useful but complex... That way, whenever other devs run into bugs while using your tool, they keep coming back to you and you can point out their mistakes. The mistakes must almost always be on their side for the strategy to work; this is key. Your code has to be rock solid.

If they find a genuine bug in your code, hopefully a small edge case, you have to be very humble and apologetic about it and you should praise the developer in the team meeting for identifying this complex bug.

This approach is better than getting credit for fixing your own buggy code; that only works with management and junior devs but other senior engineers will hate you.

The approach of building complex but reliable tools gets you credit over and over (often much more than twice) and the approval you get from other devs eventually finds its way to managers' ears. Smart leaders know that this is a better signal than flashy demos.

The leaders who just dish out praise onto specific devs for producing prototypes quickly tend to learn their mistake sooner or later. Many young founders tend to go through this phase, though, where they praise superficialities.

With the way you're framing your opposition, I agree with you. But I'd like to add some nuance. Part of building a product or a set of features is about search, rather than great engineering. Sometimes it's better to build two good-enough features to figure out which one is valuable to the user, rather than building one solid* one. I've always been in the "let's fuck around and find out" camp. I appreciate that someone with a different attitude built git! Just saying that there's a balance here, which will depend on the degree to which you're in the middle of a search problem.

*solid in a pure engineering sense - availability, maintainability, chance of leaking the users' nudes etc.

Not totally true. You can measure the number of pages per team or how heavy oncall is.
You can't prove that "this work caused us to not get paged" versus "that work is unnecessary and you wouldn't have been paged regardless".

Even when you can, you can't prove the impact. As a real example, our team has extensive presubmit infrastructure to catch and block some classes of configuration error that have caused customer data corruption in the past. There have been CLs which were caught by those presubmits and meant that we didn't have outages, but there's no dollar amount tied to an outage that didn't exist.

Meanwhile, team X did something similar that caused data corruption, had N customers affected for such a period of time, scrambled to root cause, roll back, and restore from backups, getting customers back up and online. Look how responsive and great they are!
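
One partial answer to the "no dollar amount" problem is to at least count what the presubmits block. A rough sketch follows, assuming a hypothetical config format (a dict with `replicas` and `backup_enabled` keys) and made-up rules; none of this comes from the commenter's actual system.

```python
def presubmit_errors(config):
    """Return a list of rule violations for a (hypothetical) service config."""
    errors = []
    if config.get("replicas", 0) < 2:
        errors.append("replicas must be at least 2")
    if not config.get("backup_enabled", False):
        errors.append("backups must be enabled")
    return errors

def count_blocked(changes):
    """Count how many proposed configs the presubmit would have blocked."""
    return sum(1 for config in changes if presubmit_errors(config))
```

The count shows that the check fires on real changes, though it still does not put a cost on the outage that never happened.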

You can have before and after data and track trends. How did you know the issues were widespread in the first place? You must have some proof somewhere.

The impact is how many outages overall. If you only prevent one outage then maybe it's not that meaningful.
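
As a sketch of what "before and after data" could look like, assume a hypothetical page log of `(team, day)` tuples, where `day` is an integer day number. The function names and log format are invented for illustration.

```python
from collections import Counter

def pages_per_team(pages, start_day, end_day):
    """Count pages per team with start_day <= day < end_day."""
    return Counter(team for team, day in pages if start_day <= day < end_day)

def change_after(pages, cutoff_day, window):
    """Per-team page count in the window after the cutoff minus the window before it."""
    before = pages_per_team(pages, cutoff_day - window, cutoff_day)
    after = pages_per_team(pages, cutoff_day, cutoff_day + window)
    return {team: after[team] - before[team] for team in set(before) | set(after)}
```

A negative number means a team was paged less after the cutoff. That shows a trend, not a cause, which is the objection raised elsewhere in the thread.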

On your last paragraph: you're right that this happens in the short term. In the long term those teams get reputations for being a shit show, there will be high turnover, good engineers won't transfer in, people's competencies start to get questioned, other teams will avoid working with that team and develop their own solutions, and higher-up people will start to look at what's going on.

> those teams get reputations for being a shit show,

Reputations with who? The VPs who rotate in and out every few years (if you're lucky enough to go a few years between reorgs) for a new title and salary bump?

> there will be high turnover, good engineers won't transfer in,

On the contrary, many people want to work on the team that gets visibility, where people can actually get promoted, rather than having to justify their existence constantly.