by parasense 634 days ago
This sounds like a classic, stereotypical IT problem. First off, that's not a bad thing; it's just new to you. Luckily there are mountains of best practices for addressing this issue. Picking one feather from the big pile, I'd say your situation screams Problem Management.

https://wiki.en.it-processmaps.com/index.php/Problem_Managem...

Your on-call folks need a way to be free of the broader problem analysis so they can focus on putting out the fires. The folks in problem management then take the steps to prevent those problems from recurring.

Once upon a time I was into Problem Management, and one issue that kept coming up was server OS patching, where Linux systems crashed on reboot after a new kernel etc. had been applied. The customers were blaming us, we were blaming the customers, and round and round it went. Anyhow, the new procedure was something like this: any time routine maintenance would result in the machine rebooting (e.g. kernel updates), the whole system had to be rebooted first, before any changes, to prove it was viable for upgrades. Lo and behold, machines belonging to a certain customer had a tendency not to recover after the pre-reboot. That would stop the upgrade window in its tracks, and I would be given a ticket for the next day to investigate why the machine was unreliable. Hint: a typical problem was Oracle admins playing god with /etc/fstab, among many other shenanigans. We eventually got that company to a place where the tier-2 on-call folks could have a nice life outside of work.
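
For what it's worth, the pre-reboot gate described above can be sketched roughly like this. It's a minimal illustration, not the tooling we actually used: `run_maintenance_window` and the `reboot`, `is_healthy`, `apply_updates`, and `open_ticket` callables are all hypothetical names standing in for whatever your environment provides.

```python
from typing import Callable


def run_maintenance_window(
    host: str,
    reboot: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    apply_updates: Callable[[str], None],
    open_ticket: Callable[[str, str], None],
) -> str:
    """Reboot first to prove the host recovers, and only then patch it.

    All four callables are hypothetical hooks supplied by the caller.
    """
    # Step 1: reboot BEFORE changing anything. If the host does not come
    # back, the fault predates the maintenance and is not caused by it.
    reboot(host)
    if not is_healthy(host):
        open_ticket(host, "failed pre-reboot; upgrade skipped")
        return "aborted"

    # Step 2: the host is known to survive a reboot, so apply the updates
    # and reboot into the new kernel.
    apply_updates(host)
    reboot(host)
    if not is_healthy(host):
        open_ticket(host, "failed to recover after updates")
        return "failed_after_update"

    return "patched"
```

The point of the design is attribution: an "aborted" result means the machine was already broken before anyone touched it, so the ticket goes to problem management instead of turning into a blame argument during the upgrade window.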

But I digress...

> Opex ...

Usually that term means "Operational Expenditure", as opposed to "Capex", or Capital Expenditure. It's your terminology, so it's fine, but I'd NOT use it that way with anybody publicly. You might get strange looks.

I'd say give one or two of the on-call folks a block of a few hours each week to think of ways to kill recurring issues. Let them take turns, and give them concrete incentives to achieve results, something like a $200 bonus per resolved problem. That leads us to the next issue, which is monitoring and logging. If you hired consultants to come in tomorrow and you didn't even have stats, there's nothing anybody could do.
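
Even very crude stats are a start. As a rough sketch, assuming you can export your incidents as (host, cause) pairs, something like the following would surface the repeat offenders. The `recurring_issues` helper and the sample data are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def recurring_issues(
    incidents: Iterable[Tuple[str, str]], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return (cause, count) pairs seen at least min_count times,
    most frequent first."""
    counts = Counter(cause for _host, cause in incidents)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


# Made-up sample data: (host, cause) pairs from an on-call log.
sample = [
    ("db1", "fstab"),
    ("db2", "fstab"),
    ("web1", "disk full"),
    ("db1", "fstab"),
    ("web2", "disk full"),
    ("web3", "oom"),
]

print(recurring_issues(sample))  # [('fstab', 3), ('disk full', 2)]
```

A list like that gives whoever has the weekly problem-management block an obvious place to begin.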

Good luck