by goatinaboat
2243 days ago
> What happened here is that the admins measured the thing that is easy for them to measure: the load.
Nah, you have it completely backwards. If the users said “this specific job took 5 minutes today but was only 1 minute yesterday”, that’s actionable: you can, e.g., look at what changes were deployed overnight. But users always say “the system is slow”, even if they have only the vaguest idea of what “the system” is, and even if it’s actually faster than yesterday. It’s not really clear what any sysadmin can do other than spend hours every day painfully extracting the details from the user, only to find nothing is wrong. Every day, forever.
That's not true. It's just that most sysadmins don't bother to upskill to find out what they can and should be doing.
> painfully extracting the details from the user
Asking users for any information is a recipe for disaster. Much like witnesses to a murder that can't agree on the most basic details, users inevitably conflate totally unrelated things. E.g.:
"Citrix is slow?"
"Okay, how so... are button presses slow to respond to a click?"
"I couldn't log on. Something to do with my password. It's slow."
"ಠ_ಠ"
So don't ask. Don't rely on your users at all. Build synthetic transaction tests that act like users. Measure end-to-end latency. Sit down with them and watch them work. Don't rely on their verbal feedback, use your own eyes. Use your tools. Measure. Then measure some more.
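To make "synthetic transaction test" concrete, here's a rough sketch of the idea, not what we actually ran. The names (`run_probe`, `summarize`, `slow_after_s`) and the 2-second threshold are made up for illustration; `transaction` stands in for whatever scripted user action you care about (log in, open a document, submit a form).

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProbeResult:
    ok: bool
    latency_s: float
    error: Optional[str] = None


def run_probe(transaction: Callable[[], None], slow_after_s: float = 2.0) -> ProbeResult:
    """Run one scripted user action and time it end to end."""
    start = time.perf_counter()
    try:
        transaction()
    except Exception as exc:
        # Any exception counts as a failure, just as it would for a real user.
        return ProbeResult(False, time.perf_counter() - start, repr(exc))
    elapsed = time.perf_counter() - start
    if elapsed > slow_after_s:
        # It finished, but slower than a user would tolerate (assumed threshold).
        return ProbeResult(False, elapsed, "too slow")
    return ProbeResult(True, elapsed)


def summarize(results: List[ProbeResult]) -> dict:
    """Reduce a batch of probe runs to the numbers worth graphing."""
    latencies = [r.latency_s for r in results]
    failures = sum(1 for r in results if not r.ok)
    return {
        "runs": len(results),
        "failure_rate": failures / len(results),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }
```

Run something like this on a schedule from where the users actually sit (same network, same client build) and you get a failure rate and a latency trend without asking anyone anything.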
Conversely, capacity metrics are largely irrelevant in the era of 10 Gbps networks and 64-core server CPUs. Focus on latency. Look for delays. Timeouts. Deadlocks. Firewall packet drops. That kind of thing.
> only to find nothing is wrong. Every day, forever.
Of course something is wrong! Something is practically always wrong, that's why the users are complaining!
Here's a fun rule of thumb for you: For every 1 user that complained, there are between 100 and 1,000 that had the same issue but shrugged it off and didn't call support.
I got that from a scientific paper. I couldn't believe it, so I measured it in a large 10K user system. The error-to-call ratio was about 500-800 to 1 in ours. It blew my mind, and it blew the minds of a lot of people in IT management.
We started gathering every error, tracking every possible latency measurement we could, and it was a horror show. 30K app crashes per day. I shit you not. That's about 3 per user per day! Data loss. Hangs. Login failure rate of nearly 50%.
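If you want to sanity-check the arithmetic, it's trivial. The function names here are made up, and the ten-calls figure is a hypothetical input, not something we measured.

```python
def per_user_rate(events_per_day: float, users: int) -> float:
    """Average number of events each user sees per day."""
    return events_per_day / users


def estimated_incidents(support_calls: int, ratio_low: int, ratio_high: int) -> tuple:
    """Range of real incidents implied by a number of support calls,
    given an assumed error-to-call ratio range."""
    return support_calls * ratio_low, support_calls * ratio_high


# 30K crashes per day across 10K users
crashes_per_user = per_user_rate(30_000, 10_000)  # 3.0

# Hypothetical: ten support calls at the 500-800 ratio measured above
low, high = estimated_incidents(10, 500, 800)  # (5000, 8000)
```

So a "quiet" helpdesk day with ten calls is consistent with thousands of users hitting the problem.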
It took months to triage the issues, push patches, and apply workarounds. We had to rewrite several components. We eventually got the errors down to less than a hundred per day. Believe me, that was a real achievement.
Users were so happy they were begging to be migrated to the new system instead of pushing back and refusing to upgrade.
If the users are complaining, something is probably very wrong and you just don't know it. Go look.