Also looking for readings on best practices. Curious about your company's on-call, specifically:

* expected duties (only answer on-call, do other work, etc.)
* how deep does your on-call dive into code to fix, vs. triaging, patching, and creating follow-up work to fix permanently?
* priority order of work (on-call tickets vs. regular work vs. existing on-call tickets vs. special queue of simple tasks)
* what happens when the current on-call can't complete in time?
* how do you manage other teams' risk? (i.e., their API goes down and you can't satisfy your customers)
* any other tidbits
Why? It says a lot when a company doesn't put the effort into various forms of testing and QA to ensure that production software does not have critical issues that warrant a 2am call. Unit, functional, integration, load, and simulation tests should be written for every single piece of critical infrastructure. You should be hammering these things in staging environments with 10-100x your normal peak load. Use something like Gor to replay live traffic against a version in staging or QA environments. Yes, that takes work, but to me it's better to do it upfront than to be woken up in the middle of the night, or to know that when I go home I have to keep my phone near me at all times. The business should care about these things too; it's their product, and they should care enough about you to make sure good processes are in place to ensure quality production software.
That said, when I was at non-on-call companies, there were definitely times when something happened that warranted immediate attention. Generally someone in operations would get the first call, check the logs, diagnose the issue, and, if an app turned out to be the cause, call a developer familiar with it. I don't mind waking up in that case because I know it has to be something serious that slipped past our processes.
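To make the "10-100x your normal peak load" point a bit more concrete, here is a rough Python sketch of one way to turn a recorded request log into a heavier staging schedule: keep the same requests but squeeze the gaps between them. This is only an illustration, not how any particular replay tool works; `compress_schedule`, `peak_per_second`, and the `(timestamp_ms, path)` log format are all made-up names and assumptions.

```python
from typing import List, Tuple

# Assumed log format: (timestamp in milliseconds, request path).
Request = Tuple[int, str]


def compress_schedule(recorded: List[Request], multiplier: int) -> List[Request]:
    """Rebase the log to start at 0 and shrink every gap by `multiplier`,
    so the same requests arrive `multiplier` times faster."""
    if multiplier < 1:
        raise ValueError("multiplier must be at least 1")
    ordered = sorted(recorded)
    if not ordered:
        return []
    start = ordered[0][0]
    return [((ts - start) // multiplier, path) for ts, path in ordered]


def peak_per_second(schedule: List[Request]) -> int:
    """Largest number of requests inside any one-second sliding window."""
    times = sorted(ts for ts, _ in schedule)
    best = left = 0
    for right, ts in enumerate(times):
        # Drop requests that are a full second (or more) older than this one.
        while ts - times[left] >= 1000:
            left += 1
        best = max(best, right - left + 1)
    return best


# Hypothetical recording: one request per second for a minute.
recorded = [(1_000 * i, "/checkout") for i in range(60)]
staged = compress_schedule(recorded, multiplier=10)
print(peak_per_second(recorded), peak_per_second(staged))  # 1 10
```

Compressing time keeps the real mix and ordering of requests, which a synthetic load generator would not give you, but it also shortens the test, so for a long soak you would probably loop the recording as well.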