| * expected duties (only answer on-call, do other work, etc) The expected duty is to solve the problem, but there are often escalation paths to take. If a problem is not solved within the required SLA period, the company can be forced to pay the client penalty fees. * how deep does your on-call dive into code to fix, vs triaging, patching, and creating follow up work to fix permanently? They go as deep as necessary, or as deep as the operational manual for that particular client/environment dictates. * priority order of work (on-call tickets vs regular work vs existing on-call tickets vs special queue of simple tasks) On-call always has higher priority, since it's an add-on service that clients pay for. * what happens when the current on-call can't complete in time? See above: mostly penalty fees. * how do you manage for other teams' risk? (ie their api goes down, you can't satisfy your customers) I'm not sure I understand the question. If an API goes down and affects our services, then that API needs to be monitored and handled by our on-call team. * any other tidbits I'm not on the on-call team, but I stay available for specialized expertise if the 1st line can't solve an issue. I know how they work, though, so here's one example. It all depends on the client's SLA, but let's say the client has 99% uptime and 24/7 on-call duties in their contract. In that case five people share an on-call device (phone), with each person carrying it one week in every five. Under the strictest SLAs they're required to respond within 15 minutes and sometimes to have a solution within 4 hours. This varies wildly from contract to contract. An incident manager is also available, with redundancy, and is tasked with coordinating skills between departments to solve an issue within the designated SLA. Alerts come in to the device, and the tech can respond to them directly from the device to acknowledge, force a re-check, or simply disable them. There is also a more full-featured web interface for accessing the alerts via a browser.
Alerts are sent by SMS through a self-hosted gateway that is attached directly to the monitoring server, without any e-mail translation API. Alerts are logged, and in some cases a ticket is created for an alert. Preferably a manager should work out the on-call schedule, but techs often trade weeks and are more than capable of handling it themselves. They receive added monthly compensation, including overtime, so any work must be reported in a time-reporting utility; it eventually leads to paid overtime, depending on the contract it pertained to and the time when it happened. |
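
To put numbers on the example above, here is a rough sketch of the arithmetic behind a five-person weekly rotation, a 99% uptime commitment, and the 15-minute / 4-hour deadlines. The roster names, function names, and the 30-day measurement period are assumptions made for illustration; real contracts define their own periods and deadlines.

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Hypothetical roster: the answer only says that five people share the phone.
ROSTER = ["tech_a", "tech_b", "tech_c", "tech_d", "tech_e"]

# Strictest values mentioned in the answer; they vary per contract.
RESPONSE_SLA = timedelta(minutes=15)
RESOLUTION_SLA = timedelta(hours=4)


def on_call_for_week(week_index: int, roster: list[str] = ROSTER) -> str:
    """Return who carries the on-call phone in a given week (0-based)."""
    return roster[week_index % len(roster)]


def allowed_downtime(uptime_percent: float, period: timedelta) -> timedelta:
    """Downtime budget implied by an uptime percentage over a period."""
    return period * (1 - uptime_percent / 100)


def sla_deadlines(alert_time: datetime) -> tuple[datetime, datetime]:
    """Return (respond_by, resolve_by) for an alert raised at alert_time."""
    return alert_time + RESPONSE_SLA, alert_time + RESOLUTION_SLA
```

With these assumptions, `allowed_downtime(99.0, timedelta(days=30))` comes to roughly 7 hours 12 minutes per 30 days, and `on_call_for_week(5)` wraps back around to the first person on the roster.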