by cookiecaper
3807 days ago
Everyone takes a turn being on-call in one-week cycles. We have a day on-call shift and a night on-call shift. Any infrastructure alarms go to the on-call person, who is expected to address the issue. That can mean escalating it to the person who knows the most about it if the severity warrants that, or deciding it's not important and silencing it for some time period. On-call works the problem until the crisis is mitigated. If something can be addressed the next day during normal hours, it is. On-call tickets take priority over everything else. In our company, "on-call" just means you get the infrastructure alarms for that week and time period, and gotta deal with them when you get them. There is no change in pay for on-call weeks. If the current on-call can't respond within 10 minutes (or forgets to ack in PagerDuty), the next person on the escalation schedule is notified and expected to deal with the issue. Other teams' risk is mitigated because if their stuff is breaking production, we just call them and make them fix it. The whole company is "always on-call" in some sense, because if your thing is breaking production, you're going to get a call and get asked to fix the problem. We feel the tooling for this is subpar, even with PagerDuty, whose whole job is managing on-call, and we are constantly talking about building an in-house replacement.
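
For anyone curious what the 10-minute escalation rule looks like in practice, here is a rough Python sketch. It is not PagerDuty's API or our actual config; `Alert`, `notified_responders`, and the names in the schedule are all made up for illustration. It also assumes each further level gets the same 10-minute window and that escalation stops at the end of the list, which is a guess rather than something stated above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# From the comment: the on-call has 10 minutes to respond or ack.
ACK_TIMEOUT = timedelta(minutes=10)


@dataclass
class Alert:
    """Hypothetical alert record: when it fired and when (if ever) it was acked."""
    raised_at: datetime
    acked_at: Optional[datetime] = None


def notified_responders(alert: Alert, escalation_order: List[str], now: datetime) -> List[str]:
    """Return everyone who has been paged for this alert as of `now`.

    Assumption: every level gets the same ACK_TIMEOUT before the next
    person is paged, and an ack stops any further escalation.
    """
    # Escalation stops at the ack time, or keeps running until `now`.
    end = now if alert.acked_at is None else min(alert.acked_at, now)
    elapsed = max(end - alert.raised_at, timedelta(0))
    # One extra person is paged for each full timeout window that passed.
    levels = elapsed // ACK_TIMEOUT + 1
    return escalation_order[:levels]


if __name__ == "__main__":
    raised = datetime(2024, 1, 1, 3, 0)
    order = ["primary", "secondary", "manager"]  # hypothetical schedule
    unacked = Alert(raised_at=raised)
    print(notified_responders(unacked, order, raised + timedelta(minutes=12)))
```

The point of modelling it as "who has been paged so far" is that forgetting to ack looks exactly the same as not responding: the clock keeps running and the next person gets woken up.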