I've written a few guides on this. Some quick pointers:

- You build it, you run it. If your team wrote the code, your team ensures the code keeps running.

- Continuously improve your on-call experience. Your on-call staff shouldn't be on feature work during their shift. When they aren't responding to alerts, their job is to improve the on-call experience.

- Good processes make a good on-call experience. In short, keep and maintain runbooks/standard operating procedures.

- Have a primary on-call and a secondary on-call. If your team is big enough, a secondary on-call (essentially, someone who responds to alerts only during business hours) can help train up newbies and improve the on-call experience even faster.

- Hand over between your on-call engineers. A regular mid-week meeting to pass the baton to the next team member ensures ongoing investigations continue and nothing falls between the cracks.

- Pay your staff. On-call is additional work, so pay your staff for it (in some jurisdictions, you are legally required to).

More: https://onlineornot.com/incident-management/on-call/improvin...
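
To make the primary/secondary and mid-week handover points concrete, here is a minimal sketch of a weekly rotation in Python. Everything in it is an assumption for illustration: the roster names, the choice of Wednesday as the handover day, the `epoch` start date, and the idea that the secondary is simply the next person in line to become primary (one way to use the secondary slot for training).

```python
from datetime import date, timedelta

# Hypothetical roster; the names are placeholders.
TEAM = ["alice", "bob", "carol", "dave"]

# Assumed "mid-week" handover day: Wednesday (Monday is 0).
HANDOVER_WEEKDAY = 2


def shift_start(day: date) -> date:
    """Return the most recent handover day on or before `day`."""
    return day - timedelta(days=(day.weekday() - HANDOVER_WEEKDAY) % 7)


def on_call(day: date, team=TEAM, epoch: date = date(2024, 1, 3)):
    """Return (primary, secondary) for the shift covering `day`.

    `epoch` is an assumed first handover day (a Wednesday). The secondary
    is the next person in the rotation, so they take over as primary at
    the following handover.
    """
    shifts_elapsed = (shift_start(day) - epoch).days // 7
    primary = team[shifts_elapsed % len(team)]
    secondary = team[(shifts_elapsed + 1) % len(team)]
    return primary, secondary
```

Calling `on_call(date.today())` returns the pair for the current shift. Because each secondary becomes the next primary, the handover meeting doubles as a briefing for someone who has already shadowed the previous week.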