|
Well, in many places DevOps is implemented as "developers on PagerDuty". When I (the developer) have to be on call for 7-day rotations, phone by the bedside, paged at all hours, then I'm most definitely acting as operations, which is probably NOT what I signed up for. And, contrary to the stated intentions, I've directly observed developers making crappy, band-aid fixes to ongoing production problems in the interest of "making the pages stop". This is the mindset when you are on call and being paged at all hours. In theory, DevOps is supposed to put those who can best fix things closest to the problems, but in reality a slight separation from the firestorm of ops actually produces better, more thoughtful solutions in the long run. The best balance is to have a first-tier ops on-call, a second-tier engineering on-call, and a rule that any alerting issue gets attention within 24 hours, moving to the front of the work queue. But indiscriminately assigning everyone "pager-duty" rotations leads to lower-quality solutions in the end. |
https://news.ycombinator.com/item?id=7575875
• It increases pager coverage, and reduces any one person's pager obligations. Simply having pager anticipation is a mental burden after a while.
• It creates a stronger incentive for response procedures: what are the expected obligations of response staff, what's considered sufficient effort, what's the escalation policy, who is expected to participate, what are consequences of failure to respond?
• Cross-training. Eng learns ops tasks, ops has a better opportunity for learning what eng is up to and deals with.
• It makes engineering more aware of the consequences of their actions. Is insufficient defensive engineering causing outages (say, unlimited remote access to expensive operations)? Are alerts, notification mails, and/or monitoring/logging obscuring rather than revealing anomalous conditions? Are mechanisms for adjusting, repairing, updating, and/or restarting systems complex and/or failure-prone themselves?
My experience at one site: I was a recent hire (and hence unfamiliar with policies, procedures, and capabilities) when systems went down starting at 2am. I was unable to raise engineering or my manager, and the response at the next staff meeting to my observation of this was pretty much "so what". That did not endear the organization to me (I left it shortly afterward).
Note that what I'm calling for isn't for eng to be the sole group on pager duty, but for eng and ops to share that responsibility.
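For what it's worth, the shared arrangement discussed above (ops paged first, engineering second, and a 24-hour follow-up on anything that alerted) can be written down as a small escalation policy. This is only a rough sketch: the tier names, the 15- and 30-minute acknowledgement windows, and the helper functions are all made up for illustration, and only the 24-hour follow-up figure comes from the quoted comment.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Tier:
    name: str                # hypothetical rotation name
    ack_timeout: timedelta   # time this tier has to acknowledge before escalating


# Assumed policy: ops is paged first, engineering second.
# The timeouts are illustrative, not taken from the discussion.
POLICY = [
    Tier("ops-oncall", timedelta(minutes=15)),
    Tier("eng-oncall", timedelta(minutes=30)),
]

# From the quoted comment: alerting issues get attention within 24 hours.
FOLLOW_UP_DEADLINE = timedelta(hours=24)


def tier_to_page(unacked_for: timedelta, policy=POLICY) -> str:
    """Return the tier that should be paged after `unacked_for` without an ack."""
    waited = timedelta(0)
    for tier in policy:
        waited += tier.ack_timeout
        if unacked_for < waited:
            return tier.name
    # Every tier has timed out: keep paging the last one.
    return policy[-1].name


def follow_up_overdue(alert_age: timedelta) -> bool:
    """True if the alert's root cause has gone unaddressed past the deadline."""
    return alert_age > FOLLOW_UP_DEADLINE
```

Writing the policy down like this also forces answers to the questions in the list above: who is expected to respond, for how long, and what happens when nobody does.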