|
I'm responding _before_ reading the comments, because this is something I have some fairly strong opinions on. I see two discrete issues here. I'll post one here, and one in another comment because I'm apparently long-winded today :).
First, as a developer, you _will_ break things. There is absolutely no getting around that - it will happen. You can't choose not to break things. You _can_ choose to learn from it. I've broken production systems many times, and will break them in the future.
I've been writing software for sixteen years now; I break things less often than I did when I was more junior, but I wouldn't say "frequency" is the biggest difference that comes with experience. Instead, I tend not to break things in the same ways. When I review my work (and the work of teammates) I do so with the critical eye of someone who's seen how things are likely to break. I also look for ways that things are likely to go wrong and pre-plan - to the extent that's possible - how they can be fixed. Finally, I do everything I can to make it easy to troubleshoot and recover when things (inevitably) go wrong. I have a very thorough and "intentional" plan for when things are on fire, and I always pause before taking a remediation step.
For example, I once deleted a major project on a scientific publishing platform. The company was very new, and while we had backups, the recovery plan had never been exercised and significant work would have been lost if I'd used it. In this case, I was intending to update a single nested attribute on a MongoDB object, but used the wrong operator and replaced the entire document with the single attribute I wanted to update. As soon as I saw what happened, I took a deep breath and made sure my next action was intentional. My first instinct was to call over the founder and tell them what happened. I estimated that it would take about 5-10 minutes to explain the issue, so I considered my other options.
Restoring from backup would take too long, but mounting a backup on another DB was something I could trigger in a few minutes. I set that aside for the moment as well, but made a mental note to initiate it if I thought I might need it later. Then I looked at what I had in my shell. I scrolled back and found that I had read the document in the Mongo shell a few minutes before performing the update... and the entire document was right there in the output!
I took a picture of the screen with my phone. Yes, that's silly, but I wanted to make sure that I didn't misclick or something and completely lose the output. Then I took a screenshot on my MBP, and then I copied the output from the terminal and pasted it into a new text document in a new terminal. I saved that, then opened it in a third shell and verified that it was in fact there.
In another shell, I used Vim to build the command I would use to restore the object. It looked good. I connected to my local database, pasted the command, and executed it. It worked! I verified that the data looked like it should, and it did. I then pasted the command into production, visually verified that it was correct, and executed it. It worked!
The total time from "Oh crap!" to "Whew!" was less than five minutes. As I mentioned, it would have taken me at least that long to call over a more senior person and bring them up to speed. I stand by my decision not to bring others into the emergency immediately for that reason, but that doesn't mean that I was "hiding" anything. Once I saw that it had been fixed, I sent the founder a Slack message to the effect of "I just broke Project X by overwriting the document in MongoDB. I believe I have resolved it, but please verify it thoroughly ASAP. I'm writing a summary of what happened right now." I then wrote the summary and sent it to them. That summary ended up being the basis for a presentation that I gave at the following all-hands meeting. |
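To make the mistake and the recovery a bit more concrete, here's a rough sketch in plain Python. It is not the command I actually ran, and it does not use MongoDB or any driver: the two helper functions only imitate, on ordinary dicts, the difference between a replacement-style update (which discards every field except `_id`) and a targeted `$set`-style update on a dotted path. The function names, the `staging`/`production` dicts, and the sample document are all made up for illustration.

```python
import copy


def replace_document(doc, new_doc):
    """Imitates a whole-document replacement: only _id survives."""
    return {"_id": doc["_id"], **copy.deepcopy(new_doc)}


def set_field(doc, dotted_path, value):
    """Imitates a targeted update: change one nested field, keep the rest."""
    updated = copy.deepcopy(doc)
    *parents, leaf = dotted_path.split(".")
    node = updated
    for key in parents:
        # Walk down to the parent of the field, creating levels if missing.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


def restore_with_rehearsal(saved_copy, staging, production):
    """Write the saved copy to a scratch store, verify it, then to production."""
    doc_id = saved_copy["_id"]
    staging[doc_id] = copy.deepcopy(saved_copy)
    if staging[doc_id] != saved_copy:
        raise RuntimeError("rehearsal failed; production left untouched")
    production[doc_id] = copy.deepcopy(saved_copy)
    # Final check: production now matches the copy we saved earlier.
    return production[doc_id] == saved_copy


# Hypothetical document standing in for the real project record.
project = {"_id": 1, "title": "Project X",
           "meta": {"status": "draft", "owner": "sam"}}

# What I meant to do: touch only the nested attribute.
patched = set_field(project, "meta.status", "published")

# What I effectively did: the update payload became the whole document.
clobbered = replace_document(project, {"meta": {"status": "published"}})

# Recovery: rehearse on a scratch store, then apply to "production".
staging, production = {}, {1: clobbered}
restored_ok = restore_with_rehearsal(project, staging, production)
```

The point of `restore_with_rehearsal` is the ordering rather than the code itself: the saved copy is the source of truth, the fix is proven somewhere harmless first, and production is only touched once the rehearsal checks out.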
Let's talk about the cultural aspect first. You said:
I think you're on the right track here. You're recognizing that the people "above" you in the organization have different motivations, and allowing the possibility that you may just be wrong in how you view this because you are operating with a different perspective on the system as a whole.
That said... "we need to prioritize features over stability" is a huge red flag to me. In my experience it generally means that the people in control of the company are measuring the wrong things; they're prioritizing increasing the effectiveness of your sales funnel by trying to satisfy everyone when they should be either focusing on ARR (annual recurring revenue) by making your existing customers happy or changing their customer acquisition strategy to get potential customers with the "right" problem into the funnel in the first place.
Basically, de-emphasizing stability in order to grow will quickly lead to problems with retention. It's a totally valid choice to do it for short periods, but if you go too far it's very hard to come back.
> My boss also is correct when they say that these issues happen less than 1% of the time. I don't feel like I have buy-in to make any changes to this process. Alerts are ignored, and some people don't even have PagerDuty set up, their alerts go nowhere.
My gut says this is the heart of the issue here. I think you care deeply about the people who use your product, and that you feel like the people around you don't.
If I were you, I would start preparing to leave the company if necessary. Once I was confident that I had a place to go, I'd stretch myself and try to influence the people around me to improve quality and change the focus to sustainable growth. It might work. If not, while it's possible they'd just fire me because of the pushback, it's much more likely that I'd just get to the point where I realized that this wasn't the place for me.