| Here's my personal experience as an engineer at a FAANG company, while trying to be very pragmatic from a resources point of view. Do these in the given order:
1. Add monitoring wherever you can. Not just exceptions, but also latencies, handled success/failure statuses, and user flows (from clicking on x to saving data on y, etc.). Create meaningful dashboards and alerts on them.
2. Add a release canary process. If you have sufficient traffic, do an A/B test rollout: have the load balancer send 10% of traffic to the new binary and 10% to the old binary, and make it user sticky so the same user doesn't get bounced between different binaries. You'd need to be mindful of caching (so version your static assets, etc.).
3. Add tests to critical workflows. This part is usually the hardest to do and takes the longest to pay off, but it's critical. However, only do this if your user flows are more or less settled and changes are limited. Also, before doing 3, it'd be useful to identify the kind of bugs you see: are they client-side bugs or server-side bugs? Client-side bugs are a pain in the ass overall, as UX changes can render tests useless, but try to test critical components using a realistic user flow. And as each bug shows up, definitely add a regression test for it (this would have caught some of the bugs that you said "repeat"). Remember, you need to adopt the "Swiss cheese model"[A] for catching bugs. Each layer will catch something, with (1) being the catch-all, and the more layers you have the better, but only over time. The guilt goes away once you start catching bugs before the client notices them, or as soon as the bug happens. And as you do (3) and (2), your confidence will improve. Bugs are a fact of life: no human is perfect, and software is complex. The more complex it is from a user-interaction point of view, the higher the likelihood of a bug. Don't sweat it. Keep at it, keep improving over the mid and long term, and they'll happen less. [A] https://en.wikipedia.org/wiki/Swiss_cheese_model |
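To make the "user sticky" part of step 2 concrete, here is a minimal sketch of one way to do it: hash the user ID into a stable bucket and route on that bucket. This is not tied to any particular load balancer. The names `bucket_for`, `binary_for`, `CANARY_PERCENT`, and the salt value are all made up for illustration, and the sketch assumes a 10% canary share with everyone else staying on the old binary.

```python
import hashlib

# Assumed share of users routed to the new binary (hypothetical value).
CANARY_PERCENT = 10


def bucket_for(user_id: str, salt: str = "release-canary") -> int:
    """Map a user ID to a stable bucket in [0, 100).

    The same (salt, user_id) pair always hashes to the same bucket,
    which is what keeps a user on one binary across requests.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def binary_for(user_id: str, canary_percent: int = CANARY_PERCENT) -> str:
    """Return "new" for users inside the canary share, "old" otherwise."""
    return "new" if bucket_for(user_id) < canary_percent else "old"


if __name__ == "__main__":
    # Repeated requests from one user land on the same binary.
    print(binary_for("user-42"), binary_for("user-42"))
```

Because the bucket depends only on the salt and the user ID, no session state is needed to keep a user pinned. Raising `canary_percent` only ever moves users from old to new, and changing the salt reshuffles who lands in the canary.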