So ideally you have some kind of monitoring that shows how many services are alive (and where they live in the cluster), how many errors they generate, and so on. Then, based on some thresholds, you can take failing instances out of circulation and let them cool down. If certain kinds of errors occur, or occur at a certain frequency, the system can notify a site reliability engineer (or equivalent) to check it out. They can then decide whether the service should be permanently removed, and log an internal support ticket for the developers or product teams.
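To make the threshold-and-cool-down idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `ServiceHealth` class, its field names, and the default numbers are hypothetical, not taken from any particular monitoring product. It counts errors in a sliding window, pulls the instance for a cool-down period once the threshold is reached, and returns `True` at that moment so the caller can notify whoever is on call.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceHealth:
    """Tracks recent errors for one service instance (hypothetical model)."""
    error_threshold: int = 5        # assumed: errors tolerated per window
    window_seconds: float = 60.0    # assumed: sliding window length
    cooldown_seconds: float = 30.0  # assumed: time spent out of circulation
    error_times: list = field(default_factory=list)
    cooling_until: float = 0.0

    def record_error(self, now: float) -> bool:
        """Record an error at time `now`.

        Returns True if this error pushed the instance over the threshold,
        which is the point where a real system might notify an SRE.
        """
        # Keep only errors that still fall inside the sliding window.
        self.error_times = [
            t for t in self.error_times if now - t < self.window_seconds
        ]
        self.error_times.append(now)
        over_limit = len(self.error_times) >= self.error_threshold
        if over_limit and now >= self.cooling_until:
            # Take the instance out of circulation and start fresh.
            self.cooling_until = now + self.cooldown_seconds
            self.error_times.clear()
            return True
        return False

    def in_circulation(self, now: float) -> bool:
        """True once the cool-down period (if any) has elapsed."""
        return now >= self.cooling_until
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch easy to test. A load balancer or service registry would call `in_circulation` before routing traffic to the instance.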
Production issues are a part of life. You need some visibility into issues and their severity. Every company and tech stack is different, and what you need also depends on SLAs and uptime promises.
Ads not rendering in an app might be less severe than a pump failure at a fuel station, so they have different kinds of monitoring and reaction times to faults. Obviously things like hospitals, banks, and airlines/aircraft manufacturers have way different requirements and infrastructure from, say, a system that manages all school libraries for a state/province.
There are too many products and approaches to mention here, if you were looking for a list of those. I have one or two favorite approaches and a handful of tools for this kind of stuff, half of which are homemade, so not something you can google. But you can google the topic and see a few different approaches. "microservices monitoring java" or "microservices monitoring best practice" or something along those lines will get you on a path. Try to find 5 different approaches and reflect on what each one is missing or how it may help you, and then ponder what you would like to see from a reporting system with hundreds/thousands of services.
And then obviously the best lessons will come from production itself.
Only if you accept them. The alternative is to do very few, rigorously tested releases per year. This way you don't have production issues. That's how industries like banking make sure bank transfers and card payments work and people's money is not randomly lost... It's a shame many other industries just accept their product failing for users as something normal/inevitable.
I can't say my experience echoes your comment. I'm a former employee of a financial services (billing) company built around a mainframe code base started in the 70s. We probably qualify as the sort of business you had in mind with your comment.
We did four releases a year, across the entire organization (so mainframe and more modern platforms), on Saturday nights/early Sunday mornings. There was plenty of testing but there were still plenty of errors only found on the day of, and rushed to fix in the wee hours or daylight hours of Sunday morning.
The only thing that seemed to correlate with release quality was the overall risk of the release, i.e. the complexity and number of new features written during that quarter.
> We did four releases a year, across the entire organization (so mainframe and more modern platforms), on Saturday nights/early Sunday mornings. There was plenty of testing but there were still plenty of errors only found on the day of, and rushed to fix in the wee hours or daylight hours of Sunday morning.
This way, you had bugs in prod for less than a day once every quarter, as opposed to having buggy prod all the time, as is common in organizations doing Continuous Deployment.
Of course. Even the Space Shuttles blew up, twice. I'm guessing even pacemakers and software in nuclear power plants have bugs. The point is, these things are exceedingly rare or have very limited scope (they occur only in the most obscure corner cases and also do limited damage), while in web companies that adopted Continuous Deployment, serious bugs are just common and, I think, seen as part of life.
Work in healthcare where we have heavily tested, quarterly releases. Well, we had a release today and some stuff was pretty horribly broken, despite being so heavily tested. We didn't adequately load test one piece of the new release under production-like conditions. Oops. Thankfully the fix was simple and a hotfix only took a couple of hours in total. Yet another lesson learned.
That's pretty bad, but nonetheless you detected and fixed it very quickly. Compare that to lingering bugs in the Twitter iOS client (it's just broken on iPhone 5s, I guess they simply don't test on that device anymore), or happy random bugs in Windows 10 that appear after they CD an update to their users.
Then you get the worst of both worlds. You are in an industry where few very well tested releases are needed to meet SLA and customer expectations, but you have enough of the company looking at entirely different industries and wanting to follow their pipeline instead.
Sometimes, though thankfully less frequently (and for a less-disastrous definition of "serious") than I used to.
Luckily, a good CI/CD pipeline makes reversions just as easy as deployments. So even when you have errors, they're easier to correct than if you suddenly discovered "our deployment bash script / ansible playbook isn't as reversible as we thought it was".
Rarely. All features are gated by feature flags with the capability to dial up the feature gradually and dial down the launch instantly. I can monitor if the feature launch is going as expected by monitoring errors and metrics in the logs.
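For anyone who hasn't seen a dial-up/dial-down flag before, a rough sketch in Python follows. The `FeatureFlag` class, its method names, and the hashing scheme are all hypothetical choices made for illustration, not the API of any real feature-flag service. Each user is hashed into a stable bucket from 0 to 99, and the feature is on for users whose bucket is below the current rollout percentage.

```python
import hashlib


class FeatureFlag:
    """Percentage-based feature gate (hypothetical, for illustration)."""

    def __init__(self, name: str, rollout_percent: int = 0):
        self.name = name
        self.rollout_percent = rollout_percent

    def _bucket(self, user_id: str) -> int:
        # Hash flag name + user id so each user lands in a stable
        # bucket in the range 0..99, independent of other flags.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled(self, user_id: str) -> bool:
        # Users whose bucket is below the dial see the feature.
        return self._bucket(user_id) < self.rollout_percent

    def dial(self, percent: int) -> None:
        # Clamp to 0..100; dial(0) switches the feature off for everyone.
        self.rollout_percent = max(0, min(100, percent))


flag = FeatureFlag("new-checkout")  # assumed flag name
flag.dial(10)                       # start with roughly 10% of users
flag.dial(0)                        # instant dial-down if errors spike
```

Because the bucket is derived from a hash rather than a random draw, a user who saw the feature at 10% still sees it at 50%, which keeps the error and metric comparison between dial steps meaningful.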
Good luck!