|
In my experience, the best way to monitor is to have passive monitoring (error rate, latency, response time, throughput) for all dependencies across all touch points, and then active monitoring (health checks, acks/nacks) for the things performing the passive monitoring, which are usually your own services or applications. After that, you usually want to set some sort of anomaly watermarks, either manually from a baseline or with one of the many anomaly detection solutions available. I've found many issues with providers this way, often before they even knew. It's also helped inform decisions to migrate to alternative providers or services, because we could measure what the improvements would actually be rather than relying on hand-waving and marketing materials. This is all pretty easy stuff, but it requires discipline and the resources to invest in instrumenting everything. You need some level of buy-in from leadership, and it's all the more difficult if you have a toilsome ops or on-call rotation. If you are large enough and can afford it, I recommend empowering at least one reliable engineer to solve the problem across the stack. The real problems come when you're operating a service you don't really own (i.e. a vendor's) and there are issues with how it interacts with something else. The only real solution, aside from getting the thing fixed or abandoning it, is to shim or proxy its dependencies so that you can instrument it as a black box. For example, if your vendor gives you a .jar that you configure to use S3, run a local proxy for S3 as a sidecar and collect stats there. This is a contrived example, but the concept should be clear. Often you can't even do this, as vendors hardcode stuff like AWS, and forget it if you're using something managed like Databricks or Snowflake. |
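To make the passive monitoring and watermark idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `DependencyMonitor` class, its parameters, and the "mean plus N standard deviations" watermark are hypothetical, not any particular tool's API. A real setup would more likely export these numbers to a metrics system and alert from there.

```python
import statistics
import time
from collections import deque


class DependencyMonitor:
    """Hypothetical passive monitor for one dependency (illustrative only)."""

    def __init__(self, name, window=100, sigma=3.0, min_samples=20):
        self.name = name
        self.latencies = deque(maxlen=window)  # rolling baseline, in seconds
        self.outcomes = deque(maxlen=window)   # True = success, False = error
        self.sigma = sigma
        self.min_samples = min_samples

    def high_watermark(self):
        """Baseline mean + sigma * stdev, or None until enough samples exist."""
        if len(self.latencies) < self.min_samples:
            return None
        mean = statistics.fmean(self.latencies)
        stdev = statistics.pstdev(self.latencies)
        return mean + self.sigma * stdev

    def record(self, latency, ok=True):
        """Store one observation; return True if it breached the watermark."""
        mark = self.high_watermark()
        breached = mark is not None and latency > mark
        # Note: breaching samples still enter the baseline in this sketch.
        self.latencies.append(latency)
        self.outcomes.append(ok)
        return breached

    def error_rate(self):
        """Fraction of failed calls in the current window."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def call(self, fn, *args, **kwargs):
        """Wrap a dependency call, timing it and recording success or failure."""
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.record(time.monotonic() - start, ok=False)
            raise
        self.record(time.monotonic() - start, ok=True)
        return result
```

You would wrap every call to a provider in something like `monitor.call(client.get_object, ...)` (a placeholder call) and alert when `record` starts returning True or `error_rate()` climbs. The same bookkeeping could live inside a sidecar proxy for the black-box vendor case.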