How do you folks monitor external dependencies at scale
42 points
by abhishekash
1642 days ago
Most software companies have external dependencies such as cloud platforms, APMs, and functional/business integrations. How do you folks monitor these external dependencies for errors, latency, etc.? RSS feeds and the like exist, but there are days when AWS is slow to update its status page, and CDN latency is not that obvious in the middle of the night when you get paged. It makes me wonder what other folks are doing about this.
I've found many issues with providers this way, often before they even knew. It's also helped inform decisions to migrate to alternative providers or services, since we can measure what the improvements would actually be rather than relying on hand-waving and marketing materials.
This is all pretty easy stuff, but it requires discipline and the resources to invest in instrumenting everything. You need some level of buy-in from leadership, and it's all the more difficult if you have a toilsome ops or on-call rotation. If you are large enough and can afford it, I recommend empowering at least one reliable engineer to solve the problem across the stack.
The real problems come when you're operating a service you don't really own (i.e. a vendor's) and there are issues with how it interacts with something else. The only real solution, aside from getting the thing fixed or abandoning it, is to shim or proxy the dependencies so that you can instrument them as a black box. For example, if your vendor gives you a .jar that you configure to use S3, run a local proxy for S3 as a sidecar and collect stats there. This is a contrived example, but the concept should be clear. Often you can't even do this, as vendors hardcode stuff like AWS, and forget it if you're using something managed like Databricks or Snowflake.