|
|
|
|
|
by viraptor
1561 days ago
|
|
It's not just a hard problem, it's impossible to go from metrics to automatic announcements. Give me a detected scenario and I'll give you an explanation for seeing errors which are unrelated to the service health as seen by customers. From basic "our tests have bugs", to "reflected DDoS on internal systems causes rate limit on test endpoint", to "deprecated instance type removal caused a spike in failed creations", to "bgp issues cause test endpoints failures, but not customer visible ones", to... You can't go from a metric to diagnosis for a customer - there's just no 1:1 mapping possible, with errors going both ways. AWS sucks with their status delays, but it's better than seeing their internal monitoring. |
|
I remember one night a long time ago while working at Google, I was trying a new approach for loading some data involving an external system. To get the performance I needed, I was going to be making a lot of requests to that service, and I was a little concerned about overloading it. I ran my job and opened up that service's monitoring console, poked around, and determined that I was in fact overloading it. I sent them an email and they added some additional replicas for me.
Now I live in the world of the cloud where I pay $100 per X requests to some service, and they won't even tell me if the errors it's returning are their fault or my fault? Not acceptable.