| I have a lot of APIs for different apps that I've built for customers. And most of the time it is when the customer calls me I get to know that the API is down! Maybe after some data changes, or some random code changes. I'm now having nightmares about API services going down. Do you continuously monitor your APIs and get notified when something goes down? (It's not just like monitoring a website using Pingdom, but the actual data responses, like for example check if a particular JSON field exists in the response) |
1. Heartbeat, which checks the service is responding. Used for HTTP monitoring services to make sure the route is up mainly.
2. Heartbeat which checks the service is up and validates all database connections are up and I can get data from the database (usually a trivial query on a small table). Used as the primary detection for internal failures.
3. For any 3rd party services I depend on I setup a heartbeat endpoint for them that will check the service is up (but not necessarily giving me good data). Usually I group them, sometimes I group and separate them under like /heartbeat/services, /heartbeat/service1, /heartbeat/service2. Sometimes you can validate the service is returning good data but not all the time is it easy to do that, so I do what I can.
4. I setup a 3rd party service to monitor the heartbeats and the return code to validate they are up and properly returning what I expect, notify me if not. I don't have to do sophisticated response processing at the 3rd party service because I can just use http return codes 99% of the time. The detailed response checking is done at the heartbeat level, then a response code generated. And of course, any failure to respond shows too.
This is still not perfect, but it has proven to make sure we know before anyone else when something fails. I still have one product that we haven't converted to this process right now but we are migrating to a new version that has these checks so it will help me sleep better the faster that happens.
One key thing is don't make the check interval too crazy, the general http is used a lot for the load balancers, but the others are spread out a lot more to reduce creating artificial load. When we build an independent service (microservice etc) I make sure they have these same checks, although it might not be http based. But since they have the same basic methodology a service watcher can remove any instance from the registry if a check fails after some configured number of failures & retries etc.
*edit a few words