by davismwfl 2258 days ago
Yes, for every API I build I set up a few specific heartbeat-style endpoints. I have done this since I was a consultant, and I still do it at the company I am at now.

1. A heartbeat that checks the service is responding. Used mainly by HTTP monitoring services to make sure the route is up.

2. A heartbeat that checks the service is up, validates that all database connections are up, and confirms I can get data from the database (usually a trivial query on a small table). Used as the primary detection for internal failures.

3. For any 3rd party services I depend on, I set up a heartbeat endpoint that checks the service is up (but not necessarily giving me good data). Usually I group them; sometimes I both group and separate them, e.g. /heartbeat/services, /heartbeat/service1, /heartbeat/service2. Sometimes you can validate that the service is returning good data, but that is not always easy, so I do what I can.

4. I set up a 3rd party service to monitor the heartbeats and their return codes, to validate they are up and returning what I expect, and to notify me if not. I don't have to do sophisticated response processing at the 3rd party service because I can just use HTTP return codes 99% of the time. The detailed response checking is done inside the heartbeat endpoint, which then generates a response code. And of course, any failure to respond shows up too.
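To make the tiers above concrete, here is a minimal sketch in Python using only the standard library. It is not the setup described in the post, just one way it could look: sqlite3 stands in for the real database, the service URLs are placeholders, the /heartbeat/db path is an assumed name, and `check_database`, `check_third_party`, `heartbeat` and `serve` are hypothetical helpers. The detailed checking happens inside the endpoint, and the external monitor only needs the status code (200 or 503).

```python
import json
import sqlite3
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder third-party dependencies (assumed names and URLs).
THIRD_PARTY = {
    "service1": "https://service1.example.com/status",
    "service2": "https://service2.example.com/status",
}


def check_database(conn):
    """Tier 2: a trivial query proves the database actually answers."""
    try:
        conn.execute("SELECT 1").fetchone()
        return True
    except sqlite3.Error:
        return False


def check_third_party(url, timeout=3.0):
    """Tier 3: the service answered with a 2xx; says nothing about data quality."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def heartbeat(path, conn, probe=check_third_party):
    """Map a heartbeat path to (HTTP status, JSON-serialisable body)."""
    if path == "/heartbeat":  # tier 1: the route is up
        return 200, {"status": "ok"}
    if path == "/heartbeat/db":  # tier 2 (assumed path name)
        ok = check_database(conn)
        return (200 if ok else 503), {"database": ok}
    if path == "/heartbeat/services":  # tier 3, grouped
        results = {name: probe(url) for name, url in THIRD_PARTY.items()}
        return (200 if all(results.values()) else 503), results
    name = path.rsplit("/", 1)[-1]
    if path == f"/heartbeat/{name}" and name in THIRD_PARTY:  # tier 3, single
        ok = probe(THIRD_PARTY[name])
        return (200 if ok else 503), {name: ok}
    return 404, {"error": "unknown heartbeat"}


def serve(conn, port=8080):
    """Expose the heartbeats over HTTP (single-threaded, blocks forever)."""

    class HeartbeatHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = heartbeat(self.path, conn)
            payload = json.dumps(body).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    HTTPServer(("127.0.0.1", port), HeartbeatHandler).serve_forever()
```

Calling `serve(sqlite3.connect(":memory:"))` would start it locally; an external monitor can then poll each path and alert on anything other than a 200, or on no response at all.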

This is still not perfect, but it has reliably let us know before anyone else when something fails. I still have one product that we haven't converted to this process yet, but we are migrating it to a new version that has these checks, and the sooner that happens the better I will sleep.

One key thing: don't make the check interval too aggressive. The basic HTTP heartbeat is hit a lot by the load balancers, but the other checks are spaced much further apart to avoid creating artificial load. When we build an independent service (microservice etc.), I make sure it has these same checks, although they might not be HTTP-based. Because they all follow the same basic methodology, a service watcher can remove any instance from the registry once a check has failed a configured number of times, retries included.
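The service-watcher idea can be sketched roughly as follows. This is an illustration under assumptions, not the actual watcher: `ServiceWatcher`, its `check` callback and the `max_failures` default are hypothetical, and the check itself can be anything that returns true or false, HTTP-based or not.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ServiceWatcher:
    """Drops an instance from the registry after too many consecutive failed checks."""

    check: Callable[[str], bool]  # returns True when the instance is healthy
    max_failures: int = 3  # assumed default threshold
    # instance name -> number of consecutive failed checks
    registry: Dict[str, int] = field(default_factory=dict)

    def register(self, instance: str) -> None:
        self.registry[instance] = 0

    def poll_once(self) -> List[str]:
        """Check every instance once; return the instances removed in this pass."""
        removed = []
        for instance in list(self.registry):  # copy, since we delete while iterating
            if self.check(instance):
                self.registry[instance] = 0  # a success resets the failure count
            else:
                self.registry[instance] += 1
                if self.registry[instance] >= self.max_failures:
                    del self.registry[instance]
                    removed.append(instance)
        return removed
```

A scheduler would call `poll_once()` on whatever interval suits each check, keeping the heavier checks on a longer interval so they do not create artificial load.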

*edit a few words

1 comment

Sounds good. Do you use any platform to set up this monitoring?

I have used lots of different ones over the years. Right now I am using one I don't want to mention, simply because we won't be staying on it ourselves (and I won't recommend something I won't use).

That said, there are lots of services that do it well. The key, of course, is that the service itself has to be reputable and solid. I'm not knocking anyone's homespun version, but your monitoring is only as reliable as their service. This is why we are going to move ourselves again.

One I have used in the past, "uptime", worked well for a couple of my clients and was reliable and stable.