This reminds me of Steve Yegge's Google Platforms rant, which I'm sure I've posted and reposted before:
monitoring and QA are the same thing. You'd never think so until you try doing a big SOA. But when your service says "oh yes, I'm fine", it may well be the case that the only thing still functioning in the server is the little component that knows how to say "I'm fine, roger roger, over and out" in a cheery droid voice. In order to tell whether the service is actually responding, you have to make individual calls. The problem continues recursively until your monitoring is doing comprehensive semantics checking of your entire range of services and data, at which point it's indistinguishable from automated QA. So they're a continuum.
In smaller projects I've worked on, I'm not willing to expend the level of effort required for truly comprehensive monitoring, but I try to do some kind of end-to-end test of the whole system as part of the monitoring -- something that requires all components to be alive and working -- so that I'm aware of issues at least as fast as the people using the system.
Some shops call this "synthetic monitoring": the tests are a subset of a full integration suite, small enough to run at short, regular intervals, say every 10 or 15 minutes. They cover "the happy path" for some typical use cases.
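For what it's worth, here's roughly what such a happy-path check could look like in Python. This is only a sketch: `run_happy_path`, `CheckResult`, and the step names are made up for illustration, and each step is assumed to be a function that raises an exception when something is wrong (in a real setup those would be calls against your actual service).

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CheckResult:
    ok: bool
    failed_step: Optional[str]   # name of the first step that raised, if any
    error: Optional[str]         # message from that step's exception
    elapsed_seconds: float


def run_happy_path(steps: List[Tuple[str, Callable[[], None]]]) -> CheckResult:
    """Run the steps in order and stop at the first one that raises."""
    start = time.monotonic()
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            return CheckResult(False, name, str(exc), time.monotonic() - start)
    return CheckResult(True, None, None, time.monotonic() - start)


if __name__ == "__main__":
    # Hypothetical stand-ins for real calls (log in, create an order, read it back).
    orders = {}

    def create_order() -> None:
        orders["42"] = {"status": "new"}

    def read_order_back() -> None:
        if orders.get("42", {}).get("status") != "new":
            raise RuntimeError("order 42 not readable after create")

    result = run_happy_path([("create_order", create_order),
                             ("read_order_back", read_order_back)])
    print(result)
```

You'd call something like this from cron or a scheduler every 10 or 15 minutes and alert when `ok` is false. Reporting the first failed step is a deliberate choice here: later steps usually depend on earlier ones, so their failures are mostly noise.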
This monitoring shouldn't replace other system-level monitoring, which can alert on abnormal memory, disk space, error rates, etc.
There's also an analogous split to unit testing vs integration/system testing. You want system-wide monitoring to give the strongest guarantee that a service is actually up and available to customers, but you also want monitoring on each component so you can pinpoint the source of failures more quickly.
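A small sketch of how the two layers might be combined, in Python. The `diagnose` function and the component names are hypothetical; each check is assumed to be a callable returning True when healthy, and a check that raises is treated as failing.

```python
from typing import Callable, Dict, List, TypedDict


class Diagnosis(TypedDict):
    service_up: bool
    failing_components: List[str]


def _safe(check: Callable[[], bool]) -> bool:
    """Treat an exception inside a check the same as an unhealthy result."""
    try:
        return bool(check())
    except Exception:
        return False


def diagnose(end_to_end: Callable[[], bool],
             components: Dict[str, Callable[[], bool]]) -> Diagnosis:
    """The system-wide check says whether customers are affected;
    the per-component checks say where to start looking."""
    failing = sorted(name for name, check in components.items()
                     if not _safe(check))
    return {"service_up": _safe(end_to_end), "failing_components": failing}


if __name__ == "__main__":
    # Hypothetical checks; real ones would query the database, cache, etc.
    report = diagnose(
        end_to_end=lambda: False,
        components={"database": lambda: True, "cache": lambda: False},
    )
    print(report)
```

The end-to-end result decides whether to page someone; the component list is just the hint about where to look first.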
Author here. It's been a while since I wrote this and we've learned a ton more about what good API monitoring looks like. So if you have any questions, let me know!
https://gist.github.com/chitchcock/1281611