by foz 4615 days ago
At my company we've been using Graphite and StatsD for nearly two years now, and we rely on them heavily for tracking performance and troubleshooting issues. We rely on Icinga, Pingdom, NewRelic and other tools to alert us to problems.

Often, when things have gone really wrong (DoS, internal network issues, app errors, disk full) the affected machine(s) stop reporting to Graphite (or under-report data). We get alerted by monitoring the services, not the stats.

Being alerted about low or unusual values might be helpful in some cases, but in my experience it would be too noisy. Usually when something bad happens, we investigate Graphite and our analytics tools anyway to understand the impact on traffic and KPIs.

I could see Rearview being useful for some cases, but not as a replacement for real monitoring and alerting tools.

3 comments

We use NewRelic and Pingdom as well. Where Rearview really shines is creating monitors like this:

1) Control charts that alert when a process moves outside a range of 3 standard deviations above or below the mean, based on historical data (e.g. purchases/logins are lower than expected, process failures are higher than expected, etc.).

2) Deployment-triggered monitors that automatically analyze data before and after a deploy for shifts in mean or increases in variance (e.g. do we see more login failures after this deploy, do we see more 4xx/5xx responses, did page load time increase, etc.).

3) Response time monitors. While this seems straightforward enough, Rearview can not only tell you when a service or page response time has exceeded some statistical limit, it can also present you with more information regarding causes (e.g. this process is slow because of an issue with the database, redis, a dependent process/service, etc.).

4) It allows you to use SPAN as a means of monitoring load time or response time (SPAN is the 95th percentile minus the 5th percentile, and it gives a much more accurate representation of what users experience than the mean or median).

5) Process efficiencies can be checked by making sure they complete on time and execute the expected number of commands (e.g. sent email, updated databases, etc.).

And many more. Basically you are only limited by your imagination and coding skills. Of course the other benefit is in performing similar monitoring on business metrics and not just application performance (e.g. is the funnel performing as expected/needed, are our customer tools being used on a regular basis, are our marketing campaigns paying off, etc.).
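To make items 1 and 4 a bit more concrete, here is a rough sketch in Python of a 3-sigma control chart check and the SPAN calculation. This is not Rearview's API or its monitor code; the function names (`control_limits`, `out_of_control`, `span`) and the sample numbers are made up for illustration, and I'm assuming the historical values have already been pulled out of Graphite into a plain list.

```python
import statistics


def control_limits(history, k=3.0):
    """Return (lower, upper) limits at k sample standard deviations
    around the mean of the historical values."""
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)  # sample standard deviation
    return mean - k * sd, mean + k * sd


def out_of_control(history, value, k=3.0):
    """True if the new value falls outside the control limits."""
    lower, upper = control_limits(history, k)
    return value < lower or value > upper


def span(samples):
    """SPAN as described above: 95th percentile minus 5th percentile."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(samples, n=20, method="inclusive")
    return cuts[-1] - cuts[0]


if __name__ == "__main__":
    # Hypothetical logins-per-minute history and two new observations.
    logins = [100, 102, 98, 101, 99, 100, 103, 97]
    print(control_limits(logins))
    print(out_of_control(logins, 80))   # far below the usual range
    print(out_of_control(logins, 101))  # within the usual range

    # Hypothetical response times in milliseconds.
    response_ms = list(range(1, 102))
    print(span(response_ms))
```

A real monitor would also have to decide how much history to use and how to deal with seasonality (weekday vs. weekend traffic, for instance), which this sketch ignores.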
In my currently non-existent free time, I'm a Graphite co-maintainer (check GitHub). If you have any improvements or suggestions, please feel free to send us pull requests. The current pull requests are a bit of a mess, but I blame myself and will be getting around to merging a ton of them "real soon now TM".
Thank you for your work on Graphite. For all its UI strangeness and quirks, it is a great solution that a lot of people love (myself included).

I'll peek at the pull requests and see if my company might be able to contribute some help.

Rearview complements these services, and is not intended as a replacement for them. While there is overlap, the scope is different.

Pingdom will tell you that your engine just threw a rod. Rearview will tell you your rods are knocking before that happens.