|
So I think the mathematics to make this work is not the problem. How do you engineer this, though? If I were to try to build a platform that could do this in real time for, let's say, a million metrics per minute, could I engineer something that scales horizontally? Can it be done by cobbling together the various open-source tools and libraries currently out there? And how would you present the results so that someone who's not necessarily "mathematically inclined", say your typical operational support person, could meaningfully interpret whatever your system is spitting out? For me that's the hard part: getting those two components working well. Make it scale, make it idiot-friendly. If you can't get those parts right, it doesn't matter what you're trying to do. I say this because I've spent the last 6 years in the application performance management space, and "the best" way to handle alarms at the moment is to put together a team, literally a team, of people and have them hand-tune thresholds by looking at a combination of history, incidents/outages and their root causes, and input from domain specialists (like DBAs or application server specialists). Send a false or noisy alarm to an ops guy too many times and they become desensitized. Don't put enough context in your alarm messages and they won't use them (logging into a tool is asking too much; the email must contain everything they need or they complain). Any form of dynamic baselining is just too noisy. The simplest example is trying to "baseline" CPU usage. Looking at CPU usage without even something as trivial as comparing it to the run queue is stupid. It's actually even more stupid than that, because you should be looking at things top-down: so what if the CPU is at 100% and the run queue is 100? Are any user-facing transactions slowing down, i.e. is there customer impact? It could just be some batch job kicking off.
So in short, anything that looks at a metric in isolation is stupid. Dynamic baselines keyed on time of day, day of month, etc. are all rubbish; you're wasting your time with this approach. Sadly, this is all that current "cutting edge" third-generation APM tools offer. |
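To make the top-down point concrete, here is a rough sketch of what "don't look at CPU in isolation" could look like as an alarm rule. Everything in it is made up for illustration: the `HostSample` fields, the `classify` function, and the thresholds (90% CPU, 2 runnable tasks per core, a 1.5x latency slowdown) are assumptions, not values from any real APM tool. The only idea it encodes is that customer impact is checked first and CPU saturation is only supporting context.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    """One observation window for a host (all fields are hypothetical)."""
    cpu_pct: float          # CPU utilisation, 0-100
    run_queue: int          # runnable tasks waiting for a core
    cores: int              # number of cores on the host
    p95_latency_ms: float   # p95 latency of user-facing transactions
    baseline_p95_ms: float  # p95 latency considered normal for this service


def classify(s: HostSample,
             cpu_hot: float = 90.0,
             queue_per_core: float = 2.0,
             slowdown: float = 1.5) -> str:
    """Top-down rule: customer impact decides severity, CPU only adds context.

    The default thresholds are arbitrary placeholders.
    """
    # CPU counts as saturated only if it is busy AND work is queuing up.
    saturated = (s.cpu_pct >= cpu_hot
                 and s.run_queue >= queue_per_core * s.cores)
    # Customer impact: user-facing transactions are clearly slower than normal.
    customer_impact = s.p95_latency_ms >= slowdown * s.baseline_p95_ms

    if customer_impact and saturated:
        return "page"         # users are slow and CPU is a plausible cause
    if customer_impact:
        return "investigate"  # users are slow, CPU is not the obvious cause
    if saturated:
        return "log"          # busy box, nobody is hurting (e.g. a batch job)
    return "ok"


# A batch job pegging the CPU without slowing users should not page anyone.
batch_job = HostSample(cpu_pct=100, run_queue=100, cores=8,
                       p95_latency_ms=200, baseline_p95_ms=210)
print(classify(batch_job))
```

The same CPU and run-queue numbers with a p95 of 900 ms against a 200 ms baseline would come back as `page`, which is the whole point: the CPU metric never changes the outcome on its own.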