| Let's first say that I am the co-founder an CEO of Instana, but I am trying to give a generic answer so that I don't "attack" competitors. Most of the mentioned tools in this thread, including Datadog, SignalFX etc are using a simple agent to collect data - see Datadog agent on GitHub: https://github.com/DataDog/dd-agent or statsD (https://github.com/etsy/statsd) that is mostly recommended by SignalFX who have no own agent. Tools like Prometheus work similar. On the backend side you can see two approaches for data store technology: A time series based approach like DataDog or Prometheus and a Streaming based approach like SignalFX - stream are the superior approach in my point of view as they allow for realtime approaches and stream (window based) analytics. There is a third category which is similar to time series but more "log" centric like the ELK stack or tool like Splunk. On top of the data store these tools give you the ability to build your own dashboards (and provide standard dashboards for standard technology) and a alerting based on thresholds. They also allow to add you own metrics via API which can be used to add application specific data. They also give you a query API to query and combine the data in the store. So overall this is a Lambda architecture for monitoring data. I would say that SignalFX is the most sophisticated one but the framework to work on stream is much more complicated then DataDogs time series approach so people go the easier way. The problem with all of these tools is that they rely on the user to build dashboards, thresholds and in case of a problem do the correlation to find the root cause of the problem. To correlate you need to understand the dependencies of the system components. As an easy example if service A has a performance issues because it calls service B that has a CPU problem, you need to know that A calls B and correlate the latency of A with the latency and CPU of B to find the root cause. You can discover/model dependencies with tools like Zipkin (https://github.com/openzipkin) or Spring Cloud Sleuth (https://cloud.spring.io/spring-cloud-sleuth/) which are based on the Google Dapper paper. You could even add or log the Span ID to the metrics/logs so that you can correlate them automatically. Typically if you do so manually it is a disaster for change. All your correlations (and even dashboards) will not work if the topology of your services changes. Which is quite normal in the microservice world. Instana uses a stream based approach similar to SignalFX BUT we combine this with a graph database that holds the dependencies of all physical and application dependencies. Our agent automatically discovers all the components and dependencies and adds them to the graph in realtime - including containers etc. We then use the Google Four golden signals + Capacity (that was added by Netflix as the fifth one) to analyze the KPIs of the services and apply machine learning on it. That way we don't need manual thresholds which are also hard to maintain when things change a lot. If we see e.g. slow response times or sudden drops in requests or high error rates, then we analyze the dependency tree of that service to find the issues that are related to the problem and generate an incident for that - as we also discover changes, we add them to the incident as most often a change is the reason for a problem. I've written a blog entry on the Dynamic Graph: https://www.instana.com/blog/monitoring-microservice-applica... Hope this answers you question. Mirko |
I'd see them as slightly different approaches to providing fundamentally the same solution. One builds up time series and then operates on them, the other operates on the time series as they come in.
Taking Prometheus as an example we're a time series database, and you can do both realtime and window-based analysis. In fact that's how it is usually used.
> I would say that SignalFX is the most sophisticated
Do you have an example of something that you can do with your streaming approach that's not possible with other tools?
It's hard to get a proper understanding of the myriad of monitoring systems out there, so I'm always looking for insights.
> Our agent automatically discovers all the components and dependencies and adds them to the graph in realtime.
That sounds interesting, how do you do that for network dependencies? Do you have something like Zipkin?