by de107549 3620 days ago

Let me first say that I am the co-founder and CEO of Instana, but I am trying to give a generic answer so that I don't "attack" competitors.

Most of the tools mentioned in this thread, including Datadog, SignalFX etc., use a simple agent to collect data - see the Datadog agent on GitHub: https://github.com/DataDog/dd-agent or StatsD (https://github.com/etsy/statsd), which is what SignalFX mostly recommends as they have no agent of their own. Tools like Prometheus work similarly.
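
To make the "simple agent" idea concrete, here is a minimal sketch of what a StatsD-style client puts on the wire: plain-text lines of the form name:value|type sent over UDP. The function names, the metric names and the 127.0.0.1:8125 address are assumptions for illustration (8125 is the conventional StatsD port); this is not the code of any of the agents mentioned above.

```python
import socket


def format_metric(name: str, value: float, metric_type: str = "g") -> str:
    """Build one StatsD line: <name>:<value>|<type> (g = gauge, c = counter, ms = timer)."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send to a StatsD-compatible listener (address is an assumption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

A counter increment would then be sent with send_metric(format_metric("web.requests", 1, "c")). Because UDP is connectionless, the sender gets no confirmation that a collector received the line.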

On the backend side you can see two approaches to data store technology: a time series based approach like DataDog or Prometheus and a streaming based approach like SignalFX - stream are the superior approach in my point of view as they allow for realtime approaches and stream (window based) analytics. There is a third category which is similar to time series but more "log" centric, like the ELK stack or tools like Splunk.

On top of the data store these tools give you the ability to build your own dashboards (and provide standard dashboards for standard technology) and alerting based on thresholds. They also allow you to add your own metrics via an API, which can be used to add application specific data. They also give you a query API to query and combine the data in the store. So overall this is a Lambda architecture for monitoring data.
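
As a rough illustration of threshold based alerting, the sketch below fires only when the most recent samples all exceed a fixed limit. ThresholdRule, should_alert and the default of three consecutive samples are hypothetical names and values, not the rule format of any tool named here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ThresholdRule:
    metric: str
    limit: float
    min_consecutive: int = 3  # samples above the limit required before alerting


def should_alert(rule: ThresholdRule, samples: Sequence[float]) -> bool:
    """Return True if the last `min_consecutive` samples all exceed the limit."""
    if len(samples) < rule.min_consecutive:
        return False
    recent = samples[-rule.min_consecutive:]
    return all(value > rule.limit for value in recent)
```

Both the limit and the number of consecutive samples have to be chosen and maintained by hand, per metric.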

I would say that SignalFX is the most sophisticated one, but the framework for working on streams is much more complicated than DataDog's time series approach, so people go the easier way.

The problem with all of these tools is that they rely on the user to build dashboards and thresholds and, in case of a problem, to do the correlation needed to find the root cause.

To correlate you need to understand the dependencies of the system components. As an easy example, if service A has a performance issue because it calls service B that has a CPU problem, you need to know that A calls B and correlate the latency of A with the latency and CPU of B to find the root cause. You can discover/model dependencies with tools like Zipkin (https://github.com/openzipkin) or Spring Cloud Sleuth (https://cloud.spring.io/spring-cloud-sleuth/) which are based on the Google Dapper paper. You could even add or log the Span ID to the metrics/logs so that you can correlate them automatically.
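
The A-calls-B example can be sketched as a walk over a call graph: starting from the service that shows symptoms, follow the call edges and report the unhealthy services that have no unhealthy dependency of their own. The CALLS topology, the function name and the notion of an "unhealthy" set are assumptions made up for this illustration.

```python
from typing import Dict, List, Set

# Hypothetical topology: caller -> services it calls.
CALLS: Dict[str, List[str]] = {"A": ["B"], "B": ["C"], "C": []}


def find_root_causes(start: str, calls: Dict[str, List[str]], unhealthy: Set[str]) -> List[str]:
    """Follow call edges from `start`; return unhealthy services with no unhealthy downstream."""
    roots: List[str] = []
    seen: Set[str] = set()
    stack = [start]
    while stack:
        svc = stack.pop()
        if svc in seen:
            continue
        seen.add(svc)
        downstream = calls.get(svc, [])
        # A service is a root-cause candidate if none of its callees is unhealthy too.
        if svc in unhealthy and not any(d in unhealthy for d in downstream):
            roots.append(svc)
        stack.extend(downstream)
    return roots
```

With A and B both marked unhealthy, find_root_causes("A", CALLS, {"A", "B"}) points at B rather than A, which is the correlation described above.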

Typically, if you do this manually, it is a disaster when things change. All your correlations (and even dashboards) will stop working if the topology of your services changes, which is quite normal in the microservice world.

Instana uses a stream based approach similar to SignalFX BUT we combine this with a graph database that holds all physical and application dependencies. Our agent automatically discovers all the components and dependencies and adds them to the graph in realtime - including containers etc.

We then use Google's four golden signals + capacity (which was added by Netflix as the fifth one) to analyze the KPIs of the services and apply machine learning to them. That way we don't need manual thresholds, which are also hard to maintain when things change a lot. If we see e.g. slow response times, sudden drops in requests or high error rates, then we analyze the dependency tree of that service to find the issues that are related to the problem and generate an incident for that - as we also discover changes, we add them to the incident, as most often a change is the reason for a problem. I've written a blog entry on the Dynamic Graph: https://www.instana.com/blog/monitoring-microservice-applica...
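
To illustrate the general idea of replacing a fixed threshold with a baseline learned from recent data, here is a deliberately simple z-score check. It is only a stand-in for the concept and is not the machine learning Instana uses; is_anomalous and the z_limit of 3.0 are assumptions for this sketch.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float, z_limit: float = 3.0) -> bool:
    """Flag `latest` if it lies more than z_limit standard deviations from the history mean."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # perfectly flat history: any deviation stands out
    return abs(latest - mu) / sigma > z_limit
```

The same function works for a latency, a request rate or an error rate without a per-metric limit, because the baseline comes from the metric's own history.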

Hope this answers your question.

Mirko

1 comments

bbrazil 3620 days ago

> stream are the superior approach in my point of view as they allow for realtime approaches and stream (window based) analytics.

I'd see them as slightly different approaches to providing fundamentally the same solution. One builds up time series and then operates on them, the other operates on the time series as they come in.

Taking Prometheus as an example we're a time series database, and you can do both realtime and window-based analysis. In fact that's how it is usually used.
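
A small sketch of the two styles described here, assuming a plain list of samples and a fixed window size: one function slices a stored series after the fact, the other consumes samples as they arrive. Both function names are hypothetical and neither reflects how Prometheus or SignalFx is implemented.

```python
from collections import deque
from typing import Iterable, Iterator, List, Sequence


def batch_window_means(series: Sequence[float], window: int) -> List[float]:
    """Store first, query later: slice the stored series once per window position."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


def stream_window_means(samples: Iterable[float], window: int) -> Iterator[float]:
    """Operate on samples as they arrive, keeping only the current window in memory."""
    buf = deque(maxlen=window)
    for sample in samples:
        buf.append(sample)
        if len(buf) == window:
            yield sum(buf) / window
```

For the same input and window size both produce the same moving averages; the difference is when the computation happens and how much data has to be kept around.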

> I would say that SignalFX is the most sophisticated

Do you have an example of something that you can do with your streaming approach that's not possible with other tools?

It's hard to get a proper understanding of the myriad of monitoring systems out there, so I'm always looking for insights.

> Our agent automatically discovers all the components and dependencies and adds them to the graph in realtime.

That sounds interesting, how do you do that for network dependencies? Do you have something like Zipkin?

de107549 3620 days ago

I agree that streaming and time series queries/scans are two different approaches which can solve the problem in the same way. With instant vectors in Prometheus queries you can operate very similarly to windows, and if you write the right queries and make sure they run in-memory you should also get similar performance and throughput.

My point was more about the framework you get and how easy it is to apply analytics to streams/queries. SignalFx seems to have a nice workbench for this with direct visual feedback in the UI, so that you can work on existing data to get the right result.

As said, we at Instana think that most people will not be able to build a sophisticated monitoring solution with these types of frameworks, as they don't have the time to do it and maybe not even the analytical domain knowledge. You can see that SignalFx is adding specific knowledge for some technologies. I'll give you two simple examples to show that it is not easy:

- How would you predict if a file system is running out of disk space?

- How would you predict if you should add a node to a Cassandra cluster because it is running out of capacity (and it can take some serious time to add a node, so you should know in advance)?

Even the disk space problem is hard to solve - linear regression and basic algorithms will not work.
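
For reference, this is roughly what the naive baseline looks like: fit a least-squares line through evenly spaced usage samples and extrapolate to the capacity. hours_until_full and the hourly sampling interval are assumptions for this sketch, and it is shown as the basic approach that is argued above to be insufficient, not as a recommendation.

```python
from typing import Optional, Sequence


def hours_until_full(used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Fit a least-squares line through hourly samples and extrapolate to capacity.

    Returns hours from the last sample until the line reaches capacity,
    or None if there are too few samples or usage is not growing.
    """
    n = len(used_gb)
    if n < 2:
        return None
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(used_gb) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, used_gb))
             / sum((x - x_mean) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    hour_at_full = (capacity_gb - intercept) / slope
    return max(0.0, hour_at_full - (n - 1))
```

This gives a sensible answer only while growth really is a straight line. Usage that drops periodically (for example after cleanup jobs) or grows in bursts makes the fitted slope, and therefore the forecast, unreliable.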

Now think of hundreds (or thousands) of services running on a dynamic container platform and new services released on a daily or even minute basis - with lots of different technologies involved...

No question that you can build a good monitoring solution with Prometheus, SignalFX, DataDog etc - but it will take a serious amount of time, consulting and dev teams adding the right instrumentation, metrics etc. And you need a lot of analytical knowledge. I can even imagine that there are situations where tools like Prometheus are a better choice - especially if you have a very strict set of technologies and communication frameworks and really good people to build a very specific set of "rules" for this environment.

We've added a domain model to our product (all the mentioned products have a generic metric model, but no semantics that describe servers, containers, processes, services and their communication, which is the domain of system and application monitoring): our Dynamic Graph.

And yes, we are using something very similar to Zipkin to get the dependencies between services. Here are two blog entries describing the approach:

- About distributed tracing: https://www.instana.com/blog/evolution-tracing-application-p...

- How we safely instrument code: https://www.instana.com/blog/how-instana-safely-instruments-...

Mirko

otterley 3619 days ago

> SignalFx seems to have a nice workbench for this with direct visual feedback in the UI, so that you can work on existing data to get the right result.

Wavefront does as well; I'd recommend you compare it for competitive analysis.

So would you say your product is in direct competition with these offerings, or do you see it more as a complement to them?

de107549 3618 days ago

Yes, I didn't compare to Wavefront as I have only basic insights and therefore cannot make a valid statement.

Competition depends on the use case - if you are using a tool like SignalFX for custom metric analytics, then we are no competition, as our focus is monitoring applications and their underlying infrastructure.

We are an Application Performance Management (APM) solution and therefore compete more with tools like New Relic or AppDynamics. These tools are sadly only used for troubleshooting in 90% of the cases and not for management or monitoring. They also do not work in highly dynamic and scaled environments as their "model" is too static (which they try to fix with their analytics offerings).

This is what we want to change, and where we add the whole stack to the game to analyze all the dependencies, help find root causes quickly, and monitor and predict the KPIs of your applications, services, clusters and components.

We integrate with solutions like SignalFX if needed, but I have had really good experiences doing "dashboarding" with more business related tools like Tableau or QlikView - this also offers application owners an easier way to aggregate the monitoring data and metrics on a higher (business) level, where tools like Instana provide the instrumentation data as input.
