There are several different data sources for Atlas:
* There is a poller cluster that gathers SNMP and HTTP healthcheck metrics and forwards them to the Atlas backend.
* There are on-instance log parsers written in Perl and Python that count events in Apache HTTPd and Tomcat logs and send data to the Atlas backend.
* The Servo library [0] is used to instrument Java code with counters, timers and gauges. There is a separate client implementation that handles forwarding metrics to the Atlas backend. The client also polls and reports JMX metrics from the JVM that it runs inside. Spectator [1] is a new library that provides cleaner abstractions of Servo concepts.
* The Prana sidecar [2] was extended to provide REST endpoints for Servo and the client, so that metrics can be delivered from non-Java code.
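To make the on-instance log parser idea concrete, here is a minimal Python sketch that counts Apache access-log events by HTTP method and status class. The regex, the `count_events` name and the tag layout are assumptions for illustration, not the actual parsers, and the step that forwards the counts to the Atlas backend is left out because its API is not described here.

```python
import re
from collections import Counter

# Matches the request and status fields of an Apache common/combined log line,
# e.g. ... "GET /index.html HTTP/1.0" 200 2326
LOG_LINE = re.compile(r'"(?P<method>[A-Z]+) \S+ [^"]*" (?P<status>\d{3}) ')


def count_events(lines):
    """Count requests per (method, status class), e.g. ("GET", "2xx")."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not access-log entries
        status_class = match.group("status")[0] + "xx"
        counts[(match.group("method"), status_class)] += 1
    return counts
```

A real parser would tail the log file and flush these counts on a fixed interval. Grouping by status class rather than by URL keeps the number of distinct counters small.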
What kind of ratio of metadata traffic (telemetry) to total traffic did you see? How does this divide between "system level" and "application level"?
My client is looking at these telemetry problems now. Is there any possibility of commercial high-level consultancy coming out of Netflix or its colleagues? Ping me using the details in my profile if you can help.
Telemetry traffic is a small fraction of the total traffic running through a region, partially due to the use of the Smile data format (binary JSON) for delivering metrics from the client to the Atlas backend.
When you give developers tools for creating and aggregating highly dimensional metrics, they tend to create lots of metrics so that they can answer interesting business questions about the use of their applications. We have some developers who have written code that produces up to 150,000 metrics per instance, and the vast majority of these metrics are application-level. Typically, 3-5% of the metrics delivered from an instance are system-level performance metrics.
And this, of course, doesn't account for cases where a minor developer error results in code that, say, creates a new metric for every source IP address from which we see a request. Dynamic metric names FTW.
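For anyone wondering why that hurts, here is a toy Python sketch of the failure mode. The `Registry` class and both `record_*` functions are hypothetical and only model the idea that every distinct (name, tags) pair becomes its own time series; this is not the Servo or Spectator API.

```python
from collections import defaultdict


class Registry:
    """Toy in-memory registry: one counter per distinct (name, tags) pair."""

    def __init__(self):
        self._counters = defaultdict(int)

    def increment(self, name, **tags):
        # The sorted tag tuple is part of the key, so each new tag value
        # creates a new time series.
        key = (name, tuple(sorted(tags.items())))
        self._counters[key] += 1

    def series_count(self):
        return len(self._counters)


def record_unbounded(registry, source_ip, status):
    # Anti-pattern: the source IP is an unbounded dimension.
    registry.increment("requests", ip=source_ip, status=str(status))


def record_bounded(registry, source_ip, status):
    # Bounded alternative: drop the IP and keep only the status class.
    registry.increment("requests", status=f"{status // 100}xx")


if __name__ == "__main__":
    bad, good = Registry(), Registry()
    for i in range(1000):
        ip = f"10.0.{i // 256}.{i % 256}"
        record_unbounded(bad, ip, 200)
        record_bounded(good, ip, 200)
    print(bad.series_count(), good.series_count())
```

With 1,000 distinct client addresses the unbounded version holds 1,000 series for what is logically one counter, while the bounded version holds one.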
Suro collects arbitrary data, and Atlas is for numbers (time series numbers, to be more specific). So metrics generally go through Atlas and logs go through Suro.
[0] https://github.com/Netflix/servo
[1] https://github.com/Netflix/spectator
[2] https://github.com/Netflix/prana