	by crescentfresh 4205 days ago
	What are the data sources for Atlas versus Suro at Netflix? (Suro: http://techblog.netflix.com/2013/12/announcing-suro-backbone...). Suro was/is used to collect "more than 1.5 million events per second during peak hours, or around 80 billion events per day" from EC2 instances.

copperlight 4205 days ago

There are several different data sources for Atlas:

* There is a poller cluster that gathers SNMP and HTTP healthcheck metrics and forwards them to the Atlas backend.

* There are on-instance log parsers written in Perl and Python that count events in Apache HTTPd and Tomcat logs and send data to the Atlas backend.

* The Servo library [0] is used to instrument Java code with counters, timers and gauges. There is a separate client implementation that handles forwarding metrics to the Atlas backend. The client also polls and reports JMX metrics from the JVM that it runs inside. Spectator [1] is a new library that provides cleaner abstractions of Servo concepts.

* The Prana sidecar [2] was extended to provide REST endpoints for Servo and the client, so that metrics can be delivered from non-Java code.

[0] https://github.com/Netflix/servo

[1] https://github.com/Netflix/spectator

[2] https://github.com/Netflix/prana
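
For readers unfamiliar with the three instrument types mentioned above, here is a minimal sketch in Python. It is not the Servo or Spectator API; the `Registry` class, its method names, and the metric names are all hypothetical and exist only to show how counters, gauges, and timers differ.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Registry:
    """Hypothetical in-process registry; NOT the Servo or Spectator API."""

    def __init__(self):
        self.counters = defaultdict(int)              # monotonically increasing totals
        self.gauges = {}                              # last observed value
        self.timers = defaultdict(lambda: [0, 0.0])   # [call count, total seconds]

    def increment(self, name, amount=1):
        # Counter: add to a running total.
        self.counters[name] += amount

    def set_gauge(self, name, value):
        # Gauge: overwrite with the current value.
        self.gauges[name] = value

    def record(self, name, seconds):
        # Timer: track how many times something ran and for how long in total.
        entry = self.timers[name]
        entry[0] += 1
        entry[1] += seconds

    @contextmanager
    def timed(self, name):
        # Convenience wrapper that measures a block and records it as a timer.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, time.perf_counter() - start)

    def snapshot(self):
        # A plain dict that a separate client could forward to a backend.
        return {
            "counters": dict(self.counters),
            "gauges": dict(self.gauges),
            "timers": {
                name: {"count": count, "total_seconds": total}
                for name, (count, total) in self.timers.items()
            },
        }


registry = Registry()
registry.increment("http.requests")
registry.increment("http.requests", 2)
registry.set_gauge("queue.depth", 7)
registry.record("db.query", 0.25)
registry.record("db.query", 0.75)
snapshot = registry.snapshot()
```

The `snapshot()` step mirrors the split described above between instrumenting code and a separate client that forwards the collected values; the forwarding itself is left out.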

lifeisstillgood 4205 days ago

What kind of ratio of metadata traffic (telemetry) to total traffic did you see? How does this divide between "system level" and "application level"?

My client is looking at these telemetry problems now. Is there any possibility of commercial high-level consultancy coming out of Netflix / colleagues? Ping me via the details in my profile if you can help.

copperlight 4205 days ago

Telemetry traffic is a small fraction of the total traffic running through a region, partly due to the use of the Smile data format (binary JSON) for delivering metrics from the client to the Atlas backend.

When you give developers tools for creating and aggregating highly dimensional metrics, they tend to create lots of metrics so that they can answer interesting business questions about the use of their applications. We have some developers who have written code that produces up to 150,000 metrics per instance, and the vast majority of these metrics are application-level. Typically, 3-5% of the metrics delivered from an instance are system-level performance metrics.
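
To make "highly dimensional" concrete, here is a rough Python sketch, assuming a metric is identified by a name plus a set of key/value tags and that queries aggregate across one tag at a time. The `metric_id` and `aggregate` helpers, the tag keys, and the sample values are all made up for illustration; this is not how Atlas stores or queries data.

```python
from collections import defaultdict


def metric_id(name, **tags):
    """Hypothetical identity for one time series: a name plus sorted tags."""
    return (name, tuple(sorted(tags.items())))


def aggregate(samples, name, group_by):
    """Sum all series called `name`, grouped by the value of one tag."""
    totals = defaultdict(float)
    for (sample_name, tags), value in samples.items():
        if sample_name != name:
            continue
        key = dict(tags).get(group_by, "unknown")
        totals[key] += value
    return dict(totals)


# Illustrative values only: each distinct tag combination is its own series.
samples = {
    metric_id("requests", country="US", device="tv"): 10.0,
    metric_id("requests", country="US", device="phone"): 5.0,
    metric_id("requests", country="BR", device="tv"): 4.0,
    metric_id("errors", country="US", device="tv"): 1.0,
}

by_country = aggregate(samples, "requests", "country")
by_device = aggregate(samples, "requests", "device")
```

Every extra tag multiplies the number of distinct series, which is how a single application can reach a very large per-instance metric count.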

CrankyFool 4205 days ago

And this, of course, doesn't account for cases where a minor developer error results in code that, say, creates a new metric for every source IP address from which we see a request. Dynamic metric names FTW.
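
A small Python sketch of that failure mode, assuming a hypothetical naming scheme in which the source IP is embedded in the metric name. The helper and the address range are invented for the example.

```python
def count_distinct_series(source_ips, name_for):
    """Count how many distinct metric names a naming scheme produces."""
    return len({name_for(ip) for ip in source_ips})


# 10,000 made-up client addresses.
source_ips = [f"10.0.{i // 256}.{i % 256}" for i in range(10_000)]

# The mistake: one metric name per source IP, so cardinality grows with traffic.
per_ip_series = count_distinct_series(source_ips, lambda ip: f"requests.{ip}")

# A bounded alternative: a single fixed name, regardless of who is calling.
fixed_series = count_distinct_series(source_ips, lambda ip: "requests")
```

With the per-IP scheme the number of series is unbounded, since it tracks the number of distinct clients rather than anything the developer chose up front.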

lifeisstillgood 4205 days ago

Thank you both for your insights

jedberg 4205 days ago

Suro collects arbitrary data, and Atlas is for numbers (time-series numbers, to be more specific). So metrics generally go through Atlas and logs go through Suro.
