Hacker News
by brendangregg 4205 days ago
I work at Netflix and use Atlas every day. It's our go-to performance monitoring tool, and has solved countless performance and reliability issues. It's exciting to have it open source!

I summarized it in a recent talk at Surge 2014, where I showed its role in a performance investigation and how it is central to everything:

http://youtu.be/H-E0MQTID0g?t=22m http://www.slideshare.net/brendangregg/netflix-from-clouds-t...

There are more features in Atlas that I didn't mention; check out the Overview on GitHub linked in the blog post.

4 comments

What are the sources for data for Atlas vs for Suro at Netflix?

(Suro: http://techblog.netflix.com/2013/12/announcing-suro-backbone...). Suro was/is used to collect "more than 1.5 million events per second during peak hours, or around 80 billion events per day" from EC2 instances.

There are several different data sources for Atlas:

* There is a poller cluster that gathers SNMP and HTTP healthcheck metrics and forwards them to the Atlas backend.

* There are on-instance log parsers written in Perl and Python that count events in Apache HTTPd and Tomcat logs and send data to the Atlas backend.

* The Servo library [0] is used to instrument Java code with counters, timers and gauges. There is a separate client implementation that handles forwarding metrics to the Atlas backend. The client also polls and reports JMX metrics from the JVM that it runs inside. Spectator [1] is a new library that provides cleaner abstractions of Servo concepts.

* The Prana sidecar [2] was extended to provide REST endpoints for Servo and the client, so that metrics can be delivered from non-Java code.

[0] https://github.com/Netflix/servo

[1] https://github.com/Netflix/spectator

[2] https://github.com/Netflix/prana
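To make the counter/timer/gauge idea concrete, here is a minimal sketch in Python. It is not the Servo or Spectator API; the `Registry` class, its method names, and the metric and tag names are all hypothetical and only illustrate how a metric is identified by a name plus a set of tags.

```python
import time
from collections import defaultdict


class Registry:
    """Toy in-memory registry (hypothetical; not the Servo or Spectator API)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timers = defaultdict(lambda: {"count": 0, "total_seconds": 0.0})
        self.gauges = {}

    @staticmethod
    def _key(name, tags):
        # A metric identity is its name plus a sorted tuple of tag pairs.
        return (name, tuple(sorted(tags.items())))

    def increment(self, name, amount=1, **tags):
        # Counter: a value that only grows, e.g. requests served.
        self.counters[self._key(name, tags)] += amount

    def record(self, name, seconds, **tags):
        # Timer: tracks how many events happened and their total duration.
        timer = self.timers[self._key(name, tags)]
        timer["count"] += 1
        timer["total_seconds"] += seconds

    def set_gauge(self, name, value, **tags):
        # Gauge: the most recently sampled value, e.g. a queue depth.
        self.gauges[self._key(name, tags)] = value


registry = Registry()
registry.increment("requests", status="200")
registry.increment("requests", status="200")
registry.increment("requests", status="500")

start = time.monotonic()
sum(range(1000))  # stand-in for real work
registry.record("handler.latency", time.monotonic() - start, endpoint="/play")

registry.set_gauge("queue.depth", 7, pool="encoder")
```

A real client would also flush these values to a backend on an interval; that part is left out here.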

What kind of ratio of metadata traffic (telemetry) to total traffic did you see? How does this divide between "system level" and "application level"?

My client is looking at these telemetry problems now. Is there a possibility of commercial high-level consultancy coming out of Netflix / colleagues? Ping me using the details in my profile if you can help.

Telemetry traffic is a small fraction of the total traffic running through a region, partially due to the use of the Smile data format (binary JSON) for delivering metrics from the client to the Atlas backend.

When you give developers tools for creating and aggregating highly dimensional metrics, they tend to create lots of metrics so that they can answer interesting business questions about the use of their applications. We have some developers who have written code that produces up to 150,000 metrics per instance, and the vast majority of these metrics are application-level. Typically, 3-5% of the metrics delivered from an instance are system-level performance metrics.

And this, of course, doesn't account for cases where a minor developer error results in code that, say, creates a new metric for every source IP address from which we see a request. Dynamic metric names FTW.
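A rough illustration of why per-IP metrics hurt, as a Python sketch. The traffic is synthetic and the `series_count` helper is hypothetical; the point is only that a tag with a bounded set of values yields a handful of time series, while a tag taken from the source IP yields one series per client.

```python
def series_count(requests, tag_fn):
    """Count the distinct time series created when each request is tagged by tag_fn."""
    series = set()
    for req in requests:
        # One time series per unique (name, tags) combination.
        series.add(("requests", tuple(sorted(tag_fn(req).items()))))
    return len(series)


# Synthetic traffic: 10,000 requests, each from a different source IP.
requests = [
    {"ip": f"10.0.{i // 256}.{i % 256}", "status": 200 if i % 10 else 500}
    for i in range(10_000)
]

# Bounded tag: only as many series as there are status codes.
bounded = series_count(requests, lambda r: {"status": str(r["status"])})

# Unbounded tag: one series per source IP.
unbounded = series_count(requests, lambda r: {"ip": r["ip"]})

print(bounded, unbounded)
```

With this synthetic input the status tag produces 2 series and the IP tag produces 10,000, and the second number keeps growing with every new client.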
Thank you both for your insights.

Suro collects arbitrary data, and Atlas is for numbers (time series numbers to be more specific). So metrics generally go through Atlas and logs go through Suro.

I really enjoyed your overall Linux Performance Tool talk: http://youtu.be/SN7Z0eCn0VY

Thanks for sharing your expertise with performance monitoring. Releasing Atlas is adding tons of value to what I've already seen.

Brendangregg, do you know if anyone has tried to use this platform to predict failures or DDoS attacks ahead of time? Or is that not feasible with this API? Thanks.

We've used it to predict failures, and as a data source to predict scale-up and scale-down events. It hasn't been used for DDoS prediction, but I see no reason why it couldn't be.

Not Brendan, but ... we do a bunch of outlier and anomaly detection using Atlas to notice slow degradation in cluster performance based on outlier nodes and auto-execute them.
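For readers wondering what outlier-node detection can look like, here is a minimal Python sketch. It is not Netflix's method, which the comment does not describe; it uses a generic modified z-score based on the median absolute deviation (MAD), and the node names, latency values, and threshold are made up for illustration.

```python
from statistics import median


def find_outliers(latency_by_node, threshold=3.5):
    """Return nodes whose latency deviates strongly from the cluster median.

    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    """
    values = list(latency_by_node.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All nodes look (nearly) identical; nothing to flag with this method.
        return []
    return sorted(
        node
        for node, value in latency_by_node.items()
        if 0.6745 * abs(value - med) / mad > threshold
    )


# Hypothetical per-node request latency in milliseconds for one cluster.
cluster = {"i-a": 101, "i-b": 98, "i-c": 103, "i-d": 99, "i-e": 240}
print(find_outliers(cluster))
```

Median-based statistics are used here because a single bad node barely moves the median, whereas it would inflate a mean and standard deviation. Acting on the result, such as removing the node from service, is a separate step not shown.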
OT, but I'm using your app on a stock Nexus 7 (at 4.4.4), and I find a couple of controls very difficult to trigger, while others respond just fine at the first touch.

The "back up 30 seconds" button can be very fussy and difficult to trigger. Sometimes it takes many touches before it will trigger.

Also, in a series, when the end titles of an episode are displayed, it can be quite difficult to get a touch to register so that the titles play out rather than the app skipping forward to the next episode. (And even then, another countdown starts that will auto-launch the next episode. I wish, and have told your call-in support, that you would add a user account setting to disable this.)

Sorry for the OT, but I would think you'd want the app more responsive in this regard. As I said, the other controls/buttons do not exhibit this difficulty; they are immediately responsive.

I'll take the downvotes if that's necessary.

Netflix had months-long problems with the audio on some of their programs, including programs that others here mentioned watching.

It was only after I made some comments regarding these audio problems, here and on Reddit, that they were fixed.

(I'd reported them via Netflix's problem reporting mechanisms, months earlier when I discovered them, to no effect.)