Saving 13M Computational Minutes per Day with Flame Graphs | HN Mirror


	Saving 13M Computational Minutes per Day with Flame Graphs (techblog.netflix.com)
	88 points by mspier 3723 days ago

7 comments

azinman2 3723 days ago

What I think is interesting about this is that they weren't able to easily measure or find these hotspots using existing tools -- they needed a combination of visualization and data munging to do so.

Visualization is an often overlooked tool in CS -- for example, IDEs do little to no visualization... only LightTable is starting to break out of the traditional text document. It also shows that, depending on the problem, visualization & data can be morphed and stretched to provide new insights when others might have walked away.

So why isn't this something that's a part of job interviewing or a bigger part of our normal toolbox as engineers?

sdesol 3723 days ago

> Visualization is an often overlooked tool in CS

It's often overlooked because generating meaningful data that can provide visual insight is usually very difficult. Right now I'm working on a blog post that goes over how you can use motion bubble charts to track code changes, and I use GitLab as an example. You can find a draft of the blog at:

http://gitsense.github.io/blog/motion-bubble-charts.html

Note the blog post is still in DRAFT state, so there are broken links and grammatical errors and what not.

Capturing meaningful data at enterprise scale requires a lot of effort. There is a reason why I ended up creating my own real-time process monitoring system:

http://gitsense.github.io/blog/realtime-process-monitoring.h...

What I'm ultimately hoping to do with the metrics is create a new way to visualize Git logs and improve how we approach complex code reviews and diffs.

AlexC04 3723 days ago

I saw a YouTube talk on this ... I think it was this one: https://www.youtube.com/watch?v=O1YP8QP9gLA

Really great stuff. The spot where he gets to a pretty good description of how he uses his flame graph is roughly here: https://youtu.be/O1YP8QP9gLA?t=611

With respect to that blog post, the bit about the truncated towers is a bit of a red herring if you're 100% new to flame graphs.

The real meaty bits are the wide sections.
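To make "wide" concrete: a frame's width in a flame graph is proportional to the number of samples in which it appears on the stack. Here is a rough sketch, assuming the semicolon-separated folded-stack text format (frames, then a space, then a sample count). The stack lines and function names are made up for illustration, and this aggregates by frame name rather than by position in the graph, so treat it as an approximation.

```python
from collections import Counter


def frame_widths(folded_lines):
    """Return inclusive sample counts per frame name from folded-stack lines."""
    widths = Counter()
    for line in folded_lines:
        # Each line looks like "frame_a;frame_b;frame_c 42".
        stack, _, count = line.rpartition(" ")
        samples = int(count)
        # Count a frame once per stack so recursion doesn't inflate it.
        for frame in set(stack.split(";")):
            widths[frame] += samples
    return widths


# Hypothetical sample data, not taken from the blog post.
folded = [
    "main;handle_request;parse_json 30",
    "main;handle_request;render 50",
    "main;gc 20",
]

widths = frame_widths(folded)
total = sum(int(line.rpartition(" ")[2]) for line in folded)

# Widest frames first: these are the candidates worth a closer look.
for frame, samples in widths.most_common():
    print(f"{frame:15s} {samples:4d} {100 * samples / total:5.1f}%")
```

With this data, handle_request shows up in 80 of 100 samples, so it would span 80% of the graph's width.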

MikeTheJoker 3723 days ago

Generally you're right that the wide sections are where you want to focus your attention when looking for optimizations. The point I was trying to make in the blog post is that we had to take the flame graph visualization a step further to eliminate the noise obscuring a major hot spot. The large number of broken stacks was one of the first hurdles we had to cross to improve the clarity of the visualization.

BTW, this is a different flame graph and optimization than the one discussed in the YouTube video. We use flame graphs extensively throughout Netflix.

m4dc4pXXX 3723 days ago

Can you write up how you fixed the broken call stacks? I've used Brendan's tools (with java-perf-map, also an awesome tool) to generate flame graphs for Scala code and had no idea I could only see 127 frames.

brendangregg 3723 days ago

We ultimately should be fixing this with BPF, which we'll certainly post instructions for.

surrealvortex 3723 days ago

I'm currently using flame graphs at work. If your application hasn't been profiled recently, you'll usually get lots of improvement for very little effort.

Some 15 minutes of work improved CPU usage of my team's biggest fleet by ~40%. Considering we scaled up to 1500 c3.4xlarge hosts at peak in NA alone on that fleet, those 15 minutes kinda made my month :)

One thing to note once you eliminate the easy pickings is that as you go higher up the call graph, the profiler visualization is often misleading. There may be sections of code without safe-points, and stuff that appears wide on the flame graph may just be getting blamed for adjacent code that doesn't have safe points.

tracker1 3722 days ago

Profiling in general is a really good thing when you're seeing odd load/timing/performance issues... I once found a project was storing its configuration settings (loaded/cached from DB) in a really badly performing way: an in-memory datatable with text queries instead of a hashtable (not my design).

A single call wasn't so bad, but the lookup was happening many hundreds of times per request adding seconds to some requests. Wild how much difference a relatively small thing can make.
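A rough illustration of the difference, in Python rather than whatever that project used: the linear scan below only stands in for the text-queried in-memory table, and the setting names and row count are invented.

```python
import timeit

# Hypothetical config rows, as they might come back from a database.
rows = [(f"setting_{i}", f"value_{i}") for i in range(1000)]


def lookup_scan(key):
    """Linear scan over all rows, similar in spirit to querying a table by text."""
    for name, value in rows:
        if name == key:
            return value
    return None


# Build the hash table once; each lookup afterwards is a single dict access.
config = dict(rows)


def lookup_hash(key):
    """Hash-based lookup against the prebuilt dict."""
    return config.get(key)


# Time a worst-case key (the last row) for both approaches.
scan_time = timeit.timeit(lambda: lookup_scan("setting_999"), number=1000)
hash_time = timeit.timeit(lambda: lookup_hash("setting_999"), number=1000)
print(f"scan: {scan_time:.4f}s  hash: {hash_time:.4f}s")
```

The exact numbers depend on the machine, but the scan cost grows with the number of settings while the dict lookup stays roughly constant, which is what hurts when the lookup runs hundreds of times per request.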

surrealvortex 3721 days ago

That brings up another distinction - profilers don't distinguish between a method that takes very little time to run but is called very often and another method that is pretty expensive, but is not called very often.

Ultimately, we do care about the total time taken, but the approaches necessary for the two cases above are very different. In many cases, the method that is simply called very often will call for some type of caching solution in the caller, while the more expensive method will require retooling within the method itself.
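For the "cheap but called very often" case, a minimal sketch of caching on the caller's side might look like this. It assumes the result can safely be reused for the same argument; `get_setting` and `handle_request` are hypothetical names, and the call counter is only there to show the effect.

```python
from functools import lru_cache

calls = {"count": 0}


def get_setting(name):
    """Hypothetical cheap-but-hot function; counts how often it really runs."""
    calls["count"] += 1
    return name.upper()


@lru_cache(maxsize=None)
def get_setting_cached(name):
    """Memoized wrapper: the underlying function runs once per distinct name."""
    return get_setting(name)


def handle_request():
    """Simulates a request that asks for the same setting many times."""
    return [get_setting_cached("timeout") for _ in range(500)]


result = handle_request()
print(calls["count"])  # the underlying function ran once, not 500 times
```

The expensive-but-rare case gets no benefit from this; there the work inside the method itself has to change.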

asragab 3723 days ago

"Middle-Out" approach...wonder where they got that from?

mrgriscom 3723 days ago

That's when I checked to make sure it wasn't an April Fools joke.

f_ 3723 days ago

Very interesting indeed; but somehow I was even more baffled at the package names they seem to be using:

  com.netflix.vulturemonkey.cow.iguana.MacawSquirrel

  com.netflix.ape.serpent.vulture.ApeVultureMantis

  com.netflix.iguanas.monkey.insect.IguanaRabbit

Any idea what's up with that?

mspier 3723 days ago

We really love animals! :-) JK. Just obfuscating class names with animal names before publishing the blog post.

f_ 3723 days ago

Hehe! Thanks for the clarification -- I wondered if everyone was going bananas over at your company! This explains it (:

mspier 3723 days ago

Still better than some unpronounceable old-norse names we had on a few projects. :-)

Illniyar 3723 days ago

Wouldn't such an insanely big call stack be a performance issue in itself?

d33 3723 days ago

Well, it depends. A stack is just a data structure. I'd say that merely going from a deep call stack to an "iterative" version, where you can clearly see the stack in the code, doesn't automatically make it much better.
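A small sketch of what I mean, using a made-up nested-list "tree". Both versions do the same work; the iterative one just keeps its stack in a list on the heap instead of in call frames.

```python
def total_recursive(node):
    """Sum a nested list of ints; each nesting level adds a call-stack frame."""
    if isinstance(node, int):
        return node
    total = 0
    for child in node:
        total += total_recursive(child)
    return total


def total_iterative(node):
    """Same traversal, with an explicit list standing in for the call stack."""
    total = 0
    stack = [node]
    while stack:
        current = stack.pop()
        if isinstance(current, int):
            total += current
        else:
            stack.extend(current)
    return total


tree = [1, [2, [3, [4]]], 5]
print(total_recursive(tree), total_iterative(tree))

# Build a very deep nest: [[[...[1]...]]] with 5000 levels.
deep = 1
for _ in range(5000):
    deep = [deep]

print(total_iterative(deep))
try:
    total_recursive(deep)
except RecursionError:
    # The interpreter's frame limit is the main practical difference here.
    print("recursive version hit the interpreter's recursion limit")
```

The depth limit is a property of the runtime, not evidence that the recursive version is slower per element.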

geodel 3723 days ago

These call stacks are normal for a typical Java enterprise application.

topspin 3723 days ago

Indeed, this is normal. Apache Camel produced such huge stack traces that they refactored the routing system specifically to reduce AsyncCallback usage and shorten stack traces; at one time Camel would dump traces thousands of lines long. However, pointing this out doesn't actually address the question: is there a performance issue indicated by these huge call stacks?

I've wondered about the question myself when encountering incredibly long stack traces while troubleshooting Java systems. I've also wondered if there is some more general dysfunction indicated. I've seen impressive stack traces in C and C++, but nothing quite like what I've found in Java. What is the experience of C# programmers?

ww520 3723 days ago

A recursive call would have a huge call stack.

chadlavi 3722 days ago

just think of the millions of dollars they could save if they stopped doing double spaces after sentences