by compiler-guy 1419 days ago
I sort of owe callgrind a big chunk of my career.

I was working at a company full of PhDs and well-seasoned veterans, who looked at me as a new kid, kind of underqualified to be working in their tools group. I had been at the firm for a while, and they were nice enough, but didn't really have me down as someone who was going to contribute as anything other than a very junior engineer.

We had a severe problem with a program's performance, and no one really had any idea why. And as it was clearly not a sophisticated project, I got assigned to figure something out.

I used the then very new callgrind and the accompanying flamegraph, and discovered that we were passing very large bit arrays for register allocation by value. Very, very large. They had started small enough to fit in registers, but over time had grown so large that a function call to manipulate them effectively flushed the cache, and the rest of the code assumed these operations were cheap.
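To make the failure mode concrete, here is a minimal sketch of by-value versus by-reference passing. Everything in it is an assumption for illustration: `RegSet`, `kBits`, and the function names are hypothetical, and the real type and its size are not given in the story.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the register-allocation bit array; the real
// type and size are not known. 2^20 bits is roughly 128 KiB per object.
constexpr std::size_t kBits = std::size_t{1} << 20;
using RegSet = std::bitset<kBits>;

// By value: the caller copies the whole object into the parameter on
// every call, which is what makes a "cheap" helper expensive.
bool is_live_by_value(RegSet regs, std::size_t reg) {
    assert(reg < kBits);
    return regs.test(reg);
}

// By const reference: only an address is passed, nothing is copied.
bool is_live_by_ref(const RegSet& regs, std::size_t reg) {
    assert(reg < kBits);
    return regs.test(reg);
}

// Both versions return the same answer; only the cost of the call differs.
bool same_answer(std::size_t reg) {
    static RegSet live;  // static storage keeps the large object off the stack
    live.set(42);
    return is_live_by_value(live, reg) == is_live_by_ref(live, reg);
}
```

The two functions are interchangeable at the call site, which is probably why this kind of thing goes unnoticed until a profile shows the copies.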

Profiling tools at the time were quite primitive, and the application was a morass of shared libraries, weird dynamic allocations and JIT, and a bunch of other crap.

Valgrind was able to get the profiles after everything else I tried had failed.

The presentation I made on that discovery, and my proposed fixes (which eventually sped everything up greatly), finally earned the respect of my colleagues, and not having a PhD wasn't a big deal after that. Later on, those colleagues who had left the company invited me to my next gig. And the one after that.

So thanks!

6 comments

I have a very similar experience, but with a different profiling tool. When I first graduated from school and joined a big internet company, I wasn't that "different". The serving stack was all in C++. My colleagues were really capable but not that into "tools"; they'd rather depend on themselves (guess, tune, measure).

But I, as a new member of the team, learned Google perftools, introduced it to the team, and gave a presentation breaking down the running time of the big binary. I have to say that presentation was a life-changing moment in my career.

So together with you, I really want to thank those who devoted so much effort to building these tools. When I was giving the presentation, I really felt I was standing on the shoulders of giants, and those giants were helping me.

And over the years, I have used more and more tools such as valgrind, pahole, asan, and tsan.

Much appreciated!

I've mentioned this before on HN as a way for a "newbie" to look like a superhero in a job very quickly; nice to hear a story of it actually working!

There is so much code in the world that nobody has even so much as glanced at a profile of, and any non-trivial, unprofiled code base is virtually guaranteed to have some kind of massive performance problem that is also almost trivial to fix like this.

Put this one in your toolbelt, folks. It's also so fast that you can easily try it without having to "schedule" it, and if I'm wrong and there aren't any easy profiling wins, hey, nobody has to know you even looked. Although in that case, you just learned something about the quality of the code base; if there aren't any profiling quick wins, that means someone else claimed them. As the codebase grows the probability of a quick win being available quickly goes to 1.

I always find it weird when people berate C++ tooling. Valgrind and adjacent friends are legitimately best in class and incredibly useful. Between RAII and a stack of robust static analyzers you'd have to deliberately write unsafe code these days.

That sounds great until you realise in other languages you get that by default without any tooling. And with better guarantees too (C++ static analysers aren’t foolproof).

Where C++ tooling really lacks is around library management and build tooling. The problem is less that any of the individual tools don’t work and more that there are many of them and they don’t interoperate nicely.

What language has anything like cachegrind, which is the topic of this thread? Cache misuse is one of the largest causes of bad performance these days, and I can't think of any language that has anything built in for that.

Sure, other languages have some nice tools to do garbage collection (so does C++, but it is optional, and reference counting does have drawbacks), but there is a lot more to tooling than just garbage collection. Even Rust's memory model has places where it can't do what C++ can. (You can't use atomics to write data from two different threads at the same time.)

No language has good tools around libraries and builds. So long as you stick to exactly one language with the build system of that language, things seem nice. However, in the real world we have a lot of languages, and a lot of libraries that already exist. Let me know what build/library tool I can use with this library that builds with autotools, this other one that uses cmake, and this one that uses qmake (though at least Qt is switching to cmake, which is becoming the de facto C++ standard), just to name a few that handle dependencies in very different ways.

> Even rust's memory model has places where it can't do what C++ can. (you can't use atomic to write data from two different threads at the same time)

Perhaps not in safe Rust, but can you provide an example of something Rust can't do that C++ can? It has the same memory model as C++20: https://doc.rust-lang.org/nomicon/atomics.html

The atomics themselves, sure, but I guess often they'll be used as a barrier to protect an UnsafeCell or something, like in the implementation of Lazy<T>: https://docs.rs/lazy-init/0.5.0/src/lazy_init/lib.rs.html#85

To be fair, as an outsider to both Rust and JS, they seem to have pretty robust package management between cargo and npm, although npm is kinda cheating, as collating scripts isn't quite as complex as building binaries, whereas pip is absolutely unbearable with all the virtual env stuff.

I've been quite lucky with CMake, after the initial learning period I've found everything "just works" as it is quite well supported by modern libs.

Cargo and npm are very robust so long as you stick only to their respective ecosystems. However, as soon as you need something from a different ecosystem they each become hard. The initial import into an ecosystem isn't hard, but the reimport after every upstream update is very annoying.

I love this story. I'm becoming an older dev now and I've often been blindsided by some insight or finding by juniors - it's really great to see & you've always got to make sure they get credit!

I’m surprised to see the attribution to the tools and not your proposed fixes. Sure, the discovery was the first step in the order of operations, but can you elaborate on what enabled you to understand the problem statement and subsequent resolution?

There has to be a deeper understanding I think

I can share mine. It's an ads retrieval system. It is very latency-sensitive and has to be efficient. To avoid memory allocations, special hashtables with a fixed number of buckets (also open addressing) are used in multiple places in query processing. The default is 1000. However, there are cases where the number of elements is only a handful. In those cases the table makes poor use of the cache and is therefore slower.

The solution was to tune the number of buckets using information derived from the pprof callgraph.
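A rough sketch of the idea, not the actual code: `FixedMap`, `DefaultMap`, and `SmallMap` are hypothetical names, and the key/value types and the linear-probing scheme are assumptions. The bucket count is a compile-time parameter, so a call site that only ever holds a handful of elements can use a table that fits in a few cache lines instead of one that spans many.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical fixed-capacity map using open addressing with linear probing.
// Storage is inline, so inserts and lookups never allocate. No erase is
// provided, which is what lets find() stop at the first empty slot.
template <std::size_t Buckets>
class FixedMap {
public:
    // Inserts or overwrites; returns false if every bucket is taken.
    bool insert(std::uint32_t key, std::uint32_t value) {
        for (std::size_t probe = 0; probe < Buckets; ++probe) {
            Slot& s = slots_[(home(key) + probe) % Buckets];
            if (!s.used || s.key == key) {
                if (!s.used) ++size_;
                s = Slot{key, value, true};
                return true;
            }
        }
        return false;
    }

    std::optional<std::uint32_t> find(std::uint32_t key) const {
        for (std::size_t probe = 0; probe < Buckets; ++probe) {
            const Slot& s = slots_[(home(key) + probe) % Buckets];
            if (!s.used) return std::nullopt;  // empty slot ends the probe run
            if (s.key == key) return s.value;
        }
        return std::nullopt;
    }

    std::size_t size() const { return size_; }

    // Bytes of bucket storage, i.e. the memory a sparse table is spread over.
    static constexpr std::size_t footprint_bytes() {
        return sizeof(Slot) * Buckets;
    }

private:
    struct Slot {
        std::uint32_t key = 0;
        std::uint32_t value = 0;
        bool used = false;
    };

    static std::size_t home(std::uint32_t key) {
        return std::hash<std::uint32_t>{}(key) % Buckets;
    }

    std::array<Slot, Buckets> slots_{};
    std::size_t size_ = 0;
};

using DefaultMap = FixedMap<1000>;  // the default size mentioned above
using SmallMap = FixedMap<16>;      // assumed size for "only a handful" of elements
```

With a handful of keys, the entries in `DefaultMap` are scattered over all of its bucket storage, while `SmallMap` keeps them close together. Which sizes to pick for which call site is the part the profile has to tell you.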

There were others too, like redundant serialization, etc. But this one is the most interesting.

That's surprising. If I was writing this I'd have instrumented the code for the buckets to (optionally) log the use, and probably add an alert.

(being an armchair expert is easy though)

I also heavily used callgrind/cachegrind to tune critical paths in our high performance web proxy, where each micro/millisecond counts. For example, in media type detection, which is called multiple times per request (at minimum twice, for request and response), etc.

Sounds like the solution probably had something to do with switching to passing by reference + other changes I would assume.

A big pain point for using coroutines is having to pass by value more frequently due to uncertain lifetimes. It's jarring when you come from zero-copy programming.

That is what many people fail to understand as to why us C programmers dislike C++.

Indeed, because languages with reference parameters precede C by about 15 years, and are present in most ALGOL-derived dialects.

I have a similar experience with xdebug for a PHP shop I used to work at. It feels very similar to being a nerd back at school, rescuing people's homework, and being rewarded with some respect.