Hacker News new | ask | show | jobs
by s_kanev 2239 days ago
This was the case a few years back when the fastest pools were implemented with recursive data structures (e.g. linked lists for the freelists in gperftools).

In the new tcmalloc (and, I think, hoard?) the fastest pools are essentially slabs with bump allocation, so the fastest (and by far, the most common) calls are a grand total of 15 or so instructions, without many cache misses (size class lookups tend to stay in the cache). Call overhead can be a substantial chunk of that.

2 comments

First of all, let’s be clear: for a lot of algorithms adding call overhead around them is free because the cpu predicts the call, predicts the return, and uses register renaming to handle arguments and return.

But that’s not the whole story. Malloc perf is about what happens on free and what happens when the program accesses that memory that malloc gave it.

When you factor that all in, it doesn’t matter how many instructions the malloc has. It matters whether those instructions form a bad dependency chain, if they miss cache, whether the memory we return is in the “best” place, and how much work happens in free (using the same metrics - dependency chain length and misses, not total number of instructions or whether there’s a call).

> First of all, let’s be clear: for a lot of algorithms adding call overhead around them is free because the cpu predicts the call, predicts the return, and uses register renaming to handle arguments and return.

This is only thinking at the hardware layer. There are also effects at the software layer.

Function calls not known to LTO prevent all sorts of compiler optimizations as well. In addition, as Jeff says on this thread, not being able to inline free means you get no benefit from sized-delete, which is a substantial improvement on some workloads. (I'd cite the exact number, but I'm not sure Google ever released it.)

Source: I literally worked on this.

LTO I will grant you. I don't think you need to meaningfully inline the body of delete to get sized-delete to work effectively. You do want to statically link your sized operator delete body though, obviously. As of the last time I tried this, trying to inline the actual body was difficult to make effective, though.

(parent: read username :))

Hi, Andrew. :)

Agree, I shouldn't have been loose wrt the distinction between inlining and LTO. I think I also ran that experiment with inlining sized-deletes and came to the same (surprising, to me) conclusion.

Let’s be clear: even without LTO, the compiler knows what malloc and free do at a semantic level.

I buy the sized delete win, I’ve seen that too, in some niche cases. But that’s a far cry from the claim that not inlining malloc is bananas. If that’s all you got then this sounds like a stunning retreat from the thesis of Jeff’s post. You’re just saying that you’ve got a free that benefits from size specialization, which has nothing to do with whether malloc got inlined.

> even without LTO, the compiler knows what malloc and free do at a semantic level.

  int *a = ...;
  int b = *a;
  malloc(42);
  b = *a;
  printf(b);
If malloc is an opaque function, can the compiler nonetheless eliminate the second `b = *a` load as dead because it "knows what malloc does"? Certainly not, right?
https://godbolt.org/z/z-92qR

Interestingly, the compiler emits no call to malloc, even. What!?

With -fno-builtin. It's still "cheating" as b is local.

https://godbolt.org/z/U5VGxt

Without cheating:

https://godbolt.org/z/rzcXg2

It doesn't optimize the two stores to b away.

The compiler knows the semantics of malloc. Malloc as it appear in the source code doesn't even need to map 1:1 to calls to the library function, the compiler can remove and coalesce calls at will.
You can still get a benefit from sized delete, even without inlining.
You are not wrong but also missing the point of the person you're replying to.