Hacker News new | ask | show | jobs
by pslam 4586 days ago
A big extra cost of virtual functions in the underlying CPU not mentioned in the article: they effectively create a branch target dependency on a pointer chase. Put another way:

1) The virtual function address lookup requires a load from an address which is itself loaded. If neither location is cached, this has the unavoidable latency of two uncached memory accesses. Even at best, this incurs two cached L1 accesses, which is about 8-16 cycles on modern architectures.

2) The function call itself is dependent on the final address loaded above. None of that can proceed until the branch address is known. If cached, all is good and the core correctly predicts execution of a large number of instructions. Best case, the core may still block predicted execution shortly after due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses access.

In any case, nearly all of this is dwarfed by the cost to the compiled code itself: in most cases you can't inline, so simple transformations which could eliminate the function call altogether can't happen.

2 comments

Best case, the core may still block predicted execution shortly after due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses access.

You seem very familiar with these issues, but this doesn't sound right to me. Maybe I'm not understanding your terminology, but don't all modern processors support speculative execution? All instructions (including dependent) are executed, but the results are held in the Reorder Buffer until the branch choice is confirmed. If this is still a large issue, why don't Eli's measurements show it to be?

If the branch target is an address loaded from memory, and there is no cached result for the branch instruction, then there's no way it can predict which instruction to execute next. The target could be anywhere in valid memory.

The reason the measurements don't show it is the micro-benchmark will be predicting very well. In fact it's quite difficult to defeat prediction even for giant codebases, and you probably have bigger issues with L1 thrashing at that point. The more subtle problem is even with prediction, there's a (quite high) limit to the number of unretired speculated instructions. Again, a micro-benchmark won't show that up - you'd need a large function in the inner loop.

I'm making it sound like there's no cost to virtual functions in real applications, but it's there, usually measurable and every little adds up. If anything, I think a better reason to not simply spray "virtual" everywhere is it demonstrates that the author didn't understand the data structures they created.

On the other hand, having no cached result for the branch probably correlates strongly with not having the target in I-cache, which means you may be stalling out anyway. It also implies that the branch is not in the middle of a tight loop.

Regarding the size of the ROB, I was wondering about the size a while ago and found an interesting post from someone who measured it for modern Intel processors: Ivy Bridge (168), Sandy Bridge (168), Lynnfield (128), Northwood (126), Yorkfield (96), Palermo (72), and Coppermine (40). http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/

I agree with you on the virtual part. I'm actually a C programmer more interested in how to implement efficient dispatch for interpreters. Eli (the original author) has some good posts on that as well: http://eli.thegreenplace.net/2012/07/12/computed-goto-for-ef...

Can profile-guided optimization realise that a certain virtual function almost always resolves to a specific implementation and have a conditional check to inline or optimize when needed?

I'm not overly experienced with complicated OO systems, but sometimes it seems the OO is just an abstraction for convenience, but runtime will always take a particular path.

Note you've just described inline caching [1]. A big research topic in the 90s, I'm sure this is pretty much a non-issue these days.

[1] http://en.wikipedia.org/wiki/Inline_caching

Microsoft's PGO/LTCG implementation does just this. GCC can do something similar as well.
My understanding is that good virtual machines basically do this sort of profiling and optimization at runtime and JIT compile specializations as necessary.

Does anybody know why JIT isn't done in classically AOT compilers? Is JIT overhead generally higher than cost savings of the optimizations?

> Does anybody know why JIT isn't done in classically AOT compilers?

One (admittedly incomplete) answer is that AOT compilers try to replicate many of the wins that JIT compilers get from runtime specialization by including a profile-guided optimization pass instead, which specializes ahead of time, using data logged from what you hope is a representative example of runtime.

Good JIT compilers can do things like optimizing fast paths, discovering latent static classes in highly dynamic languages, etc. These kinds of optimizations can also be done AOT, if you have good profile data and suitable analysis & optimization passes.

The pros/cons of each approach are not entirely resolved, and you will find varying opinions. Part of the problem with making a direct comparison is that there are large infrastructural inconveniences with switching from one approach to the other. A good JIT is a quite pervasive beast, not something you can just tack on as a nice-to-have. PGO is somewhat infrastructurally easier to add to an existing AOT compiler. Therefore, if you can do most of what JIT does via PGO, you would prefer to do that, were you the maintainer of an existing AOT compiler. Whether you really can is afaik a bit of an open question.

I think something that's often overlooked in this discussion is the language semantics differences. So we're not just comparing AOT with JIT (or why not JIT an AOT compiled app...) We're almost always also comparing C++ to the JVM/CLR worlds.

And then the point is that most optimizations a JIT can do that an AOT cannot are particulaly important where the language semantics are "too" flexible. If your code has lots and lot of virtual calls; lots of exceptions with unpredictable code flow - well, sure, it's really important to elide that flexibility where it's not actually used. That's kind of like JS Vm's nowadays speculatively type their untyped objects - it's a huge win, and not possible statically.

But the point is - these optimizations are critical because the languages don't allow (or encourage) code to disable these dynamic features. In C++ this can be helpful; but how often is dynamic devirtualization really going to matter? I mean, you can statically devirtualize certain instances (e.g. whole-program optimization reveals only two implementations and replaces a virtual call with an if), but the real code-could by any subtype but actually isn't scenario just isn't one that comes up often.

The consequence is that C++ gets most of the benefits of a JIT simply because the JIT is solving problems C++ compilers don't need to solve. The cost is that the compiler wastes inordinate amounts of time compiling your entire program as optimally as it can, even though it only has a few hotspots.

I'm not an expert on the topic, but JIT is not allowed in certain places - like game consoles, ios, etc, to the point where even simpler trampoline (ffcall, libffi) are not allowed too. As such AOT helps there where JIT can't be used, but does not support absolutely everything (C# templates when comes to types that are not yet loaded I guess?).
With C# generics on non-struct types, the emitted code is the same regardless of the type. On struct-types, it's necessarily different because there's no object header and the size of the objects can differ.

But, I can't think of a way to get an unknown struct type into a generic function without boxing (which makes it no longer a struct type). So for an AOT compiler it shouldn't be an issue. And there's probably some clever way to emit generic code for structs, too.

But generics have their own implementation difficulties and Mono didn't support them for a while in AOT.