As usual, memory access is the expensive operation. If the virtual function is already in L1 cache a virtual function should be only a minor slowdown. If it's all the way off in main memory it will be significantly slower.
Exactly. Calling a virtual function in a tight loop, almost indistinguishable from a direct (non-inlined) function call. Calling a virtual function every now and then, much more expensive, but also not as likely to matter much to the performance of your program on the whole anyway.
Still worth noting that every VM like Java or .NET or LuaJIT will optimize for the case that a virtual call usually has only one or two commonly invoked target functions. They will use a conditional fast-path for the common case(s) and fall back to slower virtual calls or less optimized paths for megamorphic calls. I don't know if any C++ compilers do this kind of thing, but I would be surprised if they did not.
As you suggest, it's a little misleading to talk about the runtime cost of a virtual or indirect function call in isolation. The most significant cost is often the resulting inability of the compiler to inline and perform further optimizations across the caller and callee.
Nice, I had a feeling they did that, thanks for the link.
I fully agree, missed optimization opportunity from not being able to inline is usually a bigger effect, because loops dominate the runtime and that's where those optimizations count.
Virtual function / function pointer calls can still have a big performance impact, especially if it could be inlined away.
Compare C standard library qsort with C++ std::sort performance for sorting integers. The difference is huge in favor of std::sort. If you write your own qsort and put it in a header file so the indirect calls can be inlined, they should be equal.
i've seen virtuals completely removed and inlined by gcc when there is enough information for it to do that, and a simple enough use case... probably other compilers are similar.
Even then, the article and co-commentators exaggerate the cost and factors you should worry about for virtual calls. Unless you're optimizing specifically for an in-order CPU, the only cost of virtual functions that really matters in the general/high level case is not being able to inline them.
An important factor is whether it can be branch predicted. If the virtual call nearly always goes to the same method implementation (e.g., always an OpenGL renderer or always a DirectX renderer), it can actually be pretty cheap. They get expensive mainly when the call is constantly jumping to implementations for different object types.
A virtual call in C++ is just a vtable lookup then a direct C call. Literally one pointer away. (or perhaps one pointer then one integer addition away? For the offset)
However, the address of this C call is calculated, which means that branch prediction (and the cost of potential misprediction) is involved. So additional cost we're talking about, is roughly a cost of L1 read (3-4 clocks by itself) - plus (0-<cost-of-branch-misprediction))
Is it that simple for virtual methods in complex class hierarchies? If the class hierarchy isn't linear, how can you assign simple absolute vtable indices for methods?
Yes it is. The C++ ABI defines how any permitted inheritance hierarchy is mapped into a surjective, ordered set of vtables. Also, calls into methods of other classes in the hierarchy offset the this pointer accordingly.
It feels counterintuitive that these two mechanisms are sufficient; but they are.
The cost of virtual calls (and calls in general) in the article is greatly exaggerated. Jumps, calls and indirect calls have an effective latency of zero as they are resolved in the fetch stage (and speculated if the target is not known) and a troughput of one taken jump every one or two cycles.
In a virtual call, the offset computation and load is only used to compute the jump target (which is only used later on commit), which means that is not on the critical path [1].
They consume execution and load bandwith, but that's not normally the bottleneck. The cost of virtual cals is usually due to missed optimizations and the possibility of misprediction.
[1] technically if the load is delayed long enough, the CPU might be prevented from committing the speculated instructions and eventually exhaust its reoerder buffer and stall.
> The cost of virtual cals is usually due to missed optimizations and the possibility of misprediction.
Yep (plus one L1 read to get that VMT pointer). And the OP is pretty much consistent with this (at least within declared "indications of the order of magnitude" precision).
Eric Brumer has a great talk on this: https://channel9.msdn.com/Events/Build/2013/4-329