Hacker News
by kevingadd 4585 days ago
The CLR's inlining for virtual calls is constrained specifically to interfaces, not to all uses of 'virtual', IIRC. (Interfaces are still very much virtual calls, they just don't use the 'virtual' keyword.)

See [1] for an example where the CLR fully inlines a virtual call (through an interface, specifically).

The call is most definitely virtual (or dynamic if you prefer that term), not statically-dispatched. It just happens to be performed through an interface. I suspect the CLR optimizes this because interfaces are incredibly common (IEnumerable, etc.)

[1] http://msdn.microsoft.com/en-us/library/ms973852.aspx


I get even worse results using an interface for the Add function. That is, if I do "new X() as InterfaceType" then call the function, the performance is 5x worse than if I don't cast to the interface. This is in a tight loop doing an add.

Do you have any actual examples of making an interface call that gets inlined? This post[1], dated 2004 (later than the MSDN article you referenced) from Eric Gunnerson says:

"all the compiler knows is that it has an IProcessor reference, which could be pointing to any instance of an type that implementes IProcessor. There is therefore no way for the JIT to inline this call - the fact that there is a level of indirection in interfaces prevents it. That's the source of the slowdown."

He goes on to say that Sun does do something since Java makes everything virtual, and the CLR could do it in theory, but doesn't.

I skimmed through the linked article you provided but didn't find any mention of inlining interface method calls. On the excellent performance of virtual/interface calls, it says:

"the combination of caching the virtual method and interface method dispatch mechanisms (the method table and interface map pointers and entries) and spectacularly provident branch prediction enables the processor to do an unrealistically effective job"

1: http://blogs.msdn.com/b/ericgu/archive/2004/03/19/92911.aspx...

I've seen interface calls be inlined in action on the modern CLR when looking at disassembly. I don't understand why you would have expected the interface call to be faster than a normal non-virtual call. The interface call always needs a type check before the inlined call body in case of polymorphism; it can't be as fast as a normal call.

Or are you saying the interface call is 5x slower than a virtual call? That definitely isn't right.

I expect an inlined interface or virtual call to be the same as an inlined non-virtual call. But since the CLR (4.5, Windows 7 x64, using 32 or 64-bit codegen) won't emit an inlined virtual/interface call for int Add(int, int) -- it's slower.

In my simple program doing a loop, calling an Add function on an interface, it is definitely making a function call each time. It unrolls 4 times, and loads the function pointer once per iteration - I'd have thought it would only load it once overall. Loop is 89 bytes. There is no conditional inside the loop to check for the type.[1]

If I change it to not use the interface (don't cast to the interface type), it's unrolled and inlined. Loop is 34 bytes.[2]

It's the same on 32-bit, except there's no unrolling. The non-virtual loop body is 2 instructions (inc, add). The interface has a push, 3 movs and a call. The virtual one requires two extra movs (to load the function pointer - with an interface the address is embedded as a literal).

Shrug. Maybe it still doesn't work with value types? I started it without VS then broke in with the debugger to get the disassembly.

The loop is doing "y = x.Add(y, i)" where y is a local.
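
For anyone who wants to reproduce this, a minimal sketch of the kind of test described above might look like the following. It is not the original program: the names IAdder, Adder, VirtualAdder and AddBench are made up for illustration, and the iteration count is arbitrary. The sum overflows int and wraps around, which is harmless here since only the generated loop matters.

```csharp
using System;
using System.Diagnostics;

// Hypothetical types, invented for this sketch only.
public interface IAdder
{
    int Add(int a, int b);
}

// Non-virtual Add that also satisfies the interface.
public sealed class Adder : IAdder
{
    public int Add(int a, int b) { return a + b; }
}

// Same body, but declared with the 'virtual' keyword.
public class VirtualAdder
{
    public virtual int Add(int a, int b) { return a + b; }
}

public static class AddBench
{
    // Call through the concrete type (no cast to the interface).
    public static int RunDirect(Adder x, int n)
    {
        int y = 0;
        for (int i = 0; i < n; i++) y = x.Add(y, i);
        return y;
    }

    // Same loop, but the call goes through the interface reference.
    public static int RunInterface(IAdder x, int n)
    {
        int y = 0;
        for (int i = 0; i < n; i++) y = x.Add(y, i);
        return y;
    }

    // Same loop, but the call goes through a virtual method.
    public static int RunVirtual(VirtualAdder x, int n)
    {
        int y = 0;
        for (int i = 0; i < n; i++) y = x.Add(y, i);
        return y;
    }

    // Times one variant and prints the result so the loop is not discarded.
    static void Time(string label, Func<int> run)
    {
        Stopwatch sw = Stopwatch.StartNew();
        int result = run();
        sw.Stop();
        Console.WriteLine("{0}: {1} ms (result {2})", label, sw.ElapsedMilliseconds, result);
    }

    public static void Main()
    {
        const int N = 100000000; // arbitrary iteration count
        Adder direct = new Adder();
        IAdder viaInterface = new Adder();
        VirtualAdder viaVirtual = new VirtualAdder();

        Time("direct", () => RunDirect(direct, N));
        Time("interface", () => RunInterface(viaInterface, N));
        Time("virtual", () => RunVirtual(viaVirtual, N));
    }
}
```

Build it in Release mode and start it outside the debugger before attaching, as described above; otherwise the JIT may skip optimizations and the disassembly will not show what it normally generates.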

Edit: Aha! Using an interface method (not virtual) and strings, I was able to get inlining. I guess the CLR is still weak in dealing with value types.

1: Start of the loop using an interface:

  lea r8d,[rdi-1]
  mov rbx,qword ptr [FFEEFE60h]
  mov edx,eax
  mov rcx,rsi ; rsi is the object pointer
  lea r11,[FFEEFE60h] ; I am embarrassed to admit I don't know what r11 is doing
  call rbx ; just does lea eax,[rdx+r8], ret
  ; similarly 3 more times then loop

2: Without using the interface, the loop body:

  lea eax,[r8-1] ; r8's the counter
  add ecx,eax   
  lea edx,[rcx+r8]
  lea ecx,[rdx+rax]
  lea eax,[r8+2]
  add ecx,eax
  ; then loop

Nice work digging in and figuring out how to trigger it. I fiddled around some earlier and wasn't able to reproduce the behavior I saw before, so I gave up. :) You are correct that the CLR does a poor job optimizing value types, and I probably made the same mistake (i.e. used a struct).