| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by sgillen 2743 days ago
	Do you know if an optimizing compiler will ever make this transformation?

2 comments

chc4 2743 days ago

Every single C compiler worth mentioning will turn a switch statement into a jump table. The difference between a jump table and computed goto is only that one is `jmp [ecx+eax8]` and the other is `mov edx, [ecx+eax8]; jmp edx`. The later is faster because of weird branch prediction reasons, and no bounds checks in the switch.

wahern 2742 days ago

> Every single C compiler worth mentioning will turn a switch statement into a jump table

Only if the case values are small integers with a compact range. But if your opcodes are large and/or sparse (more common in non-VM scenarios) compilers don't optimize switch statements very well.

There's a ton of research on switch statement optimization, but most of it doesn't translate to real-world scenarios very well. Compilers will give up fairly quickly on trying to generate a jump table because the work necessary to map the case value to an index at runtime can easily cost more than its worth in many situations.

Here's some real-world code which you can easily benchmark and compare: https://github.com/wahern/hexdump/blob/master/hexdump.c#L752

Just flip the VM_FASTER macro. Using `hexdump -C < 64MB.random.file > /dev/null`, on my circa 2012 Mac Mini the switch statement is 80% slower than the computed goto and on my circa 2018 Macbook Pro it's about 15% slower. (See my note elsethread about the post-Haswell branch prediction.) And this is despite the fact that the opcode range is 0-32 without holes!

Also, with computed gotos you can remove the jump table altogether. You can make your opcodes actual addresses (just like native code) if you can make the label addresses visible to the code generator. (See my post elsethread for how to make them visible.)

P.S. While I implemented hexdump.c using a VM mostly for fun, the end result not only out performs GNU od and BSD hexdump, but IMHO the code is much easier to read, too.

tom_mellior 2743 days ago

> Every single C compiler worth mentioning will turn a switch statement into a jump table.

Yes, but that is not the optimization discussed here. The optimization discussed here is looking ahead to the next bytecode instruction's opcode, using it as a key into a different jump table, and jumping to that target. No C compiler I'm aware of does that.

C compilers could try to match on common patterns in interpreter implementation (a switch within a loop, keyed on certain bits of a piece of data read mostly linearly from an array) and heroically generating a dispatch table, but that would still fail in the cases of branches where you don't want to call DISPATCH because execution does not necessarily proceed to the next instruction in the array.

vardump 2743 days ago

HN formatting ate some * from the parent post:

  jmp [ecx+eax * 8]

  and the other is

  mov edx, [ecx+eax * 8]; 
  jmp edx

> The later is faster because of weird branch prediction reasons...

Things like that are microarchitecture dependent. Might be true on particular CPUs.

Of course it's possible separate MOV could be executed much earlier in the out of order pipeline while JMP effective address calculation might not. So JMP address (edx) would be already resolved by the time JMP is actually executing.

dawkins 2743 days ago

Do you know if the Go compiler does this too?

Edit: Nevermind, it doesn't: https://github.com/golang/go/issues/5496

ape4 2743 days ago

When you are code a `switch` you're telling the compiler what you want. So you get decades of compiler optimization working for you.

banachtarski 2743 days ago

Oh psh optimizing compilers go way beyond this. In fact, even hand implemented VMs also generally go further. The technique above is called tail dispatch and is really just the beginning.

astrobe_ 2743 days ago

As far as I am concerned, I stopped worrying and use direct threaded code (DTC): a VM instruction is just a pointer to the C function implementing it. It's reasonably portable and one can expect stable performance across platforms. Because, when you look at the benchmarks you find out that you are talking about single-digit percent difference between different techniques sometimes, and variations in cache size, branch prediction, even unaligned memory fetch tax on odd targets, all these details can make a different king-of-the-hill on a different processor. Compilers can also deal with a technique better than another.

You can gain far more by working on your instruction set. Making it easily extendable is better, making it easy to transpose interpreted code into native code is better, because you can easily benchmark and optimize as you have a more and more clear idea of what will be idiomatic code. For the DTC technique, a bonus is that there's normally no difference between instructions and library bindings.

tom_mellior 2742 days ago

> direct threaded code (DTC): a VM instruction is just a pointer to the C function implementing it.

Nitpick: This sounds more like subroutine threading than direct threading. https://en.wikipedia.org/wiki/Threaded_code

In subroutine threaded code, each function implementing an instruction returns before the next one is called. In direct threading, each code snippet implementing an instruction loads the next address and jumps to it directly without going through a return and another call.

astrobe_ 2742 days ago

It sounds like it, but it is not. It gets confusing fast when one talks about these things in even slightly accurate ways.

In DTC to call a function of the VM code (ie a another piece of DTC) you typically have a "call" virtual instruction, followed by the virtual address to call. you also have a virtual "return" function that pops the virtual instruction pointer from the virtual call stack.

Subroutine threaded is really just a very primitive form of AoT compilation.

tom_mellior 2741 days ago

None of this explains why you think that your code in which "a VM instruction is just a pointer to the C function implementing it" is direct threaded. If you call a C function for each VM instruction, and that function returns before the next C function is called, then your threading is not direct.

toolslive 2743 days ago

indeed. I think the second optimization on the list is super-instructions: frequent combinations of 2 (or even 3) instructions are transformed into 1 equivalent instruction, lowering the average cycle count.

tonyjstark 2743 days ago

Where can I learn about techniques like that? Any recommendation would be very appreciated!

tom_mellior 2743 days ago

Look up work by Stefan Brunthaler, like this overview paper ("Virtual-Machine Abstraction and Optimization Techniques"): https://www.sba-research.org/wp-content/uploads/publications...

Make sure to check out the references. I especially like "Ertl, M. A. and D. Gregg, The structure and performance of efficient interpreters, Journal of Instruction-Level Parallelism 5 (2003)"

yazr 2743 days ago

There goes the weekend...

Do you know of any recent efforts to combine ML and VM optimizations ?

(I read about the VS compiler and profile-based optimizations)

tom_mellior 2743 days ago

This one is good: http://www.tara.tcd.ie/bitstream/handle/2262/64047/Optimizin...

Though you might not think of hill climbing as "machine learning".

vkazanov 2743 days ago

Those techniques are a relatively niche thing. People don't write bytecode interpreters daily. Couple of months ago I wrote a few articles on the topic. Unfortunately, they are in Russian. I'll probably translate 'em at some but for now...

the repo and performance measurements are on Github[1]. It includes a couple of dispatch techniques, register caching, etc.

[1]: https://github.com/vkazanov/bytecode-interpreters-post

toolslive 2743 days ago

article below contains an overview and more pointers.

http://realityforge.org/code/virtual-machines/2011/05/19/int...