Every single C compiler worth mentioning will turn a switch statement into a jump table. The difference between a jump table and computed goto is only that one is `jmp [ecx+eax8]` and the other is `mov edx, [ecx+eax8]; jmp edx`. The later is faster because of weird branch prediction reasons, and no bounds checks in the switch.
> Every single C compiler worth mentioning will turn a switch statement into a jump table
Only if the case values are small integers with a compact range. But if your opcodes are large and/or sparse (more common in non-VM scenarios) compilers don't optimize switch statements very well.
There's a ton of research on switch statement optimization, but most of it doesn't translate to real-world scenarios very well. Compilers will give up fairly quickly on trying to generate a jump table because the work necessary to map the case value to an index at runtime can easily cost more than its worth in many situations.
Just flip the VM_FASTER macro. Using `hexdump -C < 64MB.random.file > /dev/null`, on my circa 2012 Mac Mini the switch statement is 80% slower than the computed goto and on my circa 2018 Macbook Pro it's about 15% slower. (See my note elsethread about the post-Haswell branch prediction.) And this is despite the fact that the opcode range is 0-32 without holes!
Also, with computed gotos you can remove the jump table altogether. You can make your opcodes actual addresses (just like native code) if you can make the label addresses visible to the code generator. (See my post elsethread for how to make them visible.)
P.S. While I implemented hexdump.c using a VM mostly for fun, the end result not only out performs GNU od and BSD hexdump, but IMHO the code is much easier to read, too.
> Every single C compiler worth mentioning will turn a switch statement into a jump table.
Yes, but that is not the optimization discussed here. The optimization discussed here is looking ahead to the next bytecode instruction's opcode, using it as a key into a different jump table, and jumping to that target. No C compiler I'm aware of does that.
C compilers could try to match on common patterns in interpreter implementation (a switch within a loop, keyed on certain bits of a piece of data read mostly linearly from an array) and heroically generating a dispatch table, but that would still fail in the cases of branches where you don't want to call DISPATCH because execution does not necessarily proceed to the next instruction in the array.
jmp [ecx+eax * 8]
and the other is
mov edx, [ecx+eax * 8];
jmp edx
> The later is faster because of weird branch prediction reasons...
Things like that are microarchitecture dependent. Might be true on particular CPUs.
Of course it's possible separate MOV could be executed much earlier in the out of order pipeline while JMP effective address calculation might not. So JMP address (edx) would be already resolved by the time JMP is actually executing.
Oh psh optimizing compilers go way beyond this. In fact, even hand implemented VMs also generally go further. The technique above is called tail dispatch and is really just the beginning.
As far as I am concerned, I stopped worrying and use direct threaded code (DTC): a VM instruction is just a pointer to the C function implementing it. It's reasonably portable and one can expect stable performance across platforms. Because, when you look at the benchmarks you find out that you are talking about single-digit percent difference between different techniques sometimes, and variations in cache size, branch prediction, even unaligned memory fetch tax on odd targets, all these details can make a different king-of-the-hill on a different processor. Compilers can also deal with a technique better than another.
You can gain far more by working on your instruction set. Making it easily extendable is better, making it easy to transpose interpreted code into native code is better, because you can easily benchmark and optimize as you have a more and more clear idea of what will be idiomatic code. For the DTC technique, a bonus is that there's normally no difference between instructions and library bindings.
In subroutine threaded code, each function implementing an instruction returns before the next one is called. In direct threading, each code snippet implementing an instruction loads the next address and jumps to it directly without going through a return and another call.
It sounds like it, but it is not. It gets confusing fast when one talks about these things in even slightly accurate ways.
In DTC to call a function of the VM code (ie a another piece of DTC) you typically have a "call" virtual instruction, followed by the virtual address to call. you also have a virtual "return" function that pops the virtual instruction pointer from the virtual call stack.
Subroutine threaded is really just a very primitive form of AoT compilation.
None of this explains why you think that your code in which "a VM instruction is just a pointer to the C function implementing it" is direct threaded. If you call a C function for each VM instruction, and that function returns before the next C function is called, then your threading is not direct.
indeed. I think the second optimization on the list is super-instructions: frequent combinations of 2 (or even 3) instructions are transformed into 1 equivalent instruction, lowering the average cycle count.
Make sure to check out the references. I especially like "Ertl, M. A. and D. Gregg,
The structure and performance of efficient interpreters, Journal of Instruction-Level Parallelism
5 (2003)"
Those techniques are a relatively niche thing. People don't write bytecode interpreters daily. Couple of months ago I wrote a few articles on the topic. Unfortunately, they are in Russian. I'll probably translate 'em at some but for now...
the repo and performance measurements are on Github[1]. It includes a couple of dispatch techniques, register caching, etc.