Clang will very aggressively optimize with -O2 (or whatever the max speed optimization is). It think gcc and MSVC as well.
I was experimenting with a simple indirect-threaded interpreter, which I wanted to write in a "nice" style (1 function per opcode) but have compiled in an "optimized" way (state machine). Playing around in Godbolt, I found out that the compilers managed to detect & optimize even indirect tail calls, i.e. tail calls to function pointers. This was indeed on a toy example (i.e. like 5 functions, not a real interpreter with 100 different opcodes) but still truly amazing.
The reason for this was, I was trying to replicate LuaJIT2 - Mike Pall was complaining that the reason interpreters written in C are slow is, that the compiler cannot optimize & register-allocate a large function containing a switch statement for 100 opcodes. Instead, he wrote the interpreter in asm, and made sure all important variables are kept in registers. My thesis was that we could do the same thing by having a bunch of functions tail-calling each other, with all important variables as parameters (which would be optimized into registers). Better, actually, because the compiler could improve register allocation within each "opcode"/function.
My recollection (without references) is that they can do some amount of guess work. They might see the stack says A->C, but can see the code of A actually calls B, so logically it must have been A->B->C at some point.
Or you can use the rr-debugger, and just return back to the exact sequence of calls, and reject the need to pick between optimization and debugging :D
I usually say “punexpected” because I think what’s meant by “no pun intended” is usually “I only recognized the pun after I said it”. But I agree! Puntention is best.
I was experimenting with a simple indirect-threaded interpreter, which I wanted to write in a "nice" style (1 function per opcode) but have compiled in an "optimized" way (state machine). Playing around in Godbolt, I found out that the compilers managed to detect & optimize even indirect tail calls, i.e. tail calls to function pointers. This was indeed on a toy example (i.e. like 5 functions, not a real interpreter with 100 different opcodes) but still truly amazing.
The reason for this was, I was trying to replicate LuaJIT2 - Mike Pall was complaining that the reason interpreters written in C are slow is, that the compiler cannot optimize & register-allocate a large function containing a switch statement for 100 opcodes. Instead, he wrote the interpreter in asm, and made sure all important variables are kept in registers. My thesis was that we could do the same thing by having a bunch of functions tail-calling each other, with all important variables as parameters (which would be optimized into registers). Better, actually, because the compiler could improve register allocation within each "opcode"/function.