| HN Mirror

I should add that the GP only asked about CPU instructions for faster indirect jumps, but I should add that there are at least 4 things that would help a system designed for pervasive dynamic re-optimization of native code:

1. Two special registers (trace_position pointer and trace_limit pointer) for compact tracing of native code. If the position is less than the limit, for all backward branches, indirect jumps, and indirect function calls, the branch target is stored at the position pointer, and the position pointer is incremented. Both trace_position and trace_limit are initialized to zero at thread start, disabling tracing. When the profiling timer handler (presumably SIGVTALRM handler on Linux) executes, it would do some heuristic to determine if tracing should start. If so, it would store the resumption instruction pointer to the start of a thread_local trace buffer, set trace_position to point to the second entry in the trace buffer, and set trace_limit to one after the end of the trace buffer. There is no need to implement a separate interrupt for when the trace buffer fills up, it just turns of tracing; instead, re-optimizing the trace can be delayed until the next time the profiling timer handler is invoked.

2. Lighter weight mechanism for profiling timers that can both be set up and handled without switching from user space to kernel space. Presumably it looks like a cycle counter register and a function pointer register that gets called when the counter hits zero. Either the size of the ABI's stack red zone would be hard-coded, or there would need to be another register for how much to decrement the stack pointer to jump over the red zone when going into the signal handler.

3. Hardware support for either unbiased reservoir sampling or a streaming N-most-frequent algorithm[0] to keep track of the instruction pointer of the instructions causing pipeline stalls. This helps static instruction scheduling for those spots where the CPU's re-order buffer isn't large enough to prevent stalls. (Lower power processors/VLIWs typically don't execute out of order, so this would be especially useful there.) Reservoir sampling can be efficiently approximated using a linear feedback shift register PRNG logical-anded against a mask based on the most significant set bit in a counter. I'm not aware of efficient hardware approximations of a streaming N-most-frequent algorithm. One of the big problems with Itanium is that it relies on very good static instruction scheduling by the compiler, but that involves being good at guessing which memory reads are going to be cache misses. On most RISC processors, the number of source operands is less than the number of bytes per instruction, so you could actually encode which argument wasn't available in cases where, for instance, you're adding two registers that were both recently destinations of load instructions.

4. A probabilistic function call instruction. For RISC processors, the target address would be ip-relative with an offset stored as an immediate value in the instruction. The probability the function is taken would be encoded in the space usually used to indicate which registers are involved. This allows lightweight profiling by calling into a sampling stub that looks back at the function return address. Presumably some cost estimation heuristic would be used to determine the probability embedded in the instruction to make the sampling roughly weighted by cost.

[0] https://en.wikipedia.org/wiki/Streaming_algorithm#Frequent_e...