Hacker News new | ask | show | jobs
by rocqua 1984 days ago
Could such a compiler include the runtime for this in the binary as an option? That might make it a lot more likely to be used by people, because it is all nice and stand-alone.

Who would benefit from this most? Is the benefit so diffuse it would almost have to be an open-source project without funding? Or could there be parties that see enough of an advantage to fund this?

I guess you could try and get a certain instruction set vendor (probably RISC-V, maybe ARM or x86 based) to have this as a boost for their chips. I guess the "functions are pointers to blocks" compilation could benefit from hardware acceleration.

1 comments

You could presumably statically link in the runtime. Also, without the dynamically-optimizing runtime, it would run just fine, just a bit slower than normal native code due to the extra indirection. Lots of indirect calls also increase the chances of address mispredictions due to tag collisions in the BTB (Branch Target Buffer).

Function calls/jumps through arrays of pointers are how virtual method calls/optimized virtual method tail calls are executed. Though, in this case, the table offsets would be held in a register instead of immediate values embedded within the instruction. I'm not aware of any instruction set where they've decided it's worthwhile making instructions specifically to speed up C++ virtual member function dispatch, so I doubt they'd find optimizing this worthwhile.

Also, if things go according to plan, your hot path is a long strait run of code, with only occasional jumps through the table.

I should add that the GP only asked about CPU instructions for faster indirect jumps, but I should add that there are at least 4 things that would help a system designed for pervasive dynamic re-optimization of native code:

1. Two special registers (trace_position pointer and trace_limit pointer) for compact tracing of native code. If the position is less than the limit, for all backward branches, indirect jumps, and indirect function calls, the branch target is stored at the position pointer, and the position pointer is incremented. Both trace_position and trace_limit are initialized to zero at thread start, disabling tracing. When the profiling timer handler (presumably SIGVTALRM handler on Linux) executes, it would do some heuristic to determine if tracing should start. If so, it would store the resumption instruction pointer to the start of a thread_local trace buffer, set trace_position to point to the second entry in the trace buffer, and set trace_limit to one after the end of the trace buffer. There is no need to implement a separate interrupt for when the trace buffer fills up, it just turns of tracing; instead, re-optimizing the trace can be delayed until the next time the profiling timer handler is invoked.

2. Lighter weight mechanism for profiling timers that can both be set up and handled without switching from user space to kernel space. Presumably it looks like a cycle counter register and a function pointer register that gets called when the counter hits zero. Either the size of the ABI's stack red zone would be hard-coded, or there would need to be another register for how much to decrement the stack pointer to jump over the red zone when going into the signal handler.

3. Hardware support for either unbiased reservoir sampling or a streaming N-most-frequent algorithm[0] to keep track of the instruction pointer of the instructions causing pipeline stalls. This helps static instruction scheduling for those spots where the CPU's re-order buffer isn't large enough to prevent stalls. (Lower power processors/VLIWs typically don't execute out of order, so this would be especially useful there.) Reservoir sampling can be efficiently approximated using a linear feedback shift register PRNG logical-anded against a mask based on the most significant set bit in a counter. I'm not aware of efficient hardware approximations of a streaming N-most-frequent algorithm. One of the big problems with Itanium is that it relies on very good static instruction scheduling by the compiler, but that involves being good at guessing which memory reads are going to be cache misses. On most RISC processors, the number of source operands is less than the number of bytes per instruction, so you could actually encode which argument wasn't available in cases where, for instance, you're adding two registers that were both recently destinations of load instructions.

4. A probabilistic function call instruction. For RISC processors, the target address would be ip-relative with an offset stored as an immediate value in the instruction. The probability the function is taken would be encoded in the space usually used to indicate which registers are involved. This allows lightweight profiling by calling into a sampling stub that looks back at the function return address. Presumably some cost estimation heuristic would be used to determine the probability embedded in the instruction to make the sampling roughly weighted by cost.

[0] https://en.wikipedia.org/wiki/Streaming_algorithm#Frequent_e...