Hacker News new | ask | show | jobs
by JoeAltmaier 4580 days ago
Wow I'd need some evidence to believe JIT can do a better job than a 'static' compiler. What can JIT do to optimize 'much more so'? How about a little more? If you know some common conditional jump stats you can do slightly better. Is there anything else?
4 comments

A static compiler has to generate code that handles every possible path through the code. This tends to pessimize code downstream of a control flow merge point, because such code has to be compiled assuming it can be reached via multiple paths. A trace compiler can compile only the paths that are actually taken on a given workload, and compile the (dynamically) uncommon cases into exits to the interpreter.

The firewall rule example Mike Pall gives is a pretty good one. You have code that has many different paths through it, but only specific ones are triggered, dynamically, by a particular attack. Those can be compiled into traces that assume the other paths are not taken (except to guard the possible control flow splits with trace exits). This can allow an optimizer to make much more aggressive assumptions on the actually-taken paths.

My point exactly. This slight optimization is all you get. I think this technology is being oversold if that's all they got.
You might want to read about Dynamo:

http://www.hpl.hp.com/techreports/1999/HPL-1999-78.pdf

"Contrary to intuition, we demonstrate that it is possible to use a piece of software to improve the performance of a native, statically optimized program binary, while it is executing. Dynamo not only speeds up real application programs, its performance improvement is often quite significant."

Oh, well ... pasting my standard rant on this:

This is a common misinterpretation of the Dynamo paper: they compiled their C code at the lowest optimization level and then ran the (suboptimal) machine code through Dynamo. So there was actually something left to optimize.

Think about it this way: a 20% difference isn't unrealistic if you compare -O1 vs. -O3.

But it's completely unrealistic to expect a 20% improvement if you'd try this with the machine code generated by a modern C compiler at the highest optimization level.

Claiming that JIT compilers outperform static compilers, solely based on this paper, is an untenable position.

But, yes, JIT compilers can outperform static compilers under specific circumstances. This has more to do with e.g. better profiling feedback or extra specialization opportunities. But this is not what this paper demonstrates.

Many compiler optimizations have strong non-linear costs in terms of the number of control flow edges. A static compiler has to punt at a certain complexity. OTOH a JIT compiler is free to ignore many edges, since it may fall back to an interpreter for cold paths or attach new code anytime later if some edges become hot.

One interesting example is auto-vectorization (SIMDization) where static compilers have to generate code for all possible combinations of vector alignments in case the underlying alignment of the participating vectors is not statically known. This quickly gets very expensive in terms of code space. OTOH a JIT compiler can simply specialize to the observed vector alignment(s) at runtime, which show almost no variation in practice.

Thank you for your comment. Very interesting points.

Isn't it wrong to say "compiled their C code at the lowest optimization level" though? As I read the paper, they compiled their C code at -O2. It's hard to work backwards in time to know exactly what that meant when they published their paper, but I suspect it meant something very similar to the current gcc definition:

-O2 Perform nearly all supported optimizations that do not involve a space-speed tradeoff.

Dynamic binary translation can sometimes get performance improvements. This is because you would still statically compile your program, and then at runtime, the DBT system dynamically decodes and re-encodes the binary instructions. With a DBT system, you can do tracing, by identifying hot paths and such and lining those basic blocks up in your code cache.
you have a very large number of options so getting code locality between the ones that are applicable is important, at a guess. performance in large state machines is very much locality dependent but this way you get to do this dynamically not statically.