by tachyonbeam 1973 days ago

Compiler engineer here. In practice, compilers for higher-level languages often have a lot of difficulty getting anywhere close to the efficiency of comparable C code. If you take Python, for example, you have to do a lot of inlining to eliminate various abstractions. Inlining is actually non-trivial. Yes, inlining, by itself, is an easy program transformation, but knowing where to inline to get the best performance is very hard. If you inline too much, you increase code size, and you lose performance. You have to know precisely which inlining decisions will pay for themselves, and your large codebase might have tens of thousands of call sites and a call hierarchy 40 functions deep. Python also adds the fun little problem that you can redefine any function at run time, which makes it hard for the compiler to know which function you're actually going to be calling. To complicate things further, inlining decisions affect other optimizations. For example, if you inline foo into bar, do you then also inline into bar the functions that foo is calling? Do you unroll the loops from foo into bar? Etc.
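
To make the run-time redefinition point concrete, here is a minimal Python sketch. The names `scale` and `total` are invented for illustration. It only shows why a compiler cannot blindly inline the callee: the call inside `total` is resolved by name each time it executes, so rebinding the name changes the behavior of code that was already defined.

```python
def scale(x):
    # Original definition: a compiler might be tempted to inline this.
    return x * 2


def total(values):
    # `scale` is looked up by name in the module globals on every call,
    # so the target of this call site is not fixed when `total` is defined.
    return sum(scale(v) for v in values)


before = total([1, 2, 3])


def scale(x):  # rebinding the same global name at run time
    return x * 10


# Same `total` function object, different callee behind the same name.
after = total([1, 2, 3])

print(before, after)
```

A JIT that inlined the first `scale` into `total` would need some kind of guard or invalidation mechanism to stay correct after the rebinding, which is part of what makes the inlining decision expensive.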

Also, there's an aspect that I feel is constantly overlooked, and this is the way that objects are laid out in memory. In Python, JavaScript, Ruby, you have a lot of pointers to objects. You get a lot of pointer-chasing as a result. This is BAD for caches. Each object you touch, each pointer you dereference means pulling in a new cache line. In C, you can design very compact and flat data structures. You can use 32-bit floats, 8-bit, or even one-bit integers if you want. You can have a struct within a struct, with an array inside of it, all without any pointer-chasing.
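
As a rough illustration of the flat layout described above, the sketch below uses Python's `ctypes` module to declare the kind of struct a C programmer would write. `Vec3`, `Particle`, and their fields are hypothetical names chosen for the example, and exact padding is platform-dependent, so treat the sizes as indicative rather than guaranteed.

```python
import ctypes
import sys


class Vec3(ctypes.Structure):
    # Three 32-bit floats, stored back to back.
    _fields_ = [
        ("x", ctypes.c_float),
        ("y", ctypes.c_float),
        ("z", ctypes.c_float),
    ]


class Particle(ctypes.Structure):
    _fields_ = [
        ("pos", Vec3),                    # struct within a struct, stored inline
        ("history", ctypes.c_float * 4),  # fixed-size array, also stored inline
        ("alive", ctypes.c_uint8, 1),     # one-bit field
        ("kind", ctypes.c_uint8, 7),      # seven-bit field packed alongside it
    ]


# One contiguous block holding 1000 particles: no per-object pointers.
particles = (Particle * 1000)()
flat_bytes = ctypes.sizeof(particles)

# Writing through a nested field touches the same underlying buffer.
particles[0].pos.x = 1.5

# For contrast: size of a single boxed Python float object on this
# interpreter. A list of floats stores a pointer to each such object.
boxed_float_bytes = sys.getsizeof(1.0)

print(ctypes.sizeof(Particle), flat_bytes, boxed_float_bytes)
```

The nested `pos` and `history` fields live at fixed offsets inside each `Particle`, so walking the array is a linear scan over memory rather than a chain of dereferences.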

Modern CPUs are constrained by memory bandwidth, and it's very hard for any programming language to beat C on achievable memory efficiency. What's worse is that we have little to no academic compiler literature on automatically optimizing for memory-efficient data layouts. It's an overlooked and understudied problem. You would have to prove that integer values lie within certain ranges (hard) and also do object inlining (collapsing objects into their parents), which AFAIK is also hard and not done by any mainstream compiler.

So, yeah, keep thinking that a sufficiently-smart compiler will do everything for you. You will assuredly be very disappointed. Until we have strong AI, the sufficiently smart compiler is basically unobtainium. If you want efficiency, the best route is generally to have fewer layers of abstraction, or to only rely on compiler optimizations you know for certain will happen.

6 comments

cbsmith 1973 days ago

The sufficiently smart compiler is effectively unobtanium, but that presents a challenge for C as well. C has its own challenges with its memory model & language semantics.

C never was the lowest level of abstraction; there are other abstraction models out there and more still to be invented no doubt. C's model did well (though it struggled mightily against Fortran for the longest time) at aligning with processor models of the '80s and '90s. It helps that C's ubiquity meant that, to a degree, the tail was wagging the dog: processor designs were often measured against how they executed code compiled with a C compiler. But past performance is a poor predictor of future success; who is to say that as processor designs continue to evolve, C's abstraction model won't become increasingly leaky compared to another? Absent a sufficiently smart compiler, it's entirely possible that C compiler writers will find themselves at a disadvantage.

And that assumes they're competing with a traditional compiler. It's possible, though unlikely, that there will be competition from other execution models. As you said, memory is often the core performance bottleneck these days, and as terribly inefficient as bytecode interpreters are, they tend to have smaller code sizes. Efficient interpreters are hand-tuned specifically to make the overhead of bytecode interpretation as small as possible. Now, intrinsically they are performing a translation step at runtime that a compiler did before runtime, but one can at least imagine a model where the interpreter is effectively a specialized decompression algorithm that feeds machine code to the CPU (that's really not that far afield from what happens in hardware in modern CPUs and in mobile runtimes). Higher levels of abstraction might allow for more efficient decompression... It's crazy, but not inconceivable.
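
For readers who have not seen one, a toy stack-based bytecode interpreter can be sketched in a few lines of Python. The opcode set (`PUSH`, `ADD`, `MUL`, `HALT`) and the `run` function are invented for this example and are not taken from any real VM. The point is only that the program itself is a handful of bytes, while the dispatch loop does at run time the translation a compiler would have done ahead of time.

```python
# Hypothetical opcodes for a toy stack machine.
PUSH, ADD, MUL, HALT = range(4)


def run(code: bytes) -> int:
    """Interpret `code` and return the value on top of the stack at HALT."""
    stack = []
    pc = 0  # index of the next byte to decode
    while True:
        op = code[pc]
        pc += 1
        if op == PUSH:
            # PUSH is followed by a one-byte immediate operand.
            stack.append(code[pc])
            pc += 1
        elif op == ADD:
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")


# (2 + 3) * 4 encoded in nine bytes.
program = bytes([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT])
result = run(program)
print(len(program), result)
```

Every instruction pays for the decode-and-branch in the loop, which is the overhead that hand-tuned interpreters try to minimize.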