by teddyknox 3386 days ago
What differentiates LLVM IR from, say, JVM bytecode? I'm curious because there's a stalled out GNU project under GCC called GCJ that would compile JVM bytecode to native. I wonder if the issue became that statically linking in the JVM in the binary resulted in a lot of bloat, or something more intrinsic to the suitability of JVM bytecode as a platform-independent IR...
5 comments

Off the top of my head:

* JVM bytecode is stack-based, whereas LLVM uses "infinite registers" in SSA form

* being in SSA form makes it convenient to consume in compiler passes, but comes with quirks that mean you don't really want to write in that style manually: the mind-bending phi instructions, definitions must dominate uses, a simple op like incrementing a variable really means creating a new variable, etc.

* JVM bytecode carries a lot of Java-level information, for instance if you have N classes with M methods each in source, you will typically find N classes with M methods in bytecode too. A lot of keywords in Java have an equivalent in bytecode (e.g. private, protected, public, switch, new...)

* in contrast, LLVM IR feels closer to C (it only knows about globals, arrays, structs and functions). It exposes lower level constructs like vector instructions, intrinsics like memcpy

* JVM bytecode is well specified: anyone armed with the pdf [1] can implement a full JVM. LLVM IR is somewhat loosely defined and evolves based on the needs of the various targets

* JVM bytecode is truly portable, whereas target ABI details leak into LLVM IR. A biggie is 32 bit vs 64 bit LLVM IR.

[1] https://docs.oracle.com/javase/specs/jvms/se8/jvms8.pdf
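
To make the stack-based vs. SSA contrast from the list above concrete, here is a minimal sketch in Python. It is only an illustration: the opcode names (`push`, `load`, `add`, `mul`) and the `stack_to_ssa` helper are made up for this example and are neither real JVM opcodes nor an LLVM API. It walks a tiny stack program and emits register-style instructions in which every result gets a fresh name that is never reassigned.

```python
from itertools import count


def stack_to_ssa(bytecode):
    """Translate a tiny stack-machine program into SSA-style three-address code.

    The opcodes here are hypothetical, loosely modelled on stack-machine
    instructions; they are not real JVM opcodes.
    """
    fresh = count(1)   # source of fresh virtual register numbers
    stack = []         # holds operand *names*, not runtime values
    out = []           # emitted SSA-style instructions

    for op, *args in bytecode:
        if op == "push":            # constant operand
            stack.append(str(args[0]))
        elif op == "load":          # named input, e.g. a function argument
            stack.append(f"%{args[0]}")
        elif op in ("add", "mul"):
            rhs = stack.pop()
            lhs = stack.pop()
            dest = f"%t{next(fresh)}"   # each result gets a brand-new name
            out.append(f"{dest} = {op} {lhs}, {rhs}")
            stack.append(dest)
        else:
            raise ValueError(f"unknown opcode: {op}")

    result = stack[-1] if stack else None
    return out, result


# (x + 1) * x written as stack code
program = [("load", "x"), ("push", 1), ("add",), ("load", "x"), ("mul",)]
lines, result = stack_to_ssa(program)
for line in lines:
    print(line)
```

The stack disappears entirely in the output; what remains is a chain of single-assignment temporaries, which is roughly the shape compiler passes like to consume. Control flow (and therefore phi instructions) is deliberately left out of this sketch.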

> JVM bytecode is truly portable, whereas target ABI details leak into LLVM IR. A biggie is 32 bit vs 64 bit LLVM IR.

From a few LLVM meetings and Apple's work on bitcode packaging, I think there is some work to make LLVM IR actually architecture independent.

PNaCl is noteworthy here too, insofar as its IR is very much based on LLVM IR (but portable and stable).
There are lots of Java AOT compilers to native code, namely most commercial JDKs for embedded development.

The biggest problem with GCJ was that most people left the project when OpenJDK was made available, and decided to work on it instead.

LLVM IR is an IR format :). Basically LLVM IR is just an abstract RISC, whereas the JVM is a lot of that plus a truckload of very high-level instructions. One could implement these as a superset of LLVM, but that's not what LLVM is. You can, mostly, JIT LLVM IR and use it as a generic bytecode VM, but it's really designed for static compilation.
When will we see processors with a subset of LLVM IR in hardware?
That would be a terrible idea. LLVM IR has an inefficient in-memory representation; every IR node is separately heap-allocated via new and chained together via pointers. This is probably a suboptimal design for the compiler, but it would go from suboptimal to disastrous if directly used for execution.
I don't think an implementation of LLVM IR for execution would require the same in-memory representation.
Exactly. My point was more that, just as we have C-optimized processors and microcontrollers, or even Java- or Lisp-based ones, once there is a lot of software readily compilable with LLVM, maybe architectures could be optimized for it (not by directly porting it, just by having a tiny final LLVM-based microcode step). For example, of course you can't have infinite registers as in SSA, but it can influence your instruction set.
That's not the only reason why you wouldn't want to run LLVM IR directly (if it were possible). You still have the types, which are useless at runtime, and the unlimited register file to deal with.

You could make an ISA which is similar to LLVM IR, but there'd be little point when RISC-V (or even AArch64) already exists.
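
As a rough illustration of the "unlimited register file" problem mentioned above: a real machine has a fixed number of registers, so something has to work out how many SSA values are alive at once and map them onto physical registers. The sketch below, in Python, only computes that peak count for straight-line code. The `(dest, operands)` instruction format and the `max_live` name are assumptions made for this example, not anything from LLVM, and it assumes an operand's register can be reused for the result of the instruction that reads it last.

```python
def max_live(instructions):
    """Return the peak number of simultaneously live values.

    `instructions` is a list of (dest, [operands]) pairs in program order,
    a hypothetical straight-line SSA format with no branches.
    """
    # Record the index of the last instruction that reads each value.
    last_use = {}
    for i, (_dest, operands) in enumerate(instructions):
        for name in operands:
            last_use[name] = i

    live = set()
    peak = 0
    for i, (dest, operands) in enumerate(instructions):
        # Operands read here for the last time die before the result is written.
        for name in operands:
            if last_use[name] == i:
                live.discard(name)
        # A result only occupies a register if something reads it later.
        if dest in last_use:
            live.add(dest)
        peak = max(peak, len(live))
    return peak


# a and b are inputs; c = a op b; d = c op a; e = d op b
program = [
    ("a", []),
    ("b", []),
    ("c", ["a", "b"]),
    ("d", ["c", "a"]),
    ("e", ["d", "b"]),
]
print(max_live(program))
```

If the peak exceeds the number of physical registers, some values have to be spilled to memory, which is work a compiler back end does ahead of time and hardware executing the IR directly would have to do on the fly.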

Likely never. LLVM IR uses SSA form. This means that optimisations are easier, but the "assembly" is significantly higher level than assembly a la MIPS. IR is for doing optimisations, not executing code (although LLVM does have interpreters if that's what ya need).
A more interesting question is: when will we see operating systems using LLVM IR (or similar; some future version of WebAssembly, perhaps?) for binaries on disk, dynamically compiling them and caching the result for the current platform as needed?
In principle that could happen, but LLVM IR is really not designed for anything other than being transformed by LLVM. One could define an abstract RISC machine, to be jitted at either side. LLVM is not quite suitable for this purpose: it assumes quite a few details about the target. Also, this requires a huge amount of co-operation. So far this has only happened in the browser with e.g. ECMAScript standardization, asm.js and WebAssembly. The JVM tried to do this, but it's not a good compilation target for languages like C/C++. Therefore, I think this will happen eventually: the web browsers will develop the tools and specifications to make this stuff, then it will get broken off and used outside of the web (I hope, god forbid all software has to be distributed via the web using overlyHypedWebScale.js v2).
iOS 9, ChromeOS PNaCl?

Although it is not really what you are describing.

BTW, GCJ was recently finally removed from the GCC code base. I think it hadn't been maintained in a while.
LLVM IR is a generalized assembly language whereas JVM byte code is quite specific to the Java language, i.e., it deals with objects and classes. This causes all kinds of troubles for someone wanting to translate other languages to the JVM.