Hacker News new | ask | show | jobs
by roninb 3387 days ago
The way I understand it, in the near future, the folk who write languages/compilers/build toolkits will be writing down to LLVM's Intermediate Representation rather than down to some specific machine code for some specific chipset.

Then the folk whom build the processors/architecture/assembly languages (and open source advocates when businesses ultimately ignore LLVM for a bit) will be writing the conversion from IR down to some specific machine code. This allows Intel to convert the IR line "COMPLICATEDFUNCTION $r2 5" to some advanced x86 call that has some significant speed or memory increase and is only one line while TI can still call the 26 lines of MIPS they need to call and both will be semantically equivalent.

That way you can, from the the software side, be relatively agnostic with how you're writing code and the chipset side is able to get every ounce of optimization using advanced functionality (if available). More importantly, software side is then able to ignore any processor features (be they speed ups or slowdowns) because every chipset manufacturer should have some toolset to convert your IR down to the best machine code they've got based on some desires (speed, space, power, etc). Inevitably, there will still be differences in performance between chipsets but being able to build down to IR (even with performance issues for some chipsets while everyone gets on board) and not some large set of assembly languages should be very nice.

2 comments

Just as additional information, this is just the future catching up with the past.

Most mainframe architectures since the early 60's, didn't really used pure Assembly, rather bytecodes that were processed by microcode on the CPUs.

Hence why many old papers tend to mix both terms, bytecodes vs assembly.

This tradition carried on to mainframe architectures like the AS/400, nowadays IBM i, where the user space is pure bytecode (even C), called TIMI, and a kernel level JIT is used at installation time to convert the application into the actual machine code.

IBM i Java takes advantage of this, where the JVM bytecodes are converted into TIMI bytecodes.

It also provides some kind of common language runtime called ILE (Integrated Language Environment).

So the trend of using LLVM bitcode on iDevices, Dalvik on Android or MSIL on .NET, JavaScript/WebAssembly on browsers, is similarly with containers, modern computing catching up with mainframe ideas.

IBM i Java used to run on top of TIMI. But the runtime overhead and IBM developer overhead were too high, and now it runs on top of PASE, which implements the AIX runtime without involving TIMI.
I wasn't aware of the change, thanks for the correction.
The change is a bit disappointing, in a way. Though I'm not sad to see the last of CRTJVAPGM.
> Just as additional information, this is just the future catching up with the past.

> Most mainframe architectures since the early 60's, didn't really used pure Assembly, rather bytecodes that were processed by microcode on the CPUs.

And was that microcode typed SSA-based IR with alias metadata and so forth?

You can't just lump all "bytecodes" together. x86 machine code is a bytecode that is processed by the microcode on the CPUs too.

I bet if anyone goes dig into SIGPLAN pappers there might be one or two examples there.

Even if not, I see it as a very interesting Msc or PhD to implement a FPGA to execute LLVM bitcode.

I am not the one lumping all bytecodes together CS papers since the dawn of our industry are. Just go dig a random paper about computer architectures in the 50 and 60's, for example.

Some of them even use bytecode and assembly interchangeably on the same document.

I guess we could eventually consider it a form of bytecode, given its CISC nature.

And when bytecode gets a pure hardware implementation, without any form of microcode support, is it still bytecode or has it become assembly?

What differentiates LLVM IR from, say, JVM bytecode? I'm curious because there's a stalled out GNU project under GCC called GCJ that would compile JVM bytecode to native. I wonder if the issue became that statically linking in the JVM in the binary resulted in a lot of bloat, or something more intrinsic to the suitability of JVM bytecode as a platform-independent IR...
Off the top of my head:

* JVM bytecode is stack-based, whereas LLVM uses "infinite registers" in SSA form

* being in SSA form makes it convenient to consume in compiler passes, but comes with quirks that mean you don't really want to write in that style manually: the mind-bending phi instructions, definitions must dominate uses, simple ops like incrementing a variable really means creating a new variable, etc.

* JVM bytecode carries a lot of Java-level information, for instance if you have N classes with M methods each in source, you will typically find N classes with M methods in bytecode too. A lot of keywords in Java have an equivalent in bytecode (e.g. private, protected, public, switch, new...)

* in contrast, LLVM IR feels closer to C (it only knows about globals, arrays, structs and functions). It exposes lower level constructs like vector instructions, intrinsics like memcpy

* JVM bytecode is well specified: anyone armed with the pdf [1] can implement a full JVM. LLVM IR is somewhat loosely defined and evolves based on the needs of the various targets

* JVM bytecode is truly portable, whereas target ABI details leak into LLVM IR. A biggie is 32 bit vs 64 bit LLVM IR.

[1] https://docs.oracle.com/javase/specs/jvms/se8/jvms8.pdf

> JVM bytecode is truly portable, whereas target ABI details leak into LLVM IR. A biggie is 32 bit vs 64 bit LLVM IR.

From a few LLVM meetings and Apple's work on bitcode packaging, I think there is some work to make LLVM IR actually architecture independent.

PNaCl is noteworthy here too, insofar as its IR is very much based on LLVM IR (but portable and stable).
There are lots of Java AOT compilers to native code, namely most commercial JDKs for embedded development.

The biggest problem with GCJ was that most people left the project when OpenJDK was made available, and decided to work on it instead.

LLVM IR is an IR format, :). Basically LLVM is just a abstract RISC, whereas the JVM is a lot of that with a truck load of very high level instructions. One could implement these as a superset of LLVM, but that's not what LLVM is. You, mostly, can JIT LLVM IR and use it as a generic bytecode vm: but it's really designed for static copmilation.
When will we see processors with a subset of llvm ir in hardware ?
That would be a terrible idea. LLVM IR has an inefficient in-memory representation; every IR node is separately heap-allocated via new and chained together via pointers. This is probably a suboptimal design for the compiler, but it would go from suboptimal to disastrous if directly used for execution.
I don't think an implementation of LLVM IR for execution would require the same in-memory representation.
Exactly, my point was more that as we're having C-optimized processors and microcontrollers or even java or lisp based ones, maybe once there is many software readily compileable with llvm maybe architectures could be optimized for it (but not directly porting it, just having a tiny final step llvm based microcode. By example of course you can't have infinite registers as ssa. But it ca' influence your instruction set.
That's not the only reason why you wouldn't want to run LLVM IR directly (if it were possible). You still have the types, which are useless at runtime, and the unlimited register file to deal with.

You could make an ISA which is similar to LLVM IR, but there'd be little point when RISC-V (or even AArch64) already exists.

Likely never, LLVM IR uses SSA form. This means that optimisations are easier, but the "assembly" is significantly higher lever than assembly a la MIPS. IR is for doing optimisations not executing code (although LLVM does have interpreters if that's what ya need)
A more interesting question is, when will we see operating systems using LLVM IR (or similar; some future version of WebAssembly, perhaps?) for binaries on disk, dynamically compiling them and caching the result for the current platform as needed.
In principle that could happen, but LLVM IR is really not designed for anything other than being transformed by LLVM. One could define an abstract risc machine, to be jitted at either side. LLVM is not quite suitable for this purpose: It assumes quite a few details about the target. Also, this requires a huge amount of co-operation. So far this has only happened in the browser with e.g. ECMAScript standardization, asm.js and WebAsm. The JVM tried to do this, but it's not a good compilation target for languages like C/C++. Therefore, I think will happen eventually: The web browsers will develop the tools and specifications to make this stuff, then it will get broken off and used outside of the web (I hope, god forbid all software has to be distributed via the web using overlyHypedWebScale.js v2)
iOS 9, ChromeOS PNaCL?

Although it is not really what you are describing.

BTW, GCJ was recently finally removed from the GCC code base. I think it hadn't been maintained in a while.
LLVM IR is a generalized assembly language whereas JVM byte code is quite specific to the Java language, i.e., it deals with objects and classes. This causes all kinds of troubles for someone wanting to translate other languages to the JVM.