Hacker News new | ask | show | jobs
by lacampbell 3482 days ago
I would love to see the code used for some of these instructions. Their description of "swap" for the stack VM jumped out at me:

> pop 1 st and 2nd value out of local stack and swap them

You don't need to pop at all for swap. The stack isn't actually changing size and can be modified in place. You also should only have one pop for 'add', etc. A lot of people don't seem to realise this.

Also, this article may be of interest:

https://blogs.msdn.microsoft.com/ericlippert/2011/11/28/why-...

TL;DR - a stack machine is conceptually simpler than a register machine, and it doesn't matter if it's slower since you are JIT'ing it anyway.

5 comments

It does matter even for jitting.

Pike and Winterbottom (http://www.vitanuova.com/inferno/papers/hotchips.html) used register-based VM for Plan9/Inferno because it's much faster and easier to write a high-quality jit from register-based VM to register-based CPU (and all of them are register based) than from byte-code VM to register-based CPU.

This is probably the same reason Android chose register-based VM design for their Java-based platform.

Jitting is not free - an increase in jit complexity is paid by lower performance of the code being jitted. Java's HotSpot does wonders but it takes much latency and high memory use to get the really high-quality results.

It makes sense to do as much work as possible where you have time and CPU cycles to spare (during compilation to VM instruction set) and not on the device when every millisecond and every byte matters.

Lower complexity of stack-based VM is a moot point - when you're jitting, you're doing the hard stuff anyway.

I don't follow the reasoning. Tracking the stack depth to convert into the equivalent of Pike and Winterbottom's VM registers takes at most one instruction at JIT time per stack-VM instruction (to increment or decrement the depth). If you want distinguish different types at the same stack offset, that might be one more instruction. Pike and Winterbottom seem to be saying their VM has an infinite number of registers, so the frontend is simple, and then the JIT backend has a choice of how hard to work at register allocation. But the same goes for a stack machine, with a very small overhead at JIT time, nothing like the expense of HotSpot.

The argument that does make sense to me is, if you want a somewhat faster interpreter at some expense in code size, then a register VM wins. (Perhaps by not as much as commonly thought, because of some optimizations like stack caching and superinstructions: http://savannah.gnu.org/projects/vmgen/ But it does seem to win still.) This has the disadvantage of having to having to implement spilling or some other such complication in the frontend when the number of virtual registers needed exceeds 256 or whatever.

Well we are both just appealing to experts here (unless you are infact one. I am certainly not - just an enthusiast).

What did you think of the last part of Erik Lipperts post?

The code is much smaller and much easier to understand. A stack machine is a very simple way to describe a complex computation; by being able to write code for such a simple machine, it lowers the cost of making a compiler. And not only is it easier to write compilers and jitters that target simple stack machines, it is easier to write other code analysis tools as well. The IL verifier, for example, can quickly determine when there is a code path through a method that, say, misaligns the stack, or passes the wrong types of arguments to a method.

The stack machine may be conceptually simple, but here's the thing:

Either you have a way to address an offset into the stack, or you don't. If you do, the stack is conceptually equivalent to a register file where you can optionally swap out the entire content in one go (by changing the stack pointer).

If you don't, then registers will in most designs offer you a strict superset of operations: I've yet to see any register-based machines where you can't load things into registers, do the operation and push them back on the stack. In fact, on most machines that are not purist RISC designs, you will tend to be able to do many/most operations with one of the operands in memory.

That superset comes at a complexity cost in some areas, sure, but being able to address more different data items without having to manipulate the stack also turns out to be very handy. If it wasn't, we'd just stick to mostly manipulating the stack even in register-based designs.

EDIT: In fact, you'll often see "naive" first code generators for register-based machines heavily rely on stack manipulation for expression evaluation because it massively reduces the need for register allocation complexity.

Also, it's worth noting that computation stack and execution stack need not be the same thing (and, indeed, often aren't).

In CLR, for example, local variables are referenced by index on the execution stack, but you cannot dynamically push new variables onto it - you have to predeclare the locals at the beginning of your execution frame, and the only way to push a new execution frame is by calling another function. On the other hand, opcodes like ADD operate on the computation stack, which is an abstraction completely separate from, and not affected by, the execution stack (every execution frame has its own independent computation stack; there's no way to peek across the boundary).

So in reality it's a bit more complicated - those locals are also kinda like named (or indexed) registers (at least until you take their address). Then the stack adds more temp registers on top of that. In this model, you could always take any register-based code, and translate it to stack-based with a very blunt transform, where something like "ADD x, y, z" becomes "LOAD x; LOAD y; ADD; STORE z". But in addition to that, you also have the flexibility of doing a bunch of operations directly on the temp stack, without intervening stores and loads, that are unavoidable in the register representation. So I would argue that it's the stack representation that is a superset of register one, not the other way around.

An even shorter way to say this is that stack-based VMs are more high-level. With a register-based VM, you shift some important questions (like register allocation, and handling temporaries) to the compilers. With a stack-based VM, the compiler can be simpler, but the JIT will need to be correspondingly more complicated.

Which one to choose depends on how it is all used. For .NET, multi-language targeting was seen as an important design goal, and so making the bytecode easier to target was prioritized, with all the complexity relegated to JIT.

Long-term, I think it was a right solution in general, because most platforms seem to be moving from JIT to AOT; and with AOT, having the complexity in the bytecode-to-native compiler still has the upside of DRY, and slower compile times stop being a downside.

However, if you leave register allocation to the source->VM compiler and don't do it in the VM->machine compiler (JIT), you cannot perform machine-specific register allocation (8 registers on IA-32, 16 on AMD64 and ARM, 32 on Aarch64). So register-based VMs typically work with large register sets, and JITs allocate them to the smaller register sets of real machines.

I have not read the present paper yet, but earlier papers I have read were about reducing interpretation overhead by performing fewer VM instructions (if VM instruction dispatch is expensive; dynamic superinstructions make it cheap, but not everyone implements them).

Note that the JIT mechanism mentioned in the paper meant basically JIT from a program code to the bytecode executed in the VM, and directly executing it (very much like PHP), not necessarily directly to machine assembly.
Interesting.... the paper referenced you if you're the correct Anton Ertl :)
Interesting expressions usually use local variables. JVM, CLR etc. despite being nominally stack-oriented, use local variable slots by index rather than shuffling the stack for anything much more complicated than eager expression evaluation.

This gets the best of both worlds in practice.

The last page of the publication has a GitHub link. Here are the relevant lines (I think):

https://github.com/Conceptual-Inertia/Conceptum/blob/master/...

Yeah, really naive. Ditto with their iadd commands, which pop twice (you need only pop once). I see this a lot - presumably because most people just aren't familiar with stack machines, as register machines are completely dominant in hardware.

See https://news.ycombinator.com/item?id=13155259 for an example of what I mean.

> TL;DR - a stack machine is conceptually simpler than a register machine, and it doesn't matter if it's slower since you are JIT'ing it anyway

It totally matters. See [1] for previous work on this topic. You need to perform expensive data flow optimisations to generate information the stack format doesn't need to encode, but that a register format does need. Better to do this analysis ahead of time in the compiler than at runtime.

[1] https://www.usenix.org/events/vee05/full_papers/p153-yunhe.p...

IIRC, strictly speaking stack machines are not allowed to modify values in place on the stack. Any decent JIT or interpreter is likely to recognize that it can optimize certain operations by performing them in place, but the bytecode/IL/whatever will always be "pop two values and swap them".
> IIRC, strictly speaking stack machines are not allowed to modify values in place on the stack.

I am curious as to where you heard this. Modifying the stack in place seems like an implementation detail. I mean even if swap wasn't a primitive (which again seems really far fetched), you are still going to have to modify part of the stack in place eventually. It's just in the naiive way you decrement the stack pointer twice then increment it twice, while in the way I outline you don't bother since the length of the stack is the same after the operation is completed.

Probably easier to write less and pseudo code more - I am assuming a downward growing stack here

    # First approach
    temp_a = pop(stack)
    temp_b = pop(stack)
    push(stack, temp_a)
    push(stack, temp_b)

    # second approach
    temp_a = stack[0]
    stack[0] = stack[1]
    stack[1] = temp_a
A stack is an abstract type having two operations: push(element) and pop(). The implementation can cheat till it's transparent to the user. To quote CLRS,

The INSERT operation on a stack is often called PUSH, and the DELETE operation, which does not take an element argument, is often called POP. These names are allusions to physical stacks, such as the spring-loaded stacks of plates used in cafeterias. The order in which plates are popped from the stack is the reverse of the order in which they were pushed onto the stack, since only the top plate is accessible.

"A stack" is an abstract type, but "The Stack" isn't.
"The Stack" of a virtual machine definitely is.
> I am curious as to where you heard this. Modifying the stack in place seems like an implementation detail.

You are correct, but as the other reply mentioned, when we're talking about abstract specifications a stack typically doesn't allow reads and writes to arbitrary indices, therefore a stack machine can't perform an in-place swap. The implementation will almost certainly optimize the operation as you're describing, but I'd still expect to see a swap operation defined as "pop pop push push". Likewise I would usually expect to see an add described as popping two arguments and pushing one result, even thought the implementation probably pops one argument, then reads the second and overwrites it with the result.

In short, it's the abstract description of the thing vs the actual implementation.

I don't see why the abstract specification should mention pushing or popping at all. The abstract specification for swap is simply "swaps the first and second elements of the stack". How that's achieved doesn't seem important, and it seems odd to me to define it in terms of primitive operations it's not actually defined in terms of - unless it's some ultra minimalist Forth thing where swap isn't a primitive.
I can only partially agree with this - having a stack exposed to modification would virtually make it not a stack.
> You don't need to pop at all for swap.

Your skepticism is justified: https://github.com/frjalex/Conceptum/blob/master/src/main.c#...

(note that this doesn't implement a swap by any means)