Hacker News new | ask | show | jobs
by lacampbell 3478 days ago
How do most stack based VM represent instructions? When I went through a phase of writing VMs, the approach I used was to represent everything with a single byte. if a "push" byte was encountered, it would scan ahead 8 more instructions and construct an int. So it had the overhead of having to decode and encode actual data, but the instruction stream was very dense.

I imagine in the real world, they use something like

    union instruction_packet {
        int64 value;
        int8[8] instructions;
    }
Which would be less space efficient (when you hit a push instruction you'd need to move to the next packet, even if it was the first element of the array) but gets rid of the decoding/encoding problem.
1 comments

The JVM uses an 8-bit opcode, of which 202 are defined. Some opcodes have implicit operands (e.g., iload_1, load the first local variable as integer on the stack or iconst_m1, push -1 on the stack), while others take usually an index that's usually 1 byte (or 2 byte for jump offsets)--although there's a wide instruction that makes the next instruction use 2-byte operands instead.

Instruction decoding for the entire JVM instruction set save tableswitch and lookupswitch amounted to about 50 lines of code, not counting the enum listing every opcode.

I more meant - what shape is in the instruction stream? Is it literally just a stream of individual bytes, that have to be decoded when pushing constants etc?
The Code attribute in the JVM spec consists of the number of local variable slots and the maximum stack depth, a byte array, and a collection of attributes to represent things like the stack map or exception handling tables.

Most of the opcodes have zero or one operands, usually a single byte (or two bytes with a prefix) that's an index into the local variable array or the constant pool. There's several ways to load constants: the integer constants -1 through 5, long constants 0 and 1, float and double 0, 1, and 2 all have 0-operand opcodes (e.g., iconst_0); there's an opcode to load a 1-byte immediate, another to load a two-byte immediate, and an opcode to load a constant in the constant pool (which can be a string, a Class<?> reference, an integer, long, double, or float).

It should be noted that the JVM specifies a fixed big-endian format in its bytecode, and as a result, even the two-byte immediates are specified in the manual as two operands of a single byte each.

That's one possible design, but it's common to use a constant pool. Placing constants in the instruction stream is possible but awkward, since there is no guarantee that they are either aligned or of the right endianness (assuming you are trying to write a more or less portable system).

Constants that fit in a byte don't have this problem and can be easily placed in the instruction stream using dedicated byte constant instructions, sometimes including dedicated single-byte instructions for common constants like zero, one, etc.