Hacker News new | ask | show | jobs
by _RPM 3551 days ago
Interesting. One thing that I still haven't solved yet is the "break" and "continue" statement inside loops. For a break statement, it seems like it would just be the same a JMP with an address as the operand, but there would need to be some sort of registration of "The VM is in the loop now, and the break address is X", and continue would also be a JMP with an address to the top of the code for the loop.

I haven't implemented those in my system yet, and also have no idea how Python or PHP does it.

Is PHP's VM a stack based one? I do read the Zend/ directory of PHP's source, but it is really hard to follow and there is virtually no documentation on the VM

1 comments

You can implement those as part of a linking phase during bytecode generation: emit a placeholder value (e.g. 0) for the jump address and when you've finished compiling the block go back and fill-in the placeholder with the correct address/offset.

That's relatively easy when implementing an assembler for your opcodes. Just keep track of symbolic labels and their associated jump points (as a simple array or linked list) and process (i.e. finalize or "link") the jump points when the label address becomes known. My "assemblers" often have constructs like:

  L0
  ...
  J1
  ...
  J0
  ...
  L1
where L? registers a symbolic jump destination (i.e. x.label[0].offset = label_offset) and J? emits an unconditional jump and registers a link request (i.e. push(x.label[1].from, jump_opcode_offset)). When a block is finished all the offsets are known; you just process things like

  for (i = 0; i < x.nlabel; i++) {
    for (j = 0; j < x.label[i].nfrom; j++) {
      patch_in_offset(x.label[i].from[j], x.label[i].offset)
    }
  }

Knowing when to emit symbolic label and jump instructions from the AST is a little more involved, but no more than analyzing the AST for anything else.

Supporting computed gotos would require much more bookkeeping, I'd imagine, and I'm not surprised few languages support that construct. Or maybe not... I haven't really thought it through.

One cool thing about this whole exercise is that it helps to demonstrate why generating some intermediate representation can be easier (conceptually and mechanically) than directly generating runnable code in a single pass. It seems more complex but it really makes things easier.

Computed gotos defer the search for the label from compile time to run-time. So you need to hash the targets, and then jump to it.

It all depends how your interpreter PC (program counter) works. Some have just opcode offsets (like a real PC, as %eip, e.g. lua, ruby, ..) some have opcode addresses (perl, lisp, ...).

With offsets you can encode jumps relative, which is a huge advantage. With addresses you have to remain absolute, which means the data needs a full pointer (cache), and you cannot move the code around for optimizations afterwards. With offsets you have a cache-friendly op-array, with addresses you have a fullblown tree (AST) or linked list, with its cache-unfriendly pointer chasing overhead.

But with full addresses you can avoid the big switch loop overhead in an interpreter, you just pass around the next pointer instead if incrementing the PC. But then it gets interesting how to keep the state of the stack depth.