by adrian_b 876 days ago
While in this case the jump into the middle of an instruction was not intentional, several decades ago many Microsoft programs for MS-DOS, like their BASIC interpreter, contained frequent intentional jumps into the middle of instructions.

This technique had already been used in many Microsoft programs for CP/M, i.e. for Intel 8080 or Zilog Z80 CPUs.

There were two reasons for the use of such jumps into the middle of instructions. One was to obfuscate the program against disassemblers, which such jumps desynchronized from the instruction stream. The other was that each such jump usually saved one byte of instruction code in comparison with conventional conditional program structures.

When the program size was still measured in kilobytes, saving even only one hundred bytes by this method was important.

The instruction into the middle of which a jump was performed was usually an instruction that loaded a 16-bit register with an immediate value that happened to be the encoding for some other instruction.

The value loaded into the register was not used. Therefore, when the instruction was entered through its start address, no operation was done, but when it was entered through a jump, some extra instruction was executed.

A typical use case would be at the beginning of a loop, which was entered after initializing a control variable, while whenever the loop was repeated later an extra instruction was executed (corresponding to the immediate value of the register load), to update the control variable. This saved one byte compared with the standard loop implementation, which uses an unconditional jump to enter the body of the loop in its middle, after the loop initialization.
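
To make the two entry points concrete, here is a minimal sketch in Python of a toy decoder that knows only a few 8086 opcodes. The byte sequence is a made-up illustration, not taken from any Microsoft program, and it assumes DX holds nothing needed at that point, so that loading it is harmless.

```python
# Toy decoder for a handful of 8086 opcodes; anything else is shown as raw data.
REG16 = ["AX", "CX", "DX", "BX", "SP", "BP", "SI", "DI"]

def decode(code: bytes, start: int) -> list[str]:
    """Linearly decode `code` from offset `start` and return the mnemonics."""
    out = []
    pc = start
    while pc < len(code):
        op = code[pc]
        if 0xB8 <= op <= 0xBF and pc + 2 < len(code):
            # MOV r16, imm16: one opcode byte plus a 2-byte immediate
            imm = code[pc + 1] | (code[pc + 2] << 8)
            out.append(f"MOV {REG16[op - 0xB8]}, {imm:04X}h")
            pc += 3
        elif 0x40 <= op <= 0x47:
            # INC r16: single byte
            out.append(f"INC {REG16[op - 0x40]}")
            pc += 1
        elif op == 0x90:
            out.append("NOP")
            pc += 1
        else:
            out.append(f"DB {op:02X}h")
            pc += 1
    return out

# BA 46 46 is MOV DX, 4646h; the same bytes from offset 1 are INC SI / INC SI.
# The trailing 90h (NOP) stands in for the loop body.
LOOP_HEAD = bytes([0xBA, 0x46, 0x46, 0x90])

print(decode(LOOP_HEAD, 0))  # first pass: falls in from the loop initialization
print(decode(LOOP_HEAD, 1))  # later passes: the backward jump lands one byte in
```

On the first pass SI is left alone; on every later pass it is advanced by two, which is the "update the control variable" step described above.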

2 comments

Biology sometimes "jumps into the middle of instructions" when decoding DNA/RNA. The "instruction" length is always 3 nucleotides and the sequence usually has only one valid "entry point".

There are exceptions though. Some sequences do encode biologically useful information with sequences which not only overlap but do so with the starting points differing by a number not divisible by 3: https://en.wikipedia.org/wiki/Reading_frame#Multiple_reading...
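
A small Python sketch of what "differing by a number not divisible by 3" means: the same string split into codons from each of the three possible starting offsets. The sequence is arbitrary and only for illustration, not a real gene.

```python
def codons(seq: str, frame: int) -> list[str]:
    """Split seq into complete 3-letter codons, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]

SEQ = "ATGGCCTAAG"  # arbitrary illustrative sequence

for frame in range(3):
    print(frame, codons(SEQ, frame))
```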

You'd almost hope there was some younger more code-centric version of Douglas Hofstadter, figuring out how to write palindromic or multi-framed assembler that was semantically equivalent to, for example, his "Crab Canon".

Rather than "jumping into the middle of an instruction" (which to me implies some crazy trickery where both the original instruction and the middle of it do something useful, like the sibling comment about biology), a more straightforward way of looking at it is that the "jumped-into" instruction is actually a single-byte jump which skips over the one or two following bytes - and those shouldn't be considered to be part of the instruction at all.

The opcode will of course perform whatever function it was normally intended for, but it will have been chosen by the programmer to be something that doesn't affect the state of the program in a meaningful way. So it really is just a shorter form of jump, not something beyond the understanding of mere humans - you could always replace it with a regular jump and get the same semantics.

CPUs at the time didn't have branch prediction. The 8086 did prefetch code, but always linearly, and even an unconditional jump would flush that prefetch queue. So having an instruction like "CMP AX,xxxx" (where xxxx=any 16 bit immediate value) would be both shorter and faster than the "proper" jump instruction, while affecting only the flag register.

And on the 8080, there was no short (8 bit relative) jump instruction, so this kind of trick even saved two bytes instead of just one.
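
The byte counting behind "one byte" versus "two bytes" can be written out as a rough Python sketch. The jump sizes are the standard encodings; using the lone opcode byte of LXI (a 16-bit immediate load) as the 8080 skip is an assumption here, in line with the register-load description in the parent comment.

```python
# Encoded sizes in bytes. "skip_opcode" is the lone opcode byte whose operand
# bytes are the instruction(s) being skipped, so it costs exactly one byte.
SIZES = {
    "8086": {"jump": 2, "skip_opcode": 1},  # JMP SHORT rel8 (EB xx) vs. 3Ch / 3Dh
    "8080": {"jump": 3, "skip_opcode": 1},  # JMP addr16 (C3 lo hi) vs. an LXI opcode (assumed)
}

def bytes_saved(cpu: str) -> int:
    """Bytes saved by replacing a forward jump with a one-byte skip opcode."""
    sizes = SIZES[cpu]
    return sizes["jump"] - sizes["skip_opcode"]

for cpu in SIZES:
    print(cpu, bytes_saved(cpu))
```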

> Rather than "jumping into the middle of an instruction" (which to me implies some crazy trickery where both the original instruction and the middle of it do something useful, like the sibling comment about biology)

Doesn't "jumping into the middle of the instruction" just mean that decoding from the middle yields a different instruction than decoding from the start of it?

E.g.:

    IMUL -> 0x0FAF
    SCASW -> 0xAF

Almost any random bytes can be decoded as some instruction - most of them are not useful. The essence of this jump-forward-by-one-or-two-bytes trick is that it performs an operation which doesn't have an effect on the program, so it can be considered to be a single-byte jump opcode, and its "immediate operand" as a separate instruction which is simply skipped over.

You could define assembler macros named something like "SKIP1" and "SKIP2" that emit these opcode bytes (3Ch / 3Dh), and use them just like a jump, provided that you don't care about the flags.
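
As a rough sketch of how such macros would be used, here is a Python stand-in for the assembler plus a tiny interpreter covering only the four opcodes involved. The names `mov_al` and `run`, and the three-entry-point layout, are hypothetical illustrations rather than code from any real program; the interpreter deliberately ignores the flags, which is exactly the "provided that you don't care about the flags" condition.

```python
SKIP1 = bytes([0x3C])  # opcode of CMP AL, imm8: swallows the next 1 byte
SKIP2 = bytes([0x3D])  # opcode of CMP AX, imm16: swallows the next 2 bytes
RET = bytes([0xC3])

def mov_al(value: int) -> bytes:
    """Encode MOV AL, imm8 (B0 ib)."""
    return bytes([0xB0, value & 0xFF])

# Three entry points (offsets 0, 3 and 6) sharing one tail. Each SKIP2 hides
# the following 2-byte MOV AL from whoever entered earlier.
CODE = mov_al(1) + SKIP2 + mov_al(2) + SKIP2 + mov_al(3) + RET

def run(code: bytes, entry: int) -> int:
    """Execute from `entry` until RET and return AL. Flags are not modelled."""
    al, pc = 0, entry
    while True:
        op = code[pc]
        if op == 0xB0:    # MOV AL, imm8
            al = code[pc + 1]
            pc += 2
        elif op == 0x3C:  # CMP AL, imm8: only flags change
            pc += 2
        elif op == 0x3D:  # CMP AX, imm16: only flags change
            pc += 3
        elif op == 0xC3:  # RET
            return al
        else:
            raise ValueError(f"unsupported opcode {op:02X}h at offset {pc}")

for entry in (0, 3, 6):
    print(entry, run(CODE, entry))
```

Replacing each SKIP2 with a real short jump to the RET would give the same AL for every entry point, at the cost of one extra byte per jump.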

My point was that this is easier to reason about than the more general case where (working) code jumps into the middle of an instruction that already does something else - that either takes extreme cleverness, or evolution over millions of years.