Hacker News new | ask | show | jobs
by Sweepi 58 days ago
"Bonus bonus chatter: The xor trick doesn’t work for Itanium because mathematical operations don’t reset the NaT bit. Fortunately, Itanium also has a dedicated zero register, so you don’t need this trick. You can just move zero into your desired destination."

Will remember for the next time I write asm for Itanium!

2 comments

Quite a few architectures have a dedicated 0 register.
Yep. The XOR trick - relying on special use of opcode rather than special register - is probably related to limited number of (general purpose) registers in typical '70 era CPU design (8080, 6502, Z80, 8086).
Unfortunately, 6502 can't XOR the accumulator with itself. I don't recall if the Z80 can, and loading an immediate 0 would be most efficient on those anyway.
XOR A absolutely works on Z80 and it's of course faster and shorter than loading a zero value with LD A,0. LD A,0 is encoded to 2 bytes while XOR A is encoded as a single opcode. XOR A has the additional benefit to also clear all the flags to 0. Sub A will clear the accumulator, but it will always set the N flag on Z80.
Yeah, the article seems to have missed the likely biggest reason that this is the popular x86 idiom - that it was already the popular 8080/Z80 idiom from the CP/M era, and there's a direct line (and a bunch of early 8086 DOS applications were mechanically translated assembly code, so while they are "different" architectures they're still solidly related.)
Ah, thanks, I couldn't recall off the top of my head.
should set Z too
You're absolutely right, I stand corrected.

The 6502 gets by doing immediate load: 2 clock cycles, 2 bytes (frequently followed by single byte register transfer instruction). Out of curiosity I did a quick scan of the MOS 1.20 rom of the BBC micro:

  LDY #0 (a0 00): 38 hits
  LDX #0 (a2 00): 28 hits
  LDA #0 (a9 00): 48 hits
Are you sure you're not an LLM? There is no way anybody writing 6502 would do anything else, because there's no other way to do it.

(You can squeeze in a cheeky Txx instruction afterwards to get a 2-or-more-for-1, if that would be what you need - but this only saves bytes. Every instruction on the 6502 takes 2+ cycles! You could have done repeated immediate loads. The cycle count would be the same and the code would be more general.)

> Are you sure you're not an LLM?

Hard to tell, but I don't think so ;-)

I suppose using Txx instructions rather than LDx is more of an idiom than intended to conserve space. Also, could an LDx #0 potentially be 3 cycles in the edge case where the PC crosses a page boundary? (I'm probably confused? Red herring?)

The Z80 can do either LD A,0 or SUB A or XOR A, but the LD is slower due to the extra memory cycle to load the second byte of the instruction.
And [as mentioned in the article] even modern x86 implementations have a zero register. So you have this weird special opcode that (when called with identical source and destination) only triggers register renaming
A move on SPARC is technically an OR of the source with the zero register. "move %l0, %l1" is assembled as "or %g0, %l0, %l1". So if you want to zero a register you OR %g0 with itself.
Indeed!!

MIPS - $zero

RISC-V - x0

SPARC - %g0

ARM64 - XZR

PowerPC: "r0 occasionally" (with certain instructions like addi, though this might be better considered an edge case of encoding)
On 64-bit ARM, the same register number is XZR in some instructions and the stack pointer in others.
Alpha: r31, f31
Very few architectures have a NAT bit though.
indeed. riscv for instance. also, afaik, xor’ing is faster. i would assume that someone like mr. raymond would know…
> afaik, xor’ing is faster

Even tiny tiny CPUs can do sub in one cycle, so I doubt that. On super-scalar CPUs xor and sub are normally issued to the same execution units so it wouldn't make a difference there either.

On superscalars running xor trick as is would be significantly slower because it implies a data dependency where there isn't one. But all OOO x86's optimize it away internally.
Sub has the same false data dependency.
Which part of "mathematical operations don’t reset the NaT bit" did you not understand?
It would probably run really fast, considering that Itanium's downfall was the difficulty in compiling. (Including translating x86 instructions into Itanium instructions)
Not really. Itanium was a result of some people at Intel being obsessed by LINPACK benchmarks and forgetting everything else. It sucked for random memory access, and hence everything that's not floating-point number-crunching. Compiler can't hide memory access latency because it's fundamentally unpredictable. VLIW does magic for floating-point latency (which is predictable), but

- As transistors got smaller, FP performance increased, memory latency stayed the same (or even increased).

- If you are doing a lot of floating point, you are probably doing array processing, so might as well go for a GPU or at least SIMD).

- Low instruction density is bad for I-cache. Yes, RISC fans, density matters! And VLIW is an absolute disaster in that regard. Again, this is less visible in number-crunching loads where the processor executes relatively small loops many times over.

Naive question: shouldn't vliw be beneficial to memory access, since each instruction does quite a lot of work, thus giving the memory time to fetch the next instruction?
- Even each instruction does a lot of work, it is supposed to do it in parallel, so time available to fetch the next instruction is (supposed to be) the same.

- Not everything is parallelisable so most of instructions words end up full of NOPs.

- The real problem are data reads. Instruction fetches are fairly predictable (and when they aren't OOO suck just as much), data reads aren't. An OOO can do something else until the data comes in. VLIV, or any in-order architecture, must stall as soon as a new instruction depends on the result of the read.

Short loops and lots of math is what I'm talking about; the kind of thing that hand-written assembly language helps with on even modern hardware.