| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by eb0la 201 days ago

I remember a lot of code zeroing registrers, dating at least back from the IBM PC XT days (before the 80286).

If you decode the instruction, it makes sense to use XOR:

- mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

This extra byte in a machine with less than 1 Megabyte of memory did id matter.

In 386 processors it was also - mov eax,0 - needs 5 bytes (b8 00 00 00 00) - xor eax,eax - needs 2 bytes (31 c0)

Here Intel made the decision to use only 2 bytes. I bet this helps both the instruction decoder and (of course) saves more memory than the old 8086 instruction.

6 comments

Sharlin 201 days ago

As the author says, a couple of extra bytes still matter, perhaps more than 20ish years ago. There are vast amounts of RAM, sure, but it's glacially slow, and there's only a few tens of kBs of L1 instruction cache.

Never mind the fact that, as the author also mentions, the xor idiom takes essentially zero cycles to execute because nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.

umanwizard 201 days ago

> nothing actually happens besides assigning a new pre-zeroed physical register to the logical register name early on in the pipeline, after which the instruction is retired.

This is slightly inaccurate -- instructions retire in order, so it doesn't necessarily retire immediately after it's decoded and the new zeroed register is assigned. It has to sit in the reorder buffer waiting until all the instructions ahead of it are retired as well.

Thus in workloads where reorder buffer size is a bottleneck, it could contribute to that. However I doubt this describes most workloads.

Sharlin 201 days ago

Thanks, that makes sense.

cogman10 201 days ago

L1 instruction cache is backed by L2 and L3 caches.

For the AMD 9950, we are talking about 1280kb of L1 (per core). 16MB of L2 (per core) and 64MB of L3 (shared, 128 if you have the X3D version).

I won't say it doesn't matter, but it doesn't matter as much as it once did. CPU caches have gotten huge while the instructions remain the same size.

The more important part, at this point, is it's idiomatic. That means hardware designers are much more likely to put in specialty logic to make sure it's fast. It's a common enough operation to deserve it's own special cases. You can fit a lot of 8 byte instructions into 1280kb of memory. And as it turns out, it's pretty common for applications to spend a lot of their time in small chunks of instructions. The slow part of a lot of code will be that `for loop` with the 30 AVX instructions doing magic. That's why you'll often see compilers burn `NOP` instructions to align a loop. That's to avoid splitting a cache line.

Sharlin 201 days ago

> For the AMD 9950, we are talking about 1280kb of L1 (per core). 16MB of L2 (per core)

Ryzen 9 CPUs have 1280kB of L1 in total. 80kB (48+32) per core, and the 9 series is the first in the entire history of Ryzens to have some other number than 64 (32+32) kilobytes of L1 per core. The 16MB L2 figure is also total. 1MB per core, same as the 7 series. AMD obviously touts the total, not per-core, amounts in their marketing materials because it looks more impressive.

monocasa 201 days ago

Yeah, the reason for that is that it's expensive in PPA for the size of an L1 cache to exceed number of ways times page size. The jump to 48kB was also a jump to 12 way set associative.

As an aside, zen 1 did actually have a 64kB (and only 4 way!) L1I cache, but changed to the page size times way count restriction with zen 2, reducing the L1 size by half.

You can also see this on the apple side, where their giant 192kB caches L1I are 12 ways with a 16kB page size.

kbolino 201 days ago

Also, rather importantly, the L1i (instruction) cache is still only 32 kB. The part that got bigger, the 48 kB of L1d (data) cache, does not count for this purpose.

gpderetta 201 days ago

Instruction caches also prefetch very well, as long as branch prediction is good. Of course on a misprediction you might also suffer a cache miss in addition to the normal penalty.

vardump 201 days ago

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

You don't need operand size prefix 0x66 when running 16 bit code in Real Mode. So "mov ax, 0" is 3 bytes and "xor ax, ax" is just 2 bytes.

eb0la 201 days ago

My fault: I just compiled the instruction with an assembler instead of looking up the actual instruction from documentation.

It makes much more sense: resetting ax, and bc (xor ax,ax ; xor bx,bx) will be 4 octets, DWORD aligned, and a bit faster to fetch by the x86 than the 3-octet version I wrote before.

Someone 201 days ago

> If you decode the instruction, it makes sense to use XOR:

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

Except, apparently, on the pentium Pro, according to this comment: https://randomascii.wordpress.com/2012/12/29/the-surprising-..., which says:

“But there was at least one out-of-order design that did not recognize xor reg, reg as a special case: the Pentium Pro. The Intel Optimization manuals for the Pentium Pro recommended “mov” to zero a register.”

qingcharles 200 days ago

That's weird, I looked it up earlier and found the P6 (Pentium Pro) was the first to actually make the xor optimization into a zero clock operation.

https://fanael.github.io/archives/topic-microarchitecture-ar...

Someone 200 days ago

A few paragraphs down from that:

“I assume that the ability to recognize that the exclusive-or zeroing idiom doesn't really depend on the previous value of a register, so that it can be dispatched immediately without waiting for the old value — thus breaking the dependency chain — met the same fate; the Pentium Pro shipped without it.

Some of the cut features were introduced in later models: segment register renaming, for example, was added back in the Pentium II. Maybe dependency-breaking zeroing XOR was added in later P6 models too? After all, it seems such a simple yet important thing, and indeed, I remember seeing people claim that's the case in some old forum posts and mailing list messages. On the other hand, some sources, such as Agner Fog's optimization manuals say that not only it was never present in any of the P6 processors, it was also missing in Pentium M.”

Anarch157a 201 days ago

I don't know enough of the 8086 so I don't know if this works the same, but on the Z80 (which means it was probably true for the 8080 too), XOR A would also clear pretty much all bits on the flag register, meaning the flags would be in a known state before doing something that could affect them.

vanderZwan 201 days ago

Which I guess is the same reason why modern Intel CPU pipelines can rely on it for pipelining.

RHSeeger 201 days ago

> the IBM PC XT days (before the 80286)

Fun fact - the IBM PC XT also came in a 286 model (the XT 286).

eb0la 201 days ago

You're right. I forgot that!

RHSeeger 200 days ago

To be fair, I only remember because that was the 2nd computer I owned.

chasd00 201 days ago

> - mov ax, 0 - needs 4 bytes (66 b8 00 00) - xor ax,ax - needs 3 bytes (66 31 c0)

iirc doesn't word alignment matter? I have no idea if this is how the IBM PC XT was aligned but if you had 4 byte words then it doesn't matter if you save a byte with xor because you wouldn't be able to use it for anything else anyway. again, iirc.

Narishma 200 days ago

No, the 8088 used in the PC has a 2 byte word size. More importantly, it only has an 8-bit data bus, so alignment didn't really matter because it fetched instructions one byte at a time.