Yep. The XOR trick, which relies on a special use of an opcode rather than a special register, is probably related to the limited number of general-purpose registers in typical 1970s-era CPU designs (8080, 6502, Z80, 8086).
Unfortunately, the 6502 can't XOR the accumulator with itself. I don't recall if the Z80 can, and loading an immediate 0 would be most efficient on those anyway.
XOR A absolutely works on Z80 and it's of course faster and shorter than loading a zero value with LD A,0.
LD A,0 is encoded in 2 bytes, while XOR A is encoded as a single-byte opcode.
XOR A has the additional benefit of also clearing the carry flag (it resets N and H as well, and sets Z and P/V because the result is zero with even parity). SUB A will also clear the accumulator, but it always sets the N flag on the Z80.
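To make the XOR A versus SUB A comparison concrete, here is a rough Python model of the documented Z80 flag results and the encodings mentioned above. The function and constant names are made up for illustration, and the undocumented flag bits (3 and 5) are ignored.

```python
# Documented Z80 flag bits in the F register (bits 3 and 5 are ignored here).
FLAG_S, FLAG_Z, FLAG_H, FLAG_PV, FLAG_N, FLAG_C = 0x80, 0x40, 0x10, 0x04, 0x02, 0x01

# Encodings of the three ways to zero A discussed in the thread.
ENCODING = {
    "LD A,0": bytes([0x3E, 0x00]),  # opcode + immediate byte
    "XOR A": bytes([0xAF]),
    "SUB A": bytes([0x97]),
}

def xor_a(a):
    """Model XOR A: returns (new A, F). The result is always 0."""
    result = (a ^ a) & 0xFF
    flags = 0                       # XOR resets H, N and C
    if result == 0:
        flags |= FLAG_Z
    if bin(result).count("1") % 2 == 0:
        flags |= FLAG_PV            # P/V reports parity for logical ops
    return result, flags

def sub_a(a):
    """Model SUB A: returns (new A, F). The result is always 0."""
    result = (a - a) & 0xFF
    flags = FLAG_N                  # subtraction always sets N
    if result == 0:
        flags |= FLAG_Z
    # A - A can never borrow, half-borrow or overflow, so C, H and P/V stay 0.
    return result, flags

if __name__ == "__main__":
    for name, op in (("XOR A", xor_a), ("SUB A", sub_a)):
        a, f = op(0x5A)
        print(f"{name}: A={a:02X} F={f:02X} bytes={len(ENCODING[name])}")
```

In this model both instructions leave A at zero with carry clear; the difference is N (set only by SUB A) and P/V (set only by XOR A).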
Yeah, the article seems to have missed what is likely the biggest reason this is the popular x86 idiom: it was already the popular 8080/Z80 idiom from the CP/M era, and there's a direct line from those chips to the 8086. A bunch of early 8086 DOS applications were mechanically translated assembly code, so while they are "different" architectures they're still solidly related.
The 6502 gets by with an immediate load: 2 clock cycles, 2 bytes (frequently followed by a single-byte register transfer instruction). Out of curiosity I did a quick scan of the MOS 1.20 ROM of the BBC Micro:
Are you sure you're not an LLM? There is no way anybody writing 6502 would do anything else, because there's no other way to do it.
(You can squeeze in a cheeky Txx instruction afterwards to get a 2-or-more-for-1, if that would be what you need - but this only saves bytes. Every instruction on the 6502 takes 2+ cycles! You could have done repeated immediate loads. The cycle count would be the same and the code would be more general.)
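A quick tally of the two approaches, as a sketch: the byte and cycle figures below are the standard NMOS 6502 values for immediate loads and register transfers, and the `COST` table and `total` helper are names invented for this example.

```python
# (bytes, cycles) for each instruction on an NMOS 6502.
COST = {
    "LDA #0": (2, 2),
    "LDX #0": (2, 2),
    "LDY #0": (2, 2),
    "TAX": (1, 2),
    "TAY": (1, 2),
}

def total(sequence):
    """Sum the size and cycle count of a list of instructions."""
    size = sum(COST[insn][0] for insn in sequence)
    cycles = sum(COST[insn][1] for insn in sequence)
    return size, cycles

# Zero A, X and Y: one load plus transfers, versus three loads.
with_transfers = total(["LDA #0", "TAX", "TAY"])
with_loads = total(["LDA #0", "LDX #0", "LDY #0"])

if __name__ == "__main__":
    print("transfers:", with_transfers)
    print("loads:    ", with_loads)
```

The transfer version is 2 bytes shorter, but both sequences take the same number of cycles, which is the point made above.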
I suppose using Txx instructions rather than LDx is more of an idiom than intended to conserve space. Also, could an LDx #0 potentially be 3 cycles in the edge case where the PC crosses a page boundary? (I'm probably confused? Red herring?)
I don't know how the 6502's PC increment actually worked, but it was an exception to the general rule of page crossings (or the possibility thereof) incurring a penalty, or, as was also sometimes the case, just ignored entirely. (One big advantage of the latter approach: doing nothing does take 0 cycles.)
The full 16 bits would be incremented after each instruction byte was fetched, and it didn't cost any extra if there was a carry out of the low byte into the high byte.
And [as mentioned in the article] even modern x86 implementations have a zero register. So you have this weird special opcode that (when used with identical source and destination) only triggers register renaming.
A move on SPARC is technically an OR of the source with the zero register. "mov %l0, %l1" is assembled as "or %g0, %l0, %l1". So if you want to zero a register you OR %g0 with itself.
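The SPARC behaviour can be sketched with a toy register file in Python. This is only an illustration of the idea, not an emulator: the class and function names are invented, and the register numbers assume the usual SPARC layout where %g0 is register 0 and %l0/%l1 are registers 16/17.

```python
class RegisterFile:
    """Toy 32-entry register file where register 0 (%g0) is hardwired to zero."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, n):
        return 0 if n == 0 else self._regs[n]

    def write(self, n, value):
        if n != 0:                      # writes to %g0 are discarded
            self._regs[n] = value & 0xFFFFFFFF

G0, L0, L1 = 0, 16, 17                  # assumed register numbering

def op_or(regs, rs1, rs2, rd):
    """or rs1, rs2, rd"""
    regs.write(rd, regs.read(rs1) | regs.read(rs2))

def mov(regs, rs, rd):
    """mov rs, rd  ==  or %g0, rs, rd"""
    op_or(regs, G0, rs, rd)

def clr(regs, rd):
    """Zero a register: or %g0, %g0, rd"""
    op_or(regs, G0, G0, rd)

if __name__ == "__main__":
    regs = RegisterFile()
    regs.write(L0, 0x1234)
    mov(regs, L0, L1)
    print(hex(regs.read(L1)))
    clr(regs, L1)
    print(hex(regs.read(L1)))
```

Because %g0 always reads as zero, the OR leaves the other operand unchanged, so no dedicated move or clear opcode is needed.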