
by cyberferret 3417 days ago

Very early on in my computing career (nearly 40 years ago now), I remember being blown away when an older IBM System/360 programmer showed me how you can swap the values of two variables WITHOUT using a third placeholder variable, using pure XOR.

I didn't believe him until he showed me. Apparently they used to use it all the time to swap out entire segments of RAM in the S/360 without having to page out to disk or clobber other free RAM segments. It is simply:

  a = a xor b
  b = b xor a
  a = a xor b

Three steps, same as using a placeholder 'c' variable. I think he mentioned that on most of the processors of the time, 3 XOR instructions actually ran faster than 3 MOV instructions.

Illustration for the non-believers:

  a = 10010110
  b = 01100011

  a = a xor b

  a = 11110101
  b = 01100011 (unchanged)

  b = b xor a

  a = 11110101 (unchanged)
  b = 10010110

  a = a xor b

  a = 01100011
  b = 10010110

Voila!
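For anyone who wants to replay the illustration in code, here is a minimal C sketch. The function names `xor_swap` and `xor_swap_walkthrough` are made up for this example; 10010110 and 01100011 are written as 0x96 and 0x63, and the asserts check each intermediate value from the walkthrough above.

```c
#include <assert.h>
#include <stdint.h>

/* Swap two distinct objects using three XORs and no temporary. */
void xor_swap(uint8_t *a, uint8_t *b)
{
    *a ^= *b;   /* a = a0 ^ b0 */
    *b ^= *a;   /* b = b0 ^ (a0 ^ b0) = a0 */
    *a ^= *b;   /* a = (a0 ^ b0) ^ a0 = b0 */
}

/* Replays the illustration step by step: 10010110 = 0x96, 01100011 = 0x63. */
void xor_swap_walkthrough(void)
{
    uint8_t a = 0x96, b = 0x63;

    a ^= b;
    assert(a == 0xF5 && b == 0x63);   /* a = 11110101, b unchanged */

    b ^= a;
    assert(a == 0xF5 && b == 0x96);   /* b now holds the original a */

    a ^= b;
    assert(a == 0x63 && b == 0x96);   /* a now holds the original b */
}
```

Calling `xor_swap(&a, &b)` on two different variables gives the same end state as the walkthrough. The pointer version assumes the two pointers refer to different objects.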

6 comments

onlydnaq 3417 days ago

This is a classic trick, and as you write, it could be used for performance benefits in the old days.

However, the semantics differ from just using a temporary variable in that if a and b are in the same memory location then the result will be zero.

This was used in an entry for the Underhanded C Contest [1], if I remember correctly, where for an implementation of RC4 the author defined the following macro:

    #define SWAP(x, y) do { x^=y; y^=x; x^=y; } while (0)

And used it for swapping the values in the substitution table for the cipher, e.g. SWAP(S[i], S[j]). The weakness was that since sometimes the indices are the same in RC4 the substitution table would be gradually replaced with zeroes.
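A small C sketch of the same-location failure, under some assumptions: the macro is written with a trailing `while (0)` and parenthesized arguments so it compiles as a single statement, the four-entry table is made up for the demo, and `xor_swap_guarded` and `swap_then_read` are hypothetical helper names, not anything from the contest entry.

```c
#include <stdint.h>

/* XOR swap macro in the style described above. */
#define SWAP(x, y) do { (x) ^= (y); (y) ^= (x); (x) ^= (y); } while (0)

/* Hypothetical guarded variant: do nothing when both pointers
   name the same object, so the value is never XORed with itself. */
void xor_swap_guarded(uint8_t *x, uint8_t *y)
{
    if (x != y) {
        *x ^= *y;
        *y ^= *x;
        *x ^= *y;
    }
}

/* Runs SWAP(S[i], S[j]) on a small made-up table and returns S[i]. */
uint8_t swap_then_read(unsigned i, unsigned j)
{
    uint8_t S[4] = {10, 20, 30, 40};
    SWAP(S[i], S[j]);
    return S[i];
}
```

With distinct indices, `swap_then_read(1, 2)` returns 30, as a normal swap would. With equal indices, `swap_then_read(1, 1)` returns 0 instead of 20, because the first step computes `S[1] ^ S[1]`. The guarded function leaves the value alone in that case, at the cost of a branch.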

[1] http://www.underhanded-c.org/_page_id_16.html

eutectic 3417 days ago

I like to think about this algebraically:

    a1 = a0 ^ b0
    b1 = b0 ^ a1 = b0 ^ (a0 ^ b0) = (a0 ^ b0) ^ b0 = a0
    a2 = a1 ^ b1 = (a0 ^ b0) ^ a0 = (b0 ^ a0) ^ a0 = b0

The xor-linked list (https://en.wikipedia.org/wiki/XOR_linked_list) is another neat idea along the same theme.

simias 3417 days ago

I have a similar memory from my early days of coding.

I was implementing a toy RC4 cipher. One of the steps in the cipher involves swapping entries in an array of 256 elements. I thought "hey, I'm a 1337 coder, I'm going to use the xor trick".

Except it doesn't work if you're trying to swap something with itself. If you do "a[i] ^= a[j]" and i == j then you're just clearing the entry.

Taught me the valuable lesson that I shouldn't try to be a smartass while writing code and the importance of unit tests.

taejo 3417 days ago

This was the basis of an excellent entry [0] in the 2007 Underhanded C Contest. It had a correct implementation of the RC4 encryption algorithm, except it used the XOR swap, so on average one byte of the pseudorandom state was zeroed every 256 iterations. Eventually, the state is all zeroes, and the encryption just outputs pure plaintext. Best of all, the first few kilobytes of output looks random at first glance.
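To make the effect concrete, here is a rough C sketch of the RC4 key-scheduling loop with a switch between the XOR swap and a temporary-variable swap. This is not the contest entry's code: the names `rc4_ksa` and `count_zeros` and the `use_xor_swap` flag are invented for illustration, and only the key-scheduling step is shown, not the output generation loop the comment describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RC4 key scheduling. If use_xor_swap is nonzero, entries are exchanged
   with the XOR trick, which zeroes S[i] whenever i == j. */
void rc4_ksa(uint8_t S[256], const uint8_t *key, size_t keylen, int use_xor_swap)
{
    assert(keylen > 0);

    for (int i = 0; i < 256; i++)
        S[i] = (uint8_t)i;

    uint8_t j = 0;
    for (int i = 0; i < 256; i++) {
        /* The cast keeps j in 0..255, i.e. arithmetic modulo 256. */
        j = (uint8_t)(j + S[i] + key[i % keylen]);
        if (use_xor_swap) {
            S[i] ^= S[j];
            S[j] ^= S[i];
            S[i] ^= S[j];
        } else {
            uint8_t tmp = S[i];
            S[i] = S[j];
            S[j] = tmp;
        }
    }
}

/* Counts zero entries. A valid RC4 state is a permutation of 0..255,
   so it contains exactly one zero. */
int count_zeros(const uint8_t S[256])
{
    int n = 0;
    for (int i = 0; i < 256; i++)
        n += (S[i] == 0);
    return n;
}
```

With the temporary-variable swap the state always stays a permutation, so `count_zeros` returns 1. With the XOR swap, every iteration where `i == j` turns an entry into zero and the zero never goes away. A contrived two-byte all-zero key forces `i == j` at `i = 1`, so the XOR version ends with at least two zeros for that key.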

[0] http://www.underhanded-c.org/_page_id_16.html

cyberferret 3417 days ago

Nice catch - I hadn't thought about in-place swap situations. I have always used this trick to switch two variables, but yes, I can see that when you are talking about lists and matrix array manipulation, you can easily encounter an edge case which necessitates an 'in place' swap, which would fail under these circumstances.

amelius 3417 days ago

You can do the same with addition/subtraction. Perhaps there is a simpler way, but here is one way to do it:

    a = a + b
    b = b + a
    a = b - a
    b = b - 2*a

mcintyre1994 3417 days ago

    a = a + b
    b = a - b
    a = a - b

I think this works too, assuming a+b doesn't overflow.

amelius 3417 days ago

Good one.

> I think this works too, assuming a+b doesn't overflow.

Well, in two's complement arithmetic (as is used on most architectures), the intermediate overflow can be ignored, and it will work just fine.

Sacho 3417 days ago

You have to take overflow into account. In most cases, that would make the algorithm not very useful compared to just doing the swap with a third variable.

amelius 3417 days ago

In two's complement arithmetic, you are basically computing modulo N (N = 2 to the power of the number of bits). So overflow will not interfere with the swap operation.
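A hedged C sketch of the modulo-N argument, using the three-step add/subtract version from earlier in the thread. One assumption worth flagging: signed integer overflow is undefined behavior in C, so this sketch uses unsigned types, where wraparound is defined as arithmetic modulo 2^bits. The function names are made up, and the 4-bit case is emulated by masking with 0xF.

```c
#include <stdint.h>

/* Add/subtract swap on unsigned 8-bit values. Each cast back to
   uint8_t reduces the result modulo 256. */
void addsub_swap(uint8_t *a, uint8_t *b)
{
    *a = (uint8_t)(*a + *b);   /* a = a0 + b0 (mod 256) */
    *b = (uint8_t)(*a - *b);   /* b = a0 */
    *a = (uint8_t)(*a - *b);   /* a = b0 */
}

/* Same steps on 4-bit values (0..15), emulated by masking with 0xF. */
void addsub_swap4(unsigned *a, unsigned *b)
{
    *a = (*a + *b) & 0xFu;     /* a = a0 + b0 (mod 16) */
    *b = (*a - *b) & 0xFu;     /* b = a0 */
    *a = (*a - *b) & 0xFu;     /* a = b0 */
}
```

For the 4-bit pair 14 and 15, the first step stores 29 mod 16 = 13, the second recovers 14, and the third recovers 15, so the values end up swapped even though the sum does not fit. Like the XOR version, this assumes the two pointers refer to different objects.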

JoeAltmaier 3417 days ago

Hm I don't think overflow will apply. Anything it does in an add will be undone by a subtract?

Except that 2a, that might be trouble.

mcintyre1994 3417 days ago

Imagine you only had 4-bit numbers (range 0-15) and you tried to swap 14 and 15 (1110, 1111). You can do that with xor but not with the add method, because you can't store a + b without a wider variable.

jonsen 3417 days ago

15 + 15 gives 14

14 - 15 gives 15

mcintyre1994 3417 days ago

The way I wrote this was rubbish, edited. I was trying to get at if you have 1110, 1111 then 1110 + 1111 will overflow, but you could xor them.

wolfgke 3417 days ago

This also works with `sub` instead of xor. Nevertheless, this code (xor swap, sub swap) is better not used on a modern CPU, since it pipelines badly.

Khoth 3417 days ago

Yes, in fact compilers these days are smart enough to convert people's xor swaps into mov swaps: https://godbolt.org/g/FYv7xQ

wolfgke 3417 days ago

When I look at the code that several of the compilers (gcc, clang, icc) generate

  mov     eax, edi
  mov     edi, esi
  mov     esi, eax

I would intuitively use the `xchg` instruction that x86-32/x86-64 provides instead. Is there a specific reason why the compiler(s) decide to generate the mentioned code instead?

Someone 3417 days ago

When used to swap with data in memory, the reason is that it is faster. xchg is atomic. To do that, it implicitly locks its target address. That makes it slower than the series of moves. http://www.agner.org/optimize/instruction_tables.pdf:

"Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand."

wolfgke 3417 days ago

I know that. But this is only relevant if you exchange registers with memory and is not of relevance if you exchange two registers. I accept that this is a good point if some variables are moved to the stack because of register spilling or because you want to use the address of the variable (which is not the case here).

So I still stand by my point: What is the reason why the compiler uses `mov` for exchanging two registers here instead of `xchg`?

Someone 3417 days ago

I think that's because (at least on some CPUs) it takes three macro-operations. http://www.agner.org/optimize/microarchitecture.pdf (section 17.4, page 188):

"Vector path instructions are less efficient than single or double instructions because they require exclusive access to the decoders and pipelines and do not always reorder optimally. For example:

    ; Example 17.1. AMD instruction breakdown
    xchg  eax, ebx   ; Vector path, 3 ops
    nop              ; Direct path, 1 op
    xchg  ecx, edx   ; Vector path, 3 ops
    nop              ; Direct path, 1 op

This sequence takes 4 clock cycles to decode because the vector path instructions must decode alone."

JoeAltmaier 3417 days ago

Some of that is weird placeholders for the debugger, for inserting hook instructions or whatnot?

eutectic 3417 days ago

I think you mean a combination of sub and add (e.g. sub sub add). xor is somewhat special in that it is its own inverse.

wolfgke 3417 days ago

Yes, you are right - I was a little absent-minded.

louthy 3417 days ago

Yep, that's the 'classic' usage for me. Used to use this all the time if registers were short (and they always were).