Hacker News new | ask | show | jobs
by userbinator 4334 days ago
I wouldn't classify it as intentional nor a "bug"; probably it's more of an oversight, as it's mentioned in the article that AMD's CPUs don't have this issue. Intel should definitely be made aware of this.

We can only speculate, but it's likely that Intel has the same handling for a lot two-operand instructions. Common instructions like add, sub take two operands both of which are inputs. So Intel probably shoved popcnt into the same category to keep the processor design simple.

On the other hand, MOV doesn't read both operands either.

1 comments

Reg-Reg MOV doesn't use an ALU, though.

It would be interesting to see if the Intel C Compiler knows about this false dependency.

'icpc' (the Intel C++ compiler) has equal performance for both of the test cases, and it did choose to use different registers for each call. But it's not clear if that's by design or by chance. In some ways, that's the boring part. The interesting part (to me) is that both tests are much faster than either version with g++.

Here's icpc 14.0.3 vs g++ 4.8.1 on a Sandy Bridge E5-1620 @ 3.60GHz and a Haswell i7-4770 CPU @ 3.40GHz.

  nate@sandybridge:~/tmp$ g++ -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
  nate@sandybridge:~/tmp$ popcnt-dependency 1
  unsigned	41959360000	0.608615 sec 	17.2289 GB/s
  uint64_t	41959360000	0.82312 sec 	12.739 GB/s
  nate@sandybridge:~/tmp$ icpc -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
  nate@sandybridge:~/tmp$ popcnt-dependency 1
  unsigned	41959360000	0.182781 sec 	57.3679 GB/s
  uint64_t	41959360000	0.182638 sec 	57.4128 GB/s

  nate@haswell:~/tmp$ g++ -O3 -march=native -std=c++11 popcnt-dependency.cpp -o popcnt-dependency
  nate@haswell:~/tmp$ popcnt-dependency 1
  unsigned	41959360000	0.401225 sec 	26.1343 GB/s
  uint64_t	41959360000	0.75841 sec 	13.826 GB/s
  nate@haswell:~/tmp$ icpc -O3 -march=native -std=c++11  popcnt-dependency.cpp -o popcnt-dependency
  nate@haswell:~/tmp$ popcnt-dependency 1
  unsigned	41959360000	0.0843861 sec 	124.259 GB/s
  uint64_t	41959360000	0.0842836 sec 	124.41 GB/s
That would be incredible if true! But I think it's a bug, since the inner loop looks far too short and doesn't seem to be repeating the popcnt's. I'm not sure yet if it's a problem with the compiler or if the test case is abusing something undefined.
OK, it looks like 'icpc' has decided that it would be fastest to invert the two loops: popcnt() once, then repeat the addition 10000 times. I'm neither a language lawyer nor a friend of C++, so I'll refrain to trying to decide whether this is a legal optimization. But a liberal sprinkling of 'volatile' makes it do what was obviously intended. After this, the speeds are more comparable, although 'icpc' retains a small (but much more plausible) lead:

  nate@sandybridge:~/tmp$ popcnt-dependency 1
  unsigned	41959360000	0.517827 sec 	20.2495 GB/s
  uint64_t	41959360000	0.518041 sec 	20.2412 GB/s


  nate@haswell:~/tmp$ popcnt-dependency 1
  unsigned	41959360000	0.351273 sec 	29.8507 GB/s
  uint64_t	41959360000	0.352914 sec 	29.712 GB/s
The other test I did was checking what Intel's IACA (a wonderful optimization tool that you really should be using if you are not already) thought about the g++ loop. It did _not_ notice the false dependency, and said the loops should take the same amount of time. Do this suggest that the Intel compiler is just getting lucky, or that Intel doesn't have great internal communication between teams?
That's the problem with microbenchmarks, ensuring they're measuring what you think they're measuring.
It's definitely legal, the buffer is created with operator new and is not made visible to any other code, so the compiler knows that it can't change within the loop.