|
|
|
|
|
by dragontamer
60 days ago
|
|
The XOR trick is implemented as a (malloc from register file) on modern processors, implemented in the decoder and it won't even issue a uOp to the execution pipelines. Its basically free today. Of course, mov RAX, 0 is also free and does the same thing. But CPUs have limited decoder lengths per clock tick, so the more instructions you fit in a given size, the more parallel a modern CPU can potentially execute. So.... definitely still use XOR trick today. But really, let the compiler handle it. Its pretty good at keeping track of these things in practice. ----------- I'm not sure if "sub" is hard-coded to be recognized in the decoder as a zero'd out allocation from the register file. There's only certain instructions that have been guaranteed to do this by Intel/AMD. |
|
Here's html version: https://zzqcn.github.io/perf/intel_opt_manual/3.html#clearin...
AMD has similar list in "2.9.2 Idioms for Dependency removal" from "Software Optimization Guide for the AMD Zen5 Microarchitecture" document: https://docs.amd.com/v/u/en-US/58455_1.00