by rayiner
4713 days ago
The P6 has only two read ports in its permanent register file for operand values: http://www.cs.tau.ac.il/~afek/p6tx050111.pdf (p. 36). P-M upped it to three, and Sandy Bridge removed the limitation completely. Intel's optimization manual describes the stall: http://www.intel.com/content/dam/doc/manual/64-ia-32-archite... (3.5.2.1, "ROB Read Port Stalls"). It gives examples of the stall occurring when, e.g., often-used constants are kept in registers, or when a load is hoisted "too high" and the value "goes cold" before its consumers use it. Agner Fog's microarchitecture manual has discussions starting on p. 69 and p. 84: http://www.agner.org/optimize/microarchitecture.pdf. Note his use of an unnecessary MOV to "refresh" a register to avoid the stall. I only glanced at the code quickly, but the comment about how he got rid of a load by holding a value in a register made me think the load was keeping the value from "going cold." Of course, I didn't profile it so I'm probably completely wrong...
|
In designs which rename using the ROB, the register file holds values produced by instructions which are completed and retired, the ROB holds values from instructions that are completed but not retired, and the bypass network supplies values from instructions currently completing.
What Agner is doing in his example with the seemingly useless instruction is transferring a value from the register file to the ROB, so that instructions which try to read logical register ECX will now source it from the ROB instead of the register file. But when I look at the code in the Stack Overflow question, nothing actually reads from s1. So these are even "more useless" instructions than Agner's example.
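To make the register-file-versus-ROB distinction concrete, here is a toy model in Python. It is only a sketch under loud assumptions, not a description of any real core: the function name `count_read_port_stalls`, the `rob_lifetime` parameter (how many cycles a value is assumed to stay in the ROB before retiring), the rule that a register counts once per rename group, and the hand-picked grouping of instructions are all made up for illustration. The only figure taken from the thread is the two read ports.

```python
from math import ceil

def count_read_port_stalls(groups, read_ports=2, rob_lifetime=3):
    """Count extra cycles lost to register-file read-port limits (toy model).

    groups: list of rename groups; each group is a list of (dest, sources)
            tuples, where dest is a register name or None.
    A source is "cold" (read from the register file) unless it was written
    earlier in the same group or within the last `rob_lifetime` cycles.
    """
    last_write = {}  # register -> cycle in which it was last renamed as a dest
    cycle = 0
    stalls = 0
    for group in groups:
        cold = set()
        written_in_group = set()
        for dest, sources in group:
            for reg in sources:
                if reg in written_in_group:
                    continue  # produced in this group, never touches the file
                written_at = last_write.get(reg)
                if written_at is None or cycle - written_at > rob_lifetime:
                    cold.add(reg)  # value has retired: needs a read port
            if dest is not None:
                written_in_group.add(dest)
        # Assumption: each extra batch of `read_ports` cold reads costs a cycle.
        extra = max(0, ceil(len(cold) / read_ports) - 1)
        stalls += extra
        cycle += extra
        for dest, _ in group:
            if dest is not None:
                last_write[dest] = cycle
        cycle += 1
    return stalls

# Second group reads three long-lived registers: ebx, ecx, edx.
hot_group = [("eax", ["ebx", "ecx"]), ("eax", ["eax", "edx"]), (None, ["eax"])]

baseline = [
    [("esi", ["edi"]), ("esi", ["esi"])],
    hot_group,
]
refreshed = [
    # Same as above plus a "useless" mov ecx, ecx that uses the spare port.
    [("esi", ["edi"]), ("esi", ["esi"]), ("ecx", ["ecx"])],
    hot_group,
]

print(count_read_port_stalls(baseline))   # 1
print(count_read_port_stalls(refreshed))  # 0
```

In the baseline the second group needs three register-file reads with only two ports, so the model charges one stall cycle. In the refreshed version the extra MOV reads ECX in a group that had a port to spare, so ECX is sourced from the ROB when the second group needs it. Note that the trick only helps because something later reads ECX, which is exactly what is missing in the Stack Overflow code.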
Some people have already mentioned instruction alignment issues, so that is one likely explanation. There are a whole bunch of other possible issues involving the scheduler and dispatch restrictions. For example, I've seen processors where there were two pipelines with slightly different instruction schedulers. So adding a useless instruction like this might push your bottleneck instruction into a pipe with a scheduler that is slightly better for your code. Sometimes bypassing across different pipes is more expensive than within the same pipe, so again the useless instruction might push some instructions into pipes that have more of their sources. It could be any one of a number of reasons, and it's going to be very hard to tell from the outside without knowing the details of the microarchitecture.