by microarchitect
4713 days ago
Sorry, this explanation is almost surely incorrect. How long a value is available on the bypass network is determined by how long the instruction that produces it takes to "exit" the pipeline. I can't imagine any scenario where a consumer instruction causes a producer instruction (i.e., an instruction "ahead" of it) to stall. Note that this would be a dangerous design point because of the risk of deadlocks. What's the source for your claim that the Core uarch's register file is underdesigned in comparison to the dispatch width? I'd be extremely surprised if this were the case. Last time I looked at the data, about 50-70% of the reads went to the register file, not the bypass network.
|
Intel's optimization manual describes the stall: http://www.intel.com/content/dam/doc/manual/64-ia-32-archite... (3.5.2.1, "ROB Read Port Stalls").
It gives examples of the stall occurring when, e.g., often-used constants are kept in registers, or when a load is hoisted "too high" and the value "goes cold" before its consumers use it.
Agner Fog's microarchitecture manual has a discussion starting on pp. 69 and 84: http://www.agner.org/optimize/microarchitecture.pdf. Note his use of an otherwise unnecessary MOV to "refresh" a register and avoid the stall.
I only glanced at the code quickly, but the comment about how he got rid of a load by holding a value in a register made me think the load was keeping the value from "going cold." Of course, I didn't profile it so I'm probably completely wrong...
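To make the mechanism concrete, here's a toy Python model of what I mean by a value "going cold." It's only a sketch: the function name `read_port_stalls`, the `ports=3` limit, and the `warm_cycles=2` window are all assumptions I picked for illustration, not numbers taken from either manual, and the model ignores almost everything a real pipeline does.

```python
from math import ceil

def read_port_stalls(groups, ports=3, warm_cycles=2):
    """Toy model: count extra cycles caused by reading "cold" registers.

    groups      -- list of issue groups, one per cycle; each group is a
                   list of (dest, [sources]) tuples.
    ports       -- hypothetical number of register-file read ports.
    warm_cycles -- hypothetical number of cycles a freshly written value
                   stays available without needing a read port.
    """
    last_write = {}   # register -> index of the group that last wrote it
    stalls = 0
    for cycle, group in enumerate(groups):
        cold = set()
        for _dest, sources in group:
            for reg in sources:
                written = last_write.get(reg)
                # A register is "cold" if it was never written in this
                # trace or was written too long ago to still be warm.
                if written is None or cycle - written > warm_cycles:
                    cold.add(reg)
        # Each distinct cold register needs one read port; reads beyond
        # the port limit cost extra cycles. (Simplification: stalls do
        # not shift the cycle numbers of later groups.)
        if cold:
            stalls += ceil(len(cold) / ports) - 1
        for dest, _sources in group:
            if dest is not None:
                last_write[dest] = cycle
    return stalls

# Four long-lived constants read in one cycle: one read too many.
cold_trace = [[("r8", ["r0", "r1"]), ("r9", ["r2", "r3"])]]

# Same work, but preceded by a "mov r3, r3" that rewrites r3.
refreshed_trace = [[("r3", ["r3"])]] + cold_trace

print(read_port_stalls(cold_trace))       # 1
print(read_port_stalls(refreshed_trace))  # 0
```

In this model the extra MOV costs an issue slot but removes the stall, which is the trade-off the "refresh" trick is making. Whether a redundant load has the same effect in the code under discussion is exactly the part I haven't measured.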