| HN Mirror



	by nkurz 4585 days ago
	> Best case, the core may still block predicted execution shortly after due to running out of non-dependent instructions, until it knows for sure the address it should have branched to. Worst case, the branch can't proceed until the two memory accesses complete.

	You seem very familiar with these issues, but this doesn't sound right to me. Maybe I'm not understanding your terminology, but don't all modern processors support speculative execution? All instructions (including dependent ones) are executed speculatively, but the results are held in the Reorder Buffer until the branch choice is confirmed. If this is still a large issue, why don't Eli's measurements show it?


pslam 4585 days ago

If the branch target is an address loaded from memory, and there is no cached result for the branch instruction, then there's no way it can predict which instruction to execute next. The target could be anywhere in valid memory.
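To make the "two memory accesses" concrete, here is a rough sketch of what a virtual call typically lowers to, written with an explicit function-pointer table instead of `virtual`. All names (`Shape`, `VTable`, `call_area`) are made up for illustration, and the layout is an assumption about common compiler behaviour, not any specific ABI.

```cpp
#include <cassert>

// Hypothetical types: a hand-rolled vtable so the loads are visible.
struct Shape;
using AreaFn = int (*)(const Shape*);

struct VTable {
    AreaFn area;         // slot 0: address of the function to call
};

struct Shape {
    const VTable* vptr;  // pointer to the table for this object's type
    int w;
    int h;
};

int rect_area(const Shape* s) { return s->w * s->h; }
int tri_area(const Shape* s)  { return s->w * s->h / 2; }

const VTable rect_vtable{rect_area};
const VTable tri_vtable{tri_area};

int call_area(const Shape* s) {
    assert(s != nullptr && s->vptr != nullptr);
    const VTable* vt = s->vptr;  // memory access 1: object -> vtable pointer
    AreaFn target = vt->area;    // memory access 2: vtable -> function address
    return target(s);            // indirect branch: the real target is only
                                 // known once both loads have completed
}
```

With a predictor entry for the call site, the core can guess `target` and run ahead before the loads finish. Without one, the second load depends on the first and the branch depends on the second, which is the worst case described above.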

The reason the measurements don't show it is that the branches in the micro-benchmark will be predicted very well. In fact it's quite difficult to defeat prediction even for giant codebases, and you probably have bigger issues with L1 thrashing at that point. The more subtle problem is that even with prediction, there's a (quite high) limit to the number of unretired speculated instructions. Again, a micro-benchmark won't show that; you'd need a large function in the inner loop.
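One way to probe the prediction point is to run the same virtual calls in a predictable order and in a shuffled order and compare timings. The following is only a sketch: the class names, the fixed seed, and the `time_ms` helper are assumptions for illustration, and it is not Eli's benchmark.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <random>
#include <vector>

struct Op {
    virtual ~Op() = default;
    virtual std::uint64_t apply(std::uint64_t x) const = 0;
};
struct AddOne : Op {
    std::uint64_t apply(std::uint64_t x) const override { return x + 1; }
};
struct AddTwo : Op {
    std::uint64_t apply(std::uint64_t x) const override { return x + 2; }
};

// Build n objects: the first n/2 are AddOne, the rest AddTwo. Unshuffled,
// the call target changes only once; shuffled, it changes unpredictably.
std::vector<std::unique_ptr<Op>> make_ops(std::size_t n, bool shuffled) {
    std::vector<std::unique_ptr<Op>> ops;
    ops.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        if (i < n / 2) ops.push_back(std::make_unique<AddOne>());
        else           ops.push_back(std::make_unique<AddTwo>());
    }
    if (shuffled) {
        std::mt19937 rng(12345);  // fixed seed so runs are repeatable
        std::shuffle(ops.begin(), ops.end(), rng);
    }
    return ops;
}

// One virtual call per element; the result feeds the next call.
std::uint64_t run(const std::vector<std::unique_ptr<Op>>& ops) {
    std::uint64_t acc = 0;
    for (const auto& op : ops) acc = op->apply(acc);
    return acc;
}

// Wall-clock time for a single pass, in milliseconds.
double time_ms(const std::vector<std::unique_ptr<Op>>& ops,
               std::uint64_t& result) {
    auto start = std::chrono::steady_clock::now();
    result = run(ops);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Comparing `time_ms(make_ops(n, false), r)` with `time_ms(make_ops(n, true), r)` for a large `n` gives a rough idea of what misprediction costs. Note that shuffling also scatters the object accesses in memory, so cache effects are mixed into the difference, and the tiny function bodies say nothing about the limit on unretired instructions.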

I'm making it sound like there's no cost to virtual functions in real applications, but it's there, usually measurable, and every little bit adds up. If anything, I think a better reason not to simply spray "virtual" everywhere is that it demonstrates the author didn't understand the data structures they created.

nkurz 4585 days ago

On the other hand, having no cached result for the branch probably correlates strongly with not having the target in I-cache, which means you may be stalling out anyway. It also implies that the branch is not in the middle of a tight loop.

Regarding the size of the ROB, I was wondering about it a while ago and found an interesting post from someone who measured it for a range of processors: Ivy Bridge (168), Sandy Bridge (168), Lynnfield (128), Northwood (126), Yorkfield (96), Palermo (72), and Coppermine (40). http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/

I agree with you on the virtual part. I'm actually a C programmer more interested in how to implement efficient dispatch for interpreters. Eli (the original author) has some good posts on that as well: http://eli.thegreenplace.net/2012/07/12/computed-goto-for-ef...
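For anyone who hasn't seen the two dispatch styles side by side, here is a minimal sketch of a toy bytecode loop written both ways. The opcode set and function names are invented for this example and are not taken from Eli's post. The computed-goto version relies on the labels-as-values extension in GCC and Clang, so it falls back to the switch version on other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy instruction set operating on a single accumulator.
enum Opcode : unsigned char { OP_INC, OP_DEC, OP_DOUBLE, OP_HALT };

// Classic dispatch: every instruction goes through the same switch,
// so one indirect branch serves all opcodes.
int run_switch(const std::vector<unsigned char>& code) {
    int acc = 0;
    for (std::size_t pc = 0; pc < code.size(); ++pc) {
        switch (code[pc]) {
            case OP_INC:    ++acc;    break;
            case OP_DEC:    --acc;    break;
            case OP_DOUBLE: acc *= 2; break;
            case OP_HALT:   return acc;
            default:        assert(false && "unknown opcode"); return acc;
        }
    }
    return acc;
}

#if defined(__GNUC__)
// Computed-goto dispatch (GCC/Clang extension, not standard C++).
// Assumes the program contains only valid opcodes and ends with OP_HALT;
// there is no bounds check on pc or on the table index.
int run_goto(const std::vector<unsigned char>& code) {
    static void* const table[] = { &&do_inc, &&do_dec, &&do_double, &&do_halt };
    int acc = 0;
    std::size_t pc = 0;
#define DISPATCH() goto *table[code[pc++]]
    DISPATCH();
do_inc:    ++acc;    DISPATCH();
do_dec:    --acc;    DISPATCH();
do_double: acc *= 2; DISPATCH();
do_halt:   return acc;
#undef DISPATCH
}
#else
// Fallback for compilers without labels-as-values.
int run_goto(const std::vector<unsigned char>& code) { return run_switch(code); }
#endif
```

The usual argument for the second form is that each handler ends in its own indirect jump, so the predictor can keep separate history per handler instead of sharing one branch for every opcode. Whether that shows up as a speedup depends on the compiler and the CPU, so it is worth measuring rather than assuming.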