Hacker News new | ask | show | jobs
by rayiner 90 days ago
Your mental model is close. Predictors generally work by having some sort of table of predictions and indexing into that table (usually using some sort of hashing) to obtain the predictions.

The simplest thing to do is use the address of the branch instruction as the index into the table. That way, each branch instruction maps onto a (not necessarily unique) entry in the table. Those entries will usually be a two-bit saturating counter that predicts either taken, not taken, or unknown.

But you can add additional information to the key. For example, a gselect predictor maintains a shift register with the outcome of the last M branches. Then it combines that shift register along with the address of the branch instruction to index into the table: https://people.cs.pitt.edu/~childers/CS2410/slides/lect-bran... (page 9). That means that the same branch instruction will map to multiple entries of the table, depending on the pattern of branches in the shift register. So you can get different predictions for the same branch depending on what else has happened.

That, for example, let’s you predict small-iteration loops. Say you have a loop inside a loop, where the inner loop iterates 4 times. So you’ll have a taken branch (back to the loop header) three times but then a not-taken branch on the fourth. If you track that in the branch history shift register, you might get something like this (with 1s being taken branches):

11101110

If you use this to index into a large enough branch table, the table entries corresponding to the shift register ending in “0111” will have a prediction that the branch will be not taken (i.e. the next outcome will be a 0) while the table entries corresponding to the shift register ending in say “1110” will have a prediction that the next branch will be taken.

So the basic principle of having a big table of branch predictions can be extended in many ways by using various information to index into the table.

3 comments

Yeah, the "two-bit saturating counter" thing is pretty much exactly how I thought it worked (which would be terrible for the example in the article), but I had no idea about the fact that it also kept track of the branch history thing, and how that maps to different branch predictor entries. Thanks for the link, that's really fascinating!
It seems like that would struggle with detecting how many layers of branching to pay attention to. Imagine the two nested loops surrounded by a randomized one. Wouldn't that implementation keep hitting patterns it hadn't seen before?

Obviously that must be a solved problem; I'd be curious to know what the solution is.

Can you walk through why you think the random outer loop would interfere?

If the inner loop's behaviour is predictable no matter the outer loop, then because the branch predictor is keyed by instruction address, it can be predicted. Only the inner loop's history is considered.

Or maybe I'm misunderstanding what code you're imagining?

It seems likely I've fundamentally misunderstood gselect or some other aspect here.

If you only look at the history of a single address independently then don't shallow nested loops with dependent behavior completely break the entire scheme?

Whereas with global history I think it mostly works. Maybe? But in that case what happens when the inner branch is trivially predictable but shallow and enclosed by ones that aren't?

In the linked article the branch always has the same address. AMD starts gradually degrading at 30k. Doesn't that indicate 16 bits of history? So for a single trivially predictable inner branch wouldn't you expect an initial 14 mispredictions? That seems like a lot.

My (I suspect very flawed) mental model here is global branch history with a 16 bit shift register and a 16 bit table, the latter keyed on hash( register ) ^ hash( address ) to explain the observed behavior.

Ahh maybe I am the one who is misunderstanding... I'm by no means an expert on this (and it is quite complex)
Looking at the linked slides again my mental model is what it refers to as "gshare". A tournament predictor between that and "local" addresses the two scenarios I posed but still has obvious failure modes.

Local: if( rng() ){ if( true ){ ... }}

Global: if( f ){} if( !f ){}

Trashed state: if( rng() ){} if( f ){} if( !f ){}

But notice that the third happens naturally (ie no need for an RNG) any time the history depth doesn't match up nicely with the looping pattern. Hence my initial question about how real world implementations determine how many layers to pay attention to. You could solve it with a tree structure but do hardware implementers go that far?

might be but what real code does that ?
It even gets much more sophisticated than that. Even the first Ryzen had something perceptron-based (yes, neuronal network!) and there are several predictors + logic to pick the best one for a branch.