| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Solarsail 3259 days ago

Disclaimer: Going off memory of the presentations myself, not at all an expert, etc.

Ivan Goddard's descriptions of the branch predictor sounds quite simple that I can remember; The predictor itself is completely detached from the compiler, outside of the instruction layout falling into EBBs. A point that (I think is on the Mill forum somewhere) they've said about the prediction algorithim itself, is that they're doing simulations with nearly the simplest known prediction algorithm, and getting something like 90% success rates from that compared to the 98% the state-of-the-art sees. The difference is entirely in what data is carried by the prediction. (I believe this is in the forum thread on Switch statements, referenced in this video) What the predictor stores and delivers as output is the logical equivalent to an instruction pointer, and where in the EBB they exit, instead of a boolean (or an entire exit table). Which is more state, but not replacing the entire state of the processor. And, as explained in this presentation, the impact of having (effectively) an instruction pointer from the branch predictor is you know what to prefetch from memory. And that pointer, being associated with some other EBB that the predictor might know about, is also enough to ask the predictor for the next prediction, get a second prefetch, and so on.

That also has implications to a point Goddard made in this presentation, in that some entire Switch statements have only one "EBB" (although that's loose terminology itself), and thus only one prediction associated with it. The prediction is to where the block will exit, not whether any specific prediction will succeed, thus there is no case where the processor will stall multiple times before finding the correct control flow. It can only fail once before it determines the correct exit to the block (in general, at least...)

Which is I believe how they are expecting good results, rather than anything peculiar about how they determine the prediction from history.

1 comments

cpleppert 3259 days ago

I'm not sure I follow about the benefits of this branch prediction scheme in the first place. I already have a branch target from the branch target buffer( assuming I'm in a conventional processor), so either the frontend can begin immediately loading code from the cache or I stall while I load from main memory. I'm never prefetching anything(at least in terms of an initial target)

Let's assume that Mill can produce a prefetch chain though that can load additional code segments beyond the initial branch target. Does this really benefit the processor? I don't understand how a static prediction of where I will branch to next after an initial branch target is even useful. If I have already branched to that location before the code will be sitting in the L1 cache waiting for me, if it isn't, well, how does the branch predictor know more about the behavior of my program than the L1 cache?

To put it another way: if the instruction set doesn't fit in the L1 cache is there some predictable way that I can load and unload the cache just based on a given branch target( that in turn implies other branch targets will be loaded next with a total instruction set larger than the cache). I don't really think there is any possible way to predict the behavior of an entire program like this. Larger instruction segments and more branch targets imply more complex programs that are correspondingly harder to predict so I'm always better off just relying on the instruction cache.

Assuming Mill tries to load entire chains of instructions into the cache based on a simple static target I would be much better off just disabling this and relying on the instruction cache in almost all cases therefore.

igodard 3259 days ago

The target is not static; it is dynamically updated as part of history.

Prefetch chaining is to get code out of DRAM, and it runs DRAM-latency ahead of execution. Separately, fetch chaining is to get the code up to the L1, and it runs L2/L3-latency ahead of execution. Load chaining gets the lines from the L1 to the L0 micro-caches and the decoders, and runs decode-latency (3 cycles on the Mill) ahead of execution.

The Mill stages instructions like this because the further ahead in time an action is the more likely that the action is down a path that execution will not take. We don't prefetch to the L1 because we don't have enough confidence that we will need the code to be willing to spam a scarce resource like the L1. But avoiding a full DRAM hit is important too, so we stage the fetching. It doesn't matter at all in small codes that spend all their time in a five-line loop, but that's not the only kind of codes there are :-)