| HN Mirror



	by choudanu4 2606 days ago
	“AMD’s primary advertised improvement here is the use of a TAGE predictor, although it is only used for non-L1 fetches. This might not sound too impressive: AMD is still using a hashed perceptron prefetch engine for L1 fetches, which is going to be as many fetches as possible, but the TAGE L2 branch predictor uses additional tagging to enable longer branch histories for better prediction pathways. This becomes more important for the L2 prefetches and beyond, with the hashed perceptron preferred for short prefetches in the L1 based on power.”

	I found this paragraph confusing: is it talking about data prefetchers (which would make sense b/c of the mention of short prefetches) or branch predictors (which would make sense b/c of the mention of TAGE and perceptron)?

1 comments

derefr 2606 days ago

A little of both. My understanding of the above paragraph is that the L1 predictor is trying to predict which code-containing cache lines need to stay loaded in L1, and which can be released to L2, by determining which branches from L1 cache-lines to L1 cache-lines are likely to be taken in the near future. Since L1 cache lines are so small, the types of jumps that can even be analyzed successfully have very short jump distances—i.e. either jumps within the same code cache-line, or to its immediate neighbours. The L1 predictor doesn’t bother to guess the behaviour of jumps that would move the code-pointer more than one full cache-line in distance.

Or, to put that another way, this reads to me like the probabilistic equivalent of a compiler doing dead code elimination on unconnected basic blocks. The L1 predictor is marking L1 cache lines as “dead” (i.e. LRU) when no recently-visited L1 cache line branch-predicts into them.

BeeOnRope 2606 days ago

I was also confused by this, but my reading is that this is entirely about branch prediction and has nothing to do with caching. In that context, L1 and L2 simply refer to "first" and "second" level branch prediction strategies, and are not related to the L1 and L2 caches (in the same way that the L1 and L2 BTB and the L1 and L2 TLB are not related to the L1 and L2 caches).

The way this works is that there is a fast predictor (L1) that can make a prediction every cycle, or at worst every two cycles, which initially steers the front end. At the same time, the slow (L2) predictor is also working on a prediction, but it takes longer: it either has a throughput limit (e.g., one prediction every 4 cycles) or a long latency (e.g., it takes 4 cycles from the last update to make a new one). If the slow predictor ends up disagreeing with the fast one, the front end is "re-steered", i.e., repointed to the new path predicted by the slow predictor.
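To make the fast/slow idea concrete, here is a toy sketch in Python. It is not AMD's design: the fast predictor is a plain table of 2-bit counters, and the slow one is a gshare-style table standing in for TAGE. All class names, table sizes, and the `simulate` helper are made up for illustration.

```python
def _bump(counter, taken):
    """Saturating 2-bit counter update (0..3)."""
    return min(counter + 1, 3) if taken else max(counter - 1, 0)


class FastPredictor:
    """Hypothetical 'L1' predictor: per-branch 2-bit counters, no history."""

    def __init__(self, size=64):
        self.size = size
        self.table = [1] * size  # start weakly not-taken

    def predict(self, pc):
        return self.table[pc % self.size] >= 2

    def update(self, pc, taken):
        i = pc % self.size
        self.table[i] = _bump(self.table[i], taken)


class SlowPredictor:
    """Hypothetical 'L2' predictor: gshare-style, indexed by PC xor global history.

    This is a simple stand-in for a history-based predictor such as TAGE.
    """

    def __init__(self, size=1024, history_bits=8):
        self.size = size
        self.mask = (1 << history_bits) - 1
        self.table = [1] * size
        self.history = 0

    def _index(self, pc):
        return (pc ^ self.history) % self.size

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = _bump(self.table[i], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def simulate(trace):
    """Run (pc, taken) pairs through both predictors; the slow one overrides."""
    fast, slow = FastPredictor(), SlowPredictor()
    stats = {"resteers": 0, "mispredicts": 0, "fast_only_mispredicts": 0}
    for pc, taken in trace:
        fast_guess = fast.predict(pc)  # steers the front end first
        slow_guess = slow.predict(pc)  # arrives a few cycles later
        if slow_guess != fast_guess:
            stats["resteers"] += 1     # front end is repointed
        if slow_guess != taken:
            stats["mispredicts"] += 1  # final (overridden) prediction was wrong
        if fast_guess != taken:
            stats["fast_only_mispredicts"] += 1
        fast.update(pc, taken)
        slow.update(pc, taken)
    return stats


# A branch that alternates taken / not-taken defeats the history-free fast
# predictor, while the history-based slow predictor learns the pattern.
trace = [(0x40, i % 2 == 0) for i in range(200)]
stats = simulate(trace)
```

On this trace the history-based predictor ends up with far fewer final mispredictions than the fast one would have alone, at the price of many re-steers.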

A re-steer costs only a few cycles, so it is much cheaper than a branch misprediction: the new instructions haven't started executing yet, so the bubble may be entirely hidden, especially if IPC isn't close to the max (as it usually is not).
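A back-of-the-envelope way to see why the trade can be worth it, with entirely made-up numbers: assume a re-steer bubble of 4 cycles and a full misprediction flush of 20 cycles. Neither figure comes from AMD or Intel documentation, and the `penalty_cycles` helper is hypothetical.

```python
def penalty_cycles(resteers, mispredicts, resteer_bubble=4, flush_penalty=20):
    """Total front-end cycles lost; both per-event costs are assumed values."""
    return resteers * resteer_bubble + mispredicts * flush_penalty


# Hypothetical workload: the fast predictor alone mispredicts 100 branches.
fast_only = penalty_cycles(resteers=0, mispredicts=100)

# With the slow predictor overriding: 80 re-steers, 30 remaining mispredicts.
with_override = penalty_cycles(resteers=80, mispredicts=30)

print(fast_only, with_override)  # 2000 920
```

This treats every re-steer bubble as fully exposed, so it is pessimistic for the override case; bubbles that are hidden would cost even less.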

Just a guess though - performance counter events indicate that Intel may use a similar fast/slow mechanism.