by ademeure
985 days ago
As a former GPU architect, that's really interesting, thanks! I didn't realise A53's caches were strictly in-order and couldn't service hits ahead of misses; I always assumed this was something even much simpler designs were capable of.

I think complexity of verification as an argument against out-of-order is questionable, because if out-of-order resulted in a better core and a competitor did manage to build and properly verify such a core, then they would have a strong competitive advantage. But that might not be true in practice given the area/power cost.

As an aside: different GPU vendors also have different limitations when it comes to in-order vs out-of-order caches, and GPUs have the extra complexity that loads are effectively doing "gather", e.g. 32-wide warps doing a load with 32 addresses that may or may not uniquify, so a single "return" to the shader processor may be anything from 1 to 32 (or even 64) cachelines.

And GPUs get even trickier with the texture unit doing trilinear+anisotropic filtering, so a single pixel may require 32x as many inputs, and you may even get into situations where the cache isn't big enough (or doesn't have enough ways) to handle the worst case and you have to revert to in-order for certain modes, or process things at a finer granularity than entire warps! Or just do in-order for everything with huge latency FIFOs and accept the latency cost. Lots of different ways to handle this, also depending on what granularity of returns your shader processor can handle.

As you said, both modern CPUs and GPUs can't really be defined using simple labels. Gather makes things a lot harder for load pipelines so I'm not surprised Zen4 seems to still just split it into uOps, but I'm curious exactly how Intel handles it in their CPU microarchitecture. Sadly this is the kind of thing that's practically impossible to know as an outsider!
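To make the "1 to 32 (or even 64) cachelines" point concrete, here's a rough Python sketch that counts how many distinct cachelines a single 32-lane load touches. The 128-byte line size, the 4-byte access size, and the `cachelines_touched` helper are all my own assumptions for illustration, not any vendor's actual design, and the 64-line case here comes from accesses that straddle a line boundary, which is only one way to get there.

```python
def cachelines_touched(addresses, access_bytes=4, line_bytes=128):
    """Return the sorted cacheline indices touched by one warp-wide load.

    line_bytes and access_bytes are illustrative assumptions, not a
    description of any specific GPU.
    """
    lines = set()
    for addr in addresses:
        first = addr // line_bytes
        last = (addr + access_bytes - 1) // line_bytes
        # An access that straddles a line boundary touches both lines.
        lines.update(range(first, last + 1))
    return sorted(lines)


WARP = 32

# Contiguous 4-byte loads: every lane lands in the same 128-byte line.
coalesced = [4 * lane for lane in range(WARP)]

# One line per lane: nothing uniquifies.
strided = [128 * lane for lane in range(WARP)]

# Each 4-byte access starts 2 bytes before a line boundary, and lanes are
# spaced far enough apart that no two lanes share a line.
straddling = [126 + 256 * lane for lane in range(WARP)]

for name, addrs in [("coalesced", coalesced),
                    ("strided", strided),
                    ("straddling", straddling)]:
    print(name, len(cachelines_touched(addrs)))
```

Under those assumptions the three patterns hit 1, 32 and 64 lines respectively, which is the range the return path to the shader processor has to cope with for a single instruction.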