by fc417fc802 12 days ago

These are all good points that I agree with, but rather than seeing an intractable problem, I predict we'll see the role that GDDR would otherwise fill in this scenario replaced by a small block of HBM on the APU die. I don't know if it will ultimately end up unified or not, but either way I don't think memory segmentation is the core problem here. Simply not needing to send transfers across the narrow and slow PCIe bus would fix most of the practical problems (at least AFAIK, but I'm not an expert).

Transitioning over to wild speculation here, I think that most likely this will be treated as part of an absurdly large L3 (à la 3D V-Cache) or as an additional L4. In either case I expect the latency and power tradeoffs introduced to be tolerated as "good enough" even for the highest-end consumer gear. (Actually, I wonder if some sort of special-case cache would be feasible, with memory addresses flagged by the graphics driver and regular CPU-related stuff skipping over it entirely. But by then we've squarely entered the territory of vaguely unhinged rambling on my part.)

Alternatively, if the performance caveats are deemed important enough to justify the added complexity, it wouldn't surprise me to see the HBM treated as an independent memory pool analogous to that of a dGPU. That wouldn't change the status quo with respect to the GPU APIs, but it would significantly ameliorate the memory bandwidth bottleneck for inference workloads, and from a software perspective it is a drop-in replacement. You'd still write the code targeting the dGPU with explicit swapping to RAM, but when run on an appropriate APU it would get a massive speedup for free instead of suddenly being starved for bandwidth while also performing unnecessary copy operations.
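
To put rough numbers on the bandwidth argument, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names are made up, the bandwidth figures are round ballpark values (roughly a PCIe 4.0 x16 link, dual-channel DDR5, and a single HBM3-class stack), and the 8 GB model size is arbitrary. It also assumes inference is purely memory-bound, with every weight read once per generated token, so the result is an upper bound and not a benchmark.

```python
# Back-of-the-envelope only: all figures below are assumed round numbers,
# not measurements of any real product.

def transfer_seconds(num_bytes: float, bandwidth_gb_s: float) -> float:
    """Time to move num_bytes over a link with the given bandwidth (GB/s)."""
    return num_bytes / (bandwidth_gb_s * 1e9)


def tokens_per_second_upper_bound(model_bytes: float, bandwidth_gb_s: float) -> float:
    """Memory-bound ceiling: every weight is read once per generated token."""
    return (bandwidth_gb_s * 1e9) / model_bytes


# Hypothetical bandwidths in GB/s (illustrative assumptions).
ASSUMED_BANDWIDTH_GB_S = {
    "PCIe 4.0 x16 (approx.)": 32,
    "dual-channel DDR5 (approx.)": 90,
    "one HBM3-class stack (approx.)": 800,
}

MODEL_BYTES = 8e9  # assumed 8 GB of weights

if __name__ == "__main__":
    for name, bw in ASSUMED_BANDWIDTH_GB_S.items():
        copy_time = transfer_seconds(MODEL_BYTES, bw)
        ceiling = tokens_per_second_upper_bound(MODEL_BYTES, bw)
        print(f"{name}: full copy {copy_time:.3f} s, <= {ceiling:.1f} tokens/s")
```

Under these assumptions the ceiling scales linearly with bandwidth, which is the whole point: code written for a dGPU-style separate pool would not need to change, but the pool it targets would sit behind a much wider pipe than system RAM or PCIe.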