|
|
|
|
|
by ekropotin
92 days ago
|
|
I’m not very well versed in this domain, but I think it’s not going to be “VRAM” (GDDR) memory, but rather “unified memory”, which is essentially RAM (some flavour of DDR5 I assume). These two types of memory has vastly different bandwidth. I’m pretty curious to see any benchmarks on inference on VRAM vs UM. |
|
So for an LLM inference is relatively slow because of that bandwidth, but you can load much bigger smarter models than you could on any consumer GPU.