by FloatArtifact
475 days ago
> Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don’t have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

"Memory bandwidth usage should be limited to the 37B active parameters."

Can someone do a deep dive on the quote above? I understand that having the entire model loaded into RAM helps with response times. However, I don't quite understand how memory bandwidth relates to active parameters. Is it about the context window? Or is it about how much of the model can actively be processed, given the memory bandwidth, even though the whole model is loaded into memory?
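For what it's worth, one way to read the quoted claim is as a back-of-envelope estimate: if generating each token requires reading every active weight from memory once, then decode speed is bounded by bandwidth divided by the bytes of active weights. The sketch below is only that rough model. The 37B figure comes from the quote and the 671B total is DeepSeek-V3's published parameter count, but the 400 GB/s bandwidth and 1 byte per parameter are hypothetical numbers picked for illustration, and the function name is made up.

```python
def decode_tokens_per_second(bandwidth_gb_s, params_read_billion, bytes_per_param):
    """Rough upper bound on decode speed, assuming each generated token
    requires streaming `params_read_billion` weights from memory once.
    Ignores KV-cache reads, compute time, and any caching effects."""
    bytes_per_token = params_read_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Hypothetical hardware and quantization (assumptions, not from the thread).
BANDWIDTH_GB_S = 400.0   # assumed memory bandwidth
BYTES_PER_PARAM = 1.0    # assumed 8-bit weights

ACTIVE_PARAMS_B = 37.0   # active parameters per token, from the quote
TOTAL_PARAMS_B = 671.0   # total parameters (DeepSeek-V3 published figure)

# MoE: only the routed experts' weights are read for each token.
moe_bound = decode_tokens_per_second(BANDWIDTH_GB_S, ACTIVE_PARAMS_B, BYTES_PER_PARAM)

# Hypothetical dense model of the same total size: every weight is read per token.
dense_bound = decode_tokens_per_second(BANDWIDTH_GB_S, TOTAL_PARAMS_B, BYTES_PER_PARAM)

print(f"MoE bound:   {moe_bound:.2f} tokens/s")
print(f"Dense bound: {dense_bound:.2f} tokens/s")
print(f"Ratio:       {moe_bound / dense_bound:.1f}x")
```

Under this reading, memory capacity and memory bandwidth are separate constraints. All of the weights have to be resident because the set of active experts changes from token to token, but only the active subset is read for any one token. The context window is a different cost that this estimate leaves out: attention also reads the KV cache, which grows with context length.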