by CamperBob2
8 hours ago
Inference basically looks like this (neglecting a whole bunch of stuff):

  for t in tokens_in_context
    for p in model_weights
      do something with p*t

The expensive part is fetching each weight from memory, which is why VRAM/HBM is such a big deal. Conceptually, for a huge, dense (non-MoE) model, the inner loop might run a trillion times for every token generated.

Obviously that's not how it really works in practice, but the point is, if you are only running one prompt at a time, each weight gets fetched, applied to the token being processed, and then never touched again until the next token is processed. By contrast, when you submit a prompt to a model that's running a bunch of other people's contexts concurrently, it can reuse each weight multiple times before moving on to the next one:

  for p in model_weights
    for u in users
      for t in u's context
        do something with p*t

The same is true in an agent-heavy scenario where you have several contexts in play at once.

The worst case, in terms of energy efficiency, is a single user sitting around waiting for a single response. I don't feel like I'm explaining it well, but the core idea is that every time a weight is fetched from memory, you want to get as much work done as possible with it.
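
Here is a rough Python sketch of the two loop orders, in case counting helps. It is a toy model, not how any real inference engine is written: `count_fetches` and its parameters are made-up names, a "fetch" is just a counter increment, and it assumes a weight stays in fast memory for as long as the inner loops keep reusing it. Caches, attention, and the KV cache are all ignored.

```python
def count_fetches(num_weights, contexts, batched):
    """Return (weight_fetches, work_items) for a toy inference loop.

    contexts: list of token lists, one per concurrent user/context.
    batched:  if True, weights are the outer loop (fetch once, reuse for
              every token); if False, tokens are the outer loop (every
              weight is re-fetched for every token).
    """
    fetches = 0
    work = 0
    if batched:
        for p in range(num_weights):
            fetches += 1              # pull weight p from memory once
            for ctx in contexts:
                for t in ctx:
                    work += 1         # stand-in for "do something with p*t"
    else:
        for ctx in contexts:
            for t in ctx:
                for p in range(num_weights):
                    fetches += 1      # weight p is fetched again per token
                    work += 1
    return fetches, work


# Three hypothetical contexts holding 9 tokens in total.
contexts = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
one_at_a_time = count_fetches(1000, contexts, batched=False)
all_together = count_fetches(1000, contexts, batched=True)
print(one_at_a_time, all_together)
```

Both orders do the same amount of work (9,000 multiply steps here), but the weights-outer order needs 1,000 fetches instead of 9,000. The more tokens you have in flight per fetch, the better that ratio gets.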