Hacker News new | ask | show | jobs
by CamperBob2 8 hours ago
Inference basically looks like this (neglecting a whole bunch of stuff):

    for t in tokens_in_context
        for p in model_weights
            do something with p*t
The expensive part is fetching each weight from memory, which is why VRAM/HBM is such a big deal. Conceptually, for a huge, dense (non-MoE) model, the inner loop might run a trillion times for every token generated.

Obviously that's not how it really works in practice, but the point is, if you are only running one prompt at a time, each weight gets fetched, applied to the token being processed, and then never touched again until the next token is processed.

So when you submit a prompt to a model that's running a bunch of other peoples' contexts concurrently, it can reuse each weight multiple times before moving on to the next one:

    for p in model_weights
        for u in users
           for t in u's context
              do something with p*t
The same is true in an agent-heavy scenario where you have several contexts in play at once.

Worst case, in terms of energy efficiency, is a single user sitting around waiting for a single response. I don't feel like I'm explaining it well, but the core idea is that every time a weight is fetched from memory, you want to get as much work done as possible with it.

1 comments

That makes a lot of sense, thank you. I think a pirate cloud of local models could make sense, but that would be regulated into oblivion