Hacker News
by rahimnathwani 502 days ago
The active parts of the model may change with each token. You need the whole model in memory even though, for any single token, only a subset of those weights is used in computation. The memory efficiency comes when you run multiple sessions in parallel: different tokens activate different parts, so more of what is loaded gets used at each step.
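A rough sketch of that point, in case it helps. It assumes a mixture-of-experts style layout and uses made-up numbers: `NUM_EXPERTS`, `TOP_K`, `route`, and `fraction_used` are all hypothetical, and the router here picks experts at random rather than with a learned gate, so treat it as an illustration and not as how any particular model works.

```python
import random

NUM_EXPERTS = 8   # hypothetical: experts that must all stay resident in memory
TOP_K = 2         # hypothetical: experts actually used for any single token


def route(rng):
    """Stand-in for a learned router: choose TOP_K distinct experts for one token."""
    return set(rng.sample(range(NUM_EXPERTS), TOP_K))


def fraction_used(num_sessions, rng):
    """Fraction of resident experts touched in one decoding step,
    when num_sessions sessions each produce one token."""
    touched = set()
    for _ in range(num_sessions):
        touched |= route(rng)
    return len(touched) / NUM_EXPERTS


rng = random.Random(0)
single = fraction_used(1, rng)    # one session: only TOP_K of NUM_EXPERTS are used
batched = fraction_used(64, rng)  # many sessions: the union covers more experts

print(f"one session:  {single:.0%} of loaded experts used this step")
print(f"64 sessions:  {batched:.0%} of loaded experts used this step")
```

With one session, each step uses exactly `TOP_K / NUM_EXPERTS` of what is loaded. With many sessions in parallel, the union of their choices tends to cover most or all of the experts, so the memory you are paying for is doing work.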