Hacker News
by
bick_nyers
475 days ago
Just to add onto this point: you should expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow, since you need to load and unload expert weights on every token.
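A rough way to see the effect is to count how often a routed expert is missing from fast memory. The sketch below is a toy model, not a measurement of any real system: the function name `simulate_expert_misses`, the uniform random routing, and the LRU eviction policy are all assumptions made for illustration.

```python
import random
from collections import OrderedDict


def simulate_expert_misses(num_experts, experts_per_token, cache_slots,
                           num_tokens, seed=0):
    """Count expert loads into fast memory under an LRU policy.

    Hypothetical toy model: routing is uniform random per token, which
    real routers are not guaranteed to be.
    """
    rng = random.Random(seed)
    cache = OrderedDict()  # expert id -> None, ordered oldest to newest
    misses = 0
    for _ in range(num_tokens):
        # Pick the distinct experts this token is routed to.
        for expert in rng.sample(range(num_experts), experts_per_token):
            if expert in cache:
                # Already resident: mark as most recently used.
                cache.move_to_end(expert)
            else:
                # Not resident: this is a load from slower storage.
                misses += 1
                cache[expert] = None
                if len(cache) > cache_slots:
                    # Evict the least recently used expert.
                    cache.popitem(last=False)
    return misses


if __name__ == "__main__":
    # All 8 experts fit versus only 2 fit, same routing sequence.
    print(simulate_expert_misses(8, 2, cache_slots=8, num_tokens=1000))
    print(simulate_expert_misses(8, 2, cache_slots=2, num_tokens=1000))
```

When every expert fits, the only loads are the initial ones. When only a few fit, most tokens trigger at least one load, which is the per-token load/unload cost described above.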
1 comment
valine
475 days ago
Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.