by bick_nyers 475 days ago
Just to add to this point: you should expect a different set of experts to be activated for every token. So if not all of the weights fit in fast memory, inference can still be quite slow, because you need to load and unload expert weights on every token.
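To put a rough number on that, here is a toy simulation. It is only a sketch under loud assumptions: routing is uniformly random, fast memory is an LRU cache holding whole experts, and the function name and parameters are made up for illustration. Real routers are not uniform, so treat the counts as a worst-case-ish intuition rather than a benchmark.

```python
import random

def simulate_expert_loads(num_tokens, num_experts, top_k, cache_slots, seed=0):
    """Count expert loads into fast memory over a run of tokens.

    Hypothetical model: each token activates top_k experts chosen uniformly
    at random, and fast memory is an LRU cache with room for cache_slots
    experts (cache_slots must be at least 1).
    """
    rng = random.Random(seed)
    cache = []  # least recently used expert first, most recent last
    loads = 0
    for _ in range(num_tokens):
        for expert in rng.sample(range(num_experts), top_k):
            if expert in cache:
                cache.remove(expert)   # hit: refresh its LRU position
            else:
                loads += 1             # miss: weights must be copied in
                if len(cache) >= cache_slots:
                    cache.pop(0)       # evict the least recently used expert
            cache.append(expert)
    return loads

# Everything fits: each expert is loaded at most once.
all_resident = simulate_expert_loads(1000, num_experts=8, top_k=2, cache_slots=8)
# Only a quarter fits: most tokens trigger at least one load.
partly_resident = simulate_expert_loads(1000, num_experts=8, top_k=2, cache_slots=2)
print(all_resident, partly_resident)
```

When every expert fits, the load count is capped at the number of experts. Once the cache is smaller than the expert pool, loads grow with the number of tokens, which is the per-token load/unload cost described above.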

It's probably better to be moving weights from fast memory to faster memory than from slow disk to fast memory.