My understanding is that, while all 8B parameters are loaded into memory, each token's inference step selects and uses only about 2B of them, so tokens are produced faster because less computation is needed.
Hoping someone will correct me if that's not the right mental model!
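To make that mental model concrete, here is a toy sketch in plain Python. It assumes the selection works like top-k routing over blocks of parameters ("experts"), which may not be how this particular model does it. The sizes, the `router` weights, and the `step` function are all made up for illustration: 4 blocks of 2 parameters stand in for the 8B total, and 1 block per token stands in for the 2B that are active.

```python
import random

random.seed(0)

# Hypothetical toy sizes: 4 blocks x 2 parameters = 8 loaded, 2 used per token.
NUM_EXPERTS = 4
PARAMS_PER_EXPERT = 2
TOP_K = 1

# Every block stays in memory for the whole run.
experts = [
    [random.uniform(-1, 1) for _ in range(PARAMS_PER_EXPERT)]
    for _ in range(NUM_EXPERTS)
]
# One routing weight per block (an assumption, just enough to rank them).
router = [random.uniform(-1, 1) for _ in range(NUM_EXPERTS)]


def step(x):
    """One token step: score every block, but compute with only the top-k."""
    scores = [w * x for w in router]
    chosen = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    out = 0.0
    used = 0
    for i in chosen:
        weight, bias = experts[i]
        out += weight * x + bias
        used += PARAMS_PER_EXPERT
    return out, chosen, used


total_params = NUM_EXPERTS * PARAMS_PER_EXPERT
out, chosen, used = step(0.5)
print(f"loaded: {total_params}, used this step: {used}, blocks chosen: {chosen}")
```

The point is only that memory cost tracks the total parameter count while per-token compute tracks the selected subset. A different token can pick a different block, which is why everything has to stay loaded.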
Seems like it could be somewhat useful for people who are blind or have poor eyesight.