by terafo 925 days ago
Attention is shared across experts, and it's ~30% of the params here. So ~2B params are shared between experts and ~5B params are unique to each expert.
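To spell out the arithmetic, here is a rough sketch. The ~30% attention share is from the comment; the 7B nominal per-expert size (2B + 5B) and the expert count of 8 are assumptions added for illustration, and `split_params` is a hypothetical helper, not anything from a real library.

```python
def split_params(per_expert_total_b, attention_fraction, num_experts):
    """Split a nominal per-expert parameter count (in billions) into a
    shared attention part and a per-expert unique part, then estimate
    the total for the whole mixture."""
    shared = per_expert_total_b * attention_fraction   # counted once
    unique = per_expert_total_b - shared               # repeated per expert
    total = shared + num_experts * unique
    return shared, unique, total


# Assumed inputs: 7B nominal size and 8 experts are illustrative guesses;
# only the ~30% attention share comes from the comment.
shared, unique, total = split_params(7.0, 0.30, 8)
print(f"shared: {shared:.1f}B, unique per expert: {unique:.1f}B, total: {total:.1f}B")
```

Under those assumptions the total comes out well below a naive 8 x 7B = 56B, because the attention params are only counted once.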