Hacker News
by terafo, 925 days ago
Attention is shared across experts. It's ~30% of params here, so ~2B params are shared between experts and ~5B params are unique to each expert.
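To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes a nominal ~7B params per expert (just the ~2B + ~5B above) and, purely for illustration, 8 experts; the expert count and the function names are my assumptions, not something stated above.

```python
def split_params(nominal_per_expert, attention_fraction):
    """Split a nominal per-expert size into shared (attention) and unique parts."""
    shared = nominal_per_expert * attention_fraction
    unique = nominal_per_expert - shared
    return shared, unique


def total_params(shared, unique_per_expert, num_experts):
    """Shared params are counted once; unique params are counted once per expert."""
    return shared + num_experts * unique_per_expert


# Assumption: ~7B nominal per expert, ~30% of which is attention.
shared, unique = split_params(7e9, 0.30)

# Assumption: 8 experts (hypothetical, for illustration only).
total = total_params(shared, unique, num_experts=8)

print(f"shared: ~{shared / 1e9:.1f}B")
print(f"unique per expert: ~{unique / 1e9:.1f}B")
print(f"total with 8 experts: ~{total / 1e9:.1f}B")
```

The point is that the total is well below experts x nominal size, because the shared attention weights are only counted once.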