Hacker News
by fspeech 217 days ago
It uses linear attention in 75% of its layers, so it is inherently cheaper to run. It is also an MoE, so the active parameter count per token is far lower than the total.
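A rough back-of-the-envelope sketch of both effects, in Python. Everything here is an assumption for illustration: the cost model (softmax attention scaling with seq_len squared, linear attention with seq_len times a fixed state size, constant factors ignored), the function names, and every number passed in. None of it comes from the model's actual configuration.

```python
def attention_cost_ratio(seq_len: int, state_dim: int,
                         linear_fraction: float = 0.75) -> float:
    """Attention cost of a hybrid stack relative to an all-softmax stack.

    Assumed cost model (hypothetical, constants ignored):
      softmax attention layer ~ seq_len ** 2
      linear attention layer  ~ seq_len * state_dim
    """
    full = seq_len ** 2
    linear = seq_len * state_dim
    hybrid = linear_fraction * linear + (1 - linear_fraction) * full
    return hybrid / full


def active_fraction(num_experts: int, experts_per_token: int) -> float:
    """Share of expert parameters touched per token in an MoE layer.

    Ignores shared (non-expert) parameters, so it is only a rough bound.
    """
    return experts_per_token / num_experts


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the trend.
    for n in (1_024, 32_768, 1_000_000):
        print(n, attention_cost_ratio(n, state_dim=128))
    print(active_fraction(num_experts=64, experts_per_token=8))
```

Under this model the attention cost ratio falls toward 0.25 as the context grows, since only the remaining 25% of layers still scale quadratically. The MoE saving is separate and multiplies on top of it.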