Hacker News
by whimsicalism 502 days ago
It probably also has to do with their internal infra. If it were just about dense models being easier for the OSS community to use and build on, they should probably be training MoEs and then distilling them into dense models.