| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by eldenring 548 days ago
	They're not new in the same way Attention wasn't new when the transformer paper was written. No one (publically) had really pushed any of these techniques far, especially not for such a big run.

1 comments

whimsicalism 548 days ago

no one publicly pushes any techniques very far except for meta and it’s true they continue to train dense models for whatever reason.

the transformer was an entirely new architecture, very different step change than this

e: and alibaba

link

leetharris 548 days ago

They likely continue to train dense models because they are far easier to fine tune and this is a huge use case for the Llama models

link

whimsicalism 548 days ago

It probably also has to do with their internal infra. If it were just about dense models being easier for the OSS community to use & build on, they should probably be training MoEs and then distilling to dense.

link