by ioulaum 518 days ago
It's not actually a 600B+ model. It's a mixture of experts. The actual models are pretty small and thus don't require as much training to reach a decent point.

It's similar to how Mixtral got good performance without anywhere near OpenAI-class money or compute.
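
To make the total-vs-active distinction concrete, here is a rough sketch of how the two parameter counts diverge in a mixture of experts. It assumes equally sized experts and a router that picks a fixed number of experts per token. The function name `moe_param_counts` and all the numbers are made up for illustration; they are not taken from the paper or from any real model.

```python
def moe_param_counts(n_experts, params_per_expert, top_k, shared_params=0):
    """Return (total, active_per_token) parameter counts for a toy MoE.

    Assumes every expert is the same size and the router selects exactly
    top_k experts per token. shared_params stands in for everything that
    always runs (attention, embeddings, and so on).
    """
    if not 0 < top_k <= n_experts:
        raise ValueError("top_k must be between 1 and n_experts")
    total = shared_params + n_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


# Hypothetical configuration, chosen only to show the gap.
total, active = moe_param_counts(
    n_experts=64,
    params_per_expert=1_000_000_000,
    top_k=4,
    shared_params=2_000_000_000,
)
print(f"total: {total / 1e9:.0f}B, active per token: {active / 1e9:.0f}B")
# prints "total: 66B, active per token: 6B"
```

All of the parameters have to be stored, but each token only passes through the shared part plus the few experts the router selects, which is why the headline size overstates the compute per token.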


> It's not actually a 600B+ model. It's a mixture of experts.

Is this described in the paper, or was it inferred from the model itself?

Just curious, especially if the latter.

It's a 600B+ mixture of experts, and yes, it's described in the paper, on GitHub, etc.