Hacker News
by famouswaffles 1062 days ago
It's just scale, but scale that comes with more than an order of magnitude more expense than the Llama models. I don't see anyone training such a model and releasing it for free anytime soon.

I thought it was revealed to be fundamentally ensemblamatic in a way the others weren’t? Using “experts” I think? Seems like it would meet the bar for “secret sauce” to me
Sparse MoE models are neither new nor secret. The only reason you haven't seen much use of them for LLMs is that they typically underperform their dense counterparts by a wide margin.

Until this paper (https://arxiv.org/abs/2305.14705) indicated that they apparently benefit far more from instruction tuning than dense models do, it was mostly a "good on paper" kind of thing.

In the paper, you can see the underperformance I'm talking about.

Flan-MoE-32B (259B total) scores 25.5% on MMLU before instruction tuning and 65.4% after.

Flan 62B scores 55% before instruction tuning and 59% after.
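
To put the gap in numbers, here is the arithmetic on the MMLU figures quoted above. It is only subtraction on the quoted scores, nothing re-measured, and the "dense" label on the second model is my reading of the comparison being made here.

```python
# MMLU scores as quoted in this comment: (before, after) instruction tuning, in percent.
scores = {
    "Flan-MoE-32B (259B total)": (25.5, 65.4),
    "Flan 62B (dense, per the comparison above)": (55.0, 59.0),
}

# Gain in percentage points from instruction tuning, rounded to one decimal.
gains = {name: round(after - before, 1) for name, (before, after) in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

So the sparse model starts far behind, gains roughly ten times as many points from tuning, and ends up ahead.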

This paper came out well after GPT-4, so apparently this was indeed a secret before then.
The user I was replying to was talking about the now and future.

We also have no indication that sparse models outperform their dense counterparts, so it's scale either way.

Is there a difference here between a secret and an unknown? It may well be that some researcher / comp engineer had an idea, tried it out, realized it was incredibly powerful, implemented it for real this time and then published findings after they were sure of it?

I'm more of a mechanical-engineering-adjacent professional than a programmer and only follow AI developments loosely.

The quoted paper, yes, but the MoE concept, the layers, and the training are old.

Published as a conference paper at ICLR 2017

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean
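
For anyone who hasn't read it, the core idea of a sparsely gated layer fits in a few lines. This is a rough sketch, not the paper's implementation: it keeps the top-k gate logits and softmaxes over just those, but leaves out the gating noise and the load-balancing loss, and the names (`sparse_moe`, `gate_weights`, the toy "experts" that only scale their input) are made up for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_moe(x, gate_weights, experts, k=2):
    """Route input vector x to the top-k experts and mix their outputs."""
    # One gating logit per expert: dot product of x with that expert's gate vector.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Keep only the k highest-scoring experts; the others are never evaluated.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Normalise the gate over the selected experts only.
    probs = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for p, i in zip(probs, top):
        y = experts[i](x)
        out = [o + p * yi for o, yi in zip(out, y)]
    return out, top


# Toy setup (hypothetical): 8 experts, each just scales its input by 1..8.
random.seed(0)
dim, n_experts = 4, 8
gate_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(n_experts)]

x = [0.5, -1.0, 2.0, 0.1]
y, chosen = sparse_moe(x, gate_weights, experts, k=2)
print("experts that ran:", chosen)
```

The point is that the total parameter count covers all 8 experts, while each input only pays the compute cost of 2 of them. That is where figures like "32b (259b total)" come from.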