HN Mirror

by radq 1089 days ago

If it's similar to the Switch Transformer architecture [1], which I suspect it is, then the experts are all trained on the same corpus and the routing model learns automatically which experts to route to.

It's orthogonal to beam search - the benefit of the architecture is that it allows sparse inference.

[1] https://arxiv.org/pdf/2101.03961.pdf
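
To make "sparse inference" concrete, here is a minimal sketch of top-1 routing in the style the Switch Transformer paper describes: a small router scores every expert for each token, only the highest-scoring expert is actually run, and its output is scaled by the router probability. Everything else is an assumption for illustration: the class name `Top1MoELayer`, the random weights, the tiny dimensions, and the use of a single linear map per expert (real experts are feed-forward networks, and nothing here is known about the model being discussed).

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (illustrative only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class Top1MoELayer:
    """Hypothetical toy layer: one router plus several linear 'experts'."""

    def __init__(self, num_experts, dim, seed=0):
        rng = random.Random(seed)
        self.router = make_matrix(num_experts, dim, rng)
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]
        self.calls = [0] * num_experts  # how often each expert was run

    def forward(self, token):
        # The router produces one probability per expert for this token.
        probs = softmax(matvec(self.router, token))
        chosen = max(range(len(probs)), key=probs.__getitem__)
        # Sparse step: only the chosen expert's weights are touched.
        self.calls[chosen] += 1
        expert_out = matvec(self.experts[chosen], token)
        # Scale by the router probability, as in top-1 gating.
        return [probs[chosen] * y for y in expert_out], chosen


layer = Top1MoELayer(num_experts=8, dim=4)
rng = random.Random(1)
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(16)]
outputs = [layer.forward(t) for t in tokens]
print("expert usage:", layer.calls)
```

Each token costs one router evaluation plus one expert evaluation, no matter how many experts exist, which is where the inference savings come from. In a real model the router and the experts are trained jointly, so which tokens go to which expert is learned rather than assigned by hand.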

mrfinn 1089 days ago

So in layman's terms, does this mean that on top of a big base of knowledge (?) they trained 8 different 220B models and each model specialized in a different area, in practice like an 8-unit "brain"? PS. Thinking now about how the human brain does something similar, as our brain is split into two parts and each one specializes in different tasks.

radq 1089 days ago

Yeah, that's pretty close. It might be more precise to say they trained one big model that includes 8 "expert networks" and a mechanism to route between them, since everything is trained together.

There isn't a lot of public interpretability work on mixture-of-experts transformer models, but I'd suspect the way they specialize in tasks is going to be pretty alien to us. I would be surprised if we find that one of the expert networks is used for math, another for programming, another for poetry, etc. Going off of Anthropic's work on superposition [1], it's more likely we'll see a lot of overlap between the networks, but who really knows?

[1] https://transformer-circuits.pub/2022/toy_model/index.html

mrfinn 1089 days ago

Thank you for the explanation. I still have a hard time understanding how transformers work so amazingly well, and the tech is already quite a few steps beyond that idea.

radq 1088 days ago

Andrej Karpathy's "zero to hero" series [1] was how I learned the fundamentals of this stuff. It's especially useful because he explains the why and provides intuitive explanations instead of just talking about the what and how. Would recommend it if you haven't checked it out already.

[1] https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxb...

londons_explore 1089 days ago

They probably trained all 8 experts on the same data. The experts may have become good at different topics, but no human divided up the topics.

The output isn't just the best of the 8 experts - it is a blend of the opinions of the experts. Another (usually smaller) neural net decides how to blend together the outputs of the networks, probably on a per-token basis: for each individual token (roughly, each word), the outputs of the experts are consulted and blended together, and a word is picked (sampled) before moving on to the next one.
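
The blending described above can be sketched roughly as follows: a gate produces one score per expert for the current token, the scores are turned into weights with a softmax, and the expert outputs are combined as a weighted sum. The function name `blend_expert_outputs`, the example scores, and the `top_k` option are assumptions for illustration. Restricting the blend to the top few experts and renormalising their weights is one common sparse variant, not something confirmed for this model, and in architectures like the Switch Transformer the combination is applied to hidden vectors inside each layer rather than to finished words.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_expert_outputs(gate_scores, expert_outputs, top_k=None):
    """Weighted sum of expert outputs for one token.

    gate_scores: one raw score per expert (from the gating network).
    expert_outputs: one output vector per expert, all the same length.
    top_k: if given, only the k highest-scoring experts contribute
           (hypothetical option showing the sparse variant).
    """
    num_experts = len(gate_scores)
    if top_k is None:
        active = list(range(num_experts))
    else:
        ranked = sorted(range(num_experts),
                        key=lambda i: gate_scores[i], reverse=True)
        active = ranked[:top_k]

    # Normalise the scores of the active experts so the weights sum to 1.
    weights = softmax([gate_scores[i] for i in active])

    dim = len(expert_outputs[0])
    blended = [0.0] * dim
    for w, i in zip(weights, active):
        for d in range(dim):
            blended[d] += w * expert_outputs[i][d]
    return blended, dict(zip(active, weights))


# Made-up numbers: three experts, two-dimensional outputs.
gate_scores = [2.0, 0.0, 0.0]
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

blended_all, weights_all = blend_expert_outputs(gate_scores, expert_outputs)
blended_top1, weights_top1 = blend_expert_outputs(gate_scores, expert_outputs, top_k=1)
print("dense blend:", blended_all, weights_all)
print("top-1 only:", blended_top1, weights_top1)
```

With `top_k=None` every expert is evaluated and contributes in proportion to its weight; with a small `top_k` the unselected experts never need to be run at all, which is the sparse case.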

mrfinn 1089 days ago

I guess that neural network has to be capable of identifying the subject and knowing at every moment which network is the most capable for that subject; otherwise I can't understand how it could possibly evaluate which is the best answer.

londons_explore 1089 days ago

The results of this sort of system frequently look almost random to human eyes. For example, one expert might be the "capital letter expert", doing a really good job of putting capital letters in the right place in the output.
