Hacker News
by AlexCoventry 415 days ago
Mixture-of-experts (MoE) models route each token, in every MoE layer of the transformer, to a small subset of specialized feed-forward networks (the "experts"; multilayer perceptrons, basically), chosen by a score computed from the token's current representation.
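Here is a rough sketch of that routing step for a single token. It assumes top-k gating with a linear router and a softmax over the selected scores, which is a common setup but not the only one. The names `route`, `moe_layer`, `make_expert`, and `W_router` are made up for illustration, and the weights are random rather than taken from any real model.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_expert(d_model, d_hidden, rng):
    # Hypothetical expert: a two-layer feed-forward network with ReLU.
    W1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        h = [max(0.0, v) for v in matvec(W1, x)]
        return matvec(W2, h)

    return expert


def route(x, W_router, k):
    # One score per expert, derived from the token's current representation.
    scores = matvec(W_router, x)
    # Keep the k highest-scoring experts (assumed top-k gating).
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Normalize only the selected scores into mixing weights.
    weights = softmax([scores[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, W_router, experts, k=2):
    # Weighted sum of the chosen experts' outputs; other experts are never run.
    out = [0.0] * len(x)
    for idx, w in route(x, W_router, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    d_model, d_hidden, n_experts = 4, 8, 4
    experts = [make_expert(d_model, d_hidden, rng) for _ in range(n_experts)]
    W_router = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
    token = [0.1, 0.9, 0.5, -0.3]
    print(route(token, W_router, k=2))
    print(moe_layer(token, W_router, experts, k=2))
```

The point of the design is that only `k` of the experts do any work for a given token, so the per-token compute stays roughly constant even as the total number of experts (and parameters) grows.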