by scarmig
815 days ago
> Lately I've been wondering... is this a problem, or a strength? It probably depends. But an idea I've been playing with: because transformers have such a strong capacity for recall during inference, they might introduce a strong inductive bias toward memorization as opposed to generalization. Why bother to build a complete world model when you can just attend to the answer? The global minimum in loss (at least on the training dataset) would favor memorizing and interpolating circuits over ones that generalize well. This seems consistent with LLMs as they exist today: superhuman at recall, very mediocre at reasoning. Though, for what it's worth, existing SSMs haven't yet shown they can outperform (or even match) transformers when it comes to reasoning. If this hypothesis were true, you might expect to see grokking in state space models more quickly than in transformer models. (Even if it's hard to train transformers to generalize, superhuman recall is still incredibly valuable, and a hybrid system would likely offer the best of both worlds.)