|
|
|
|
|
by Paul-Craft
1101 days ago
|
|
This seems like just another way of saying that when you train an LLM on a text, its weights incorporate the tokens in that text, which is nothing really profound. I think the real magic here comes from the fact that LLMs are a specialized sort of neural network, and that neural networks are universal approximators [0]. In other words, LLMs are general learners because they are neural networks. This is also not particularly profound, except that there are mathematical proofs of the universal approximation theorem that give us insight into why it must be so. --- [0]: https://en.wikipedia.org/wiki/Universal_approximation_theore... |
|
Before transformers we built different neural network architectures for each domain. These architectures offered better inductive biases for their respective domains and thus traded off some of the expressivity for better learnability and generalization.
Nowadays the best architectures seem to be merging towards transformers. They appear to offer more generally useful inductive biases and thus a better trade-off between the three ingredients than the earlier architectures.