|
|
|
|
|
by benlivengood
8 days ago
|
|
I don't think the grokking paper is a great argument for the difference between weights and meat. E.g. https://en.wikipedia.org/wiki/Cortical_Labs learning to play Pong. The tokenizer is, at best, a sensory mechanism as evidenced by 1) the random generation of the tokenization scheme, and 2) vastly different tokenization schemes produce virtually identical behavior. It'd be like if Noah Webster threw a bunch of movable type into a bucket (breaking some words in half) and then drew randomly to make the first English dictionary. EDIT; I was too cavalier with the comparison of tokenizer to sensory modality; my ultimate point is that direct byte-to-token transformers can achieve similar overall performance which to me makes a weights to meat comparison pretty straightforward, but the particular tokenizer in use certainly has a large impact on both efficiency and accuracy on specific problems (e.g. digit representation) |
|
So when I way that the grok paper and the pong paper fundamentally agree I have some idea of what I'm talking about.