by iknownothow
736 days ago
I'm probably wrong, but the author may have misunderstood input embeddings. Input embeddings are just dictionary lookup tables: the tokenizer generates tokens, and for each token you look up its embedding in the table. The author is speculating about an embedding model, but in reality they're speculating about the image tokenizer. If I'm not wrong, the text tokenizer Tiktoken has a dictionary size of 50k. The image tokenizer could have a very large dictionary or a very small one. The 170 tokens this image tokenizer generates might actually include repeated tokens!

EDIT: PS. What I meant to say is that input embeddings do not come from another trained model; tokens do. The input embedding matrix undergoes backpropagation (learning). This is very important, because it allows the model to move the embeddings of the tokens together or apart as it sees fit. If you use embeddings from another model as input embeddings, you're basically adding noise.
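To make the lookup-table point concrete, here is a minimal pure-Python sketch. The vocabulary size, embedding width, token ids, and gradient values are all made up for illustration; nothing here reflects the actual model being discussed. It only shows that embedding is a row lookup, and that a training step nudges the looked-up rows directly.

```python
import random

random.seed(0)

VOCAB_SIZE = 8   # hypothetical tiny vocabulary; real ones are far larger
EMBED_DIM = 4    # hypothetical embedding width

# The input embedding "matrix" is just a table: one row of floats per token id.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up one row per token id; no separate model is run here."""
    return [embedding_table[t] for t in token_ids]

def sgd_step(token_ids, grads, lr=0.1):
    """Move the looked-up rows against their gradients, as backprop would."""
    for t, g in zip(token_ids, grads):
        embedding_table[t] = [w - lr * gi for w, gi in zip(embedding_table[t], g)]

tokens = [3, 5, 3]                # repeated ids are fine; they share a row
vectors = embed(tokens)

before = list(embedding_table[3])
sgd_step([3], [[1.0] * EMBED_DIM])  # made-up gradient, for illustration only
after = embedding_table[3]
```

Only row 3 changes in the update, which is the sense in which the model is free to move individual token embeddings together or apart during training.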
But why only choose 13x13 + 1? :(
I'm willing to bet that the author's conclusion that the embeddings come from CNNs is wrong. However, I cannot get the 13x13 + 1 observation out of my head. They've definitely hit on something there. I agree with them that there is very likely a CNN involved, and I'm going to bet that the final filters and kernels are the visual vocabulary.
And how do you go from 50k convolutional kernels (think tokens) to always 170 chosen tokens for any image? I don't know...
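Purely as an illustration of how a large vocabulary and a fixed token count can coexist, here is one hypothetical scheme: each cell of a 13x13 grid keeps only the index of its strongest response, and one extra token is appended. This is a guess for the sake of the arithmetic, not a claim about how the real tokenizer works; the vocabulary size, the scores, and the meaning of the "+1" token are all assumptions.

```python
import random

random.seed(0)

GRID = 13                     # from the 13x13 observation
VOCAB_SIZE = 50               # hypothetical; stands in for a much larger visual vocabulary
EXTRA_TOKEN_ID = VOCAB_SIZE   # hypothetical id for the "+1" token, whose role is unknown

def tokenize_image(scores):
    """scores[row][col] holds VOCAB_SIZE responses for one grid cell.
    Each cell contributes the index of its strongest response."""
    ids = [
        max(range(VOCAB_SIZE), key=lambda k: cell[k])
        for row in scores
        for cell in row
    ]
    return ids + [EXTRA_TOKEN_ID]

# Fake per-cell responses standing in for the output of some vision model.
fake_scores = [
    [[random.random() for _ in range(VOCAB_SIZE)] for _ in range(GRID)]
    for _ in range(GRID)
]

token_ids = tokenize_image(fake_scores)
print(len(token_ids))  # 13 * 13 + 1 = 170
```

Under this assumption the token count depends only on the grid, not on the vocabulary size, and the same id can show up in several cells, so repeated tokens are expected.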