Hacker News
by sigmoid10 15 days ago
How do you think the other modalities are fed into the attention layers? They are tokenized as well: that's literally what the separate image/audio encoders produce as output before it is fed into the main network. Tokenization is at its core just a tradeoff between sequence length and embedding size, so it will probably stay relevant as long as attention layers scale quadratically with sequence length.
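
To make the tradeoff concrete, here is a rough sketch. It assumes ViT-style non-overlapping square patches on a 224x224 RGB image, which is my choice for illustration and not something any particular model is confirmed to use. Real encoders also run each patch through learned layers, which is skipped here. The function names are made up.

```python
# Illustrative sketch: patch size trades sequence length against per-token size.
# All names and numbers are assumptions, not taken from any specific model.

def patch_tokens(height, width, channels, patch):
    """Return (sequence_length, raw_values_per_token) for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("patch size must divide both image dimensions")
    seq_len = (height // patch) * (width // patch)
    token_dim = patch * patch * channels
    return seq_len, token_dim


def attention_score_entries(seq_len):
    """Entries in one attention score matrix: every token attends to every token."""
    return seq_len * seq_len


if __name__ == "__main__":
    for patch in (8, 16, 32):
        n, d = patch_tokens(224, 224, 3, patch)
        print(f"patch={patch:2d}  tokens={n:4d}  values/token={d:4d}  "
              f"score entries={attention_score_entries(n):,}")
```

The total amount of raw data (tokens times values per token) stays the same in every row. Doubling the patch size gives 4x fewer tokens that are each 4x larger, and the attention score matrix shrinks by 16x, which is why the choice of tokenization matters so much while attention stays quadratic.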