by timmg
392 days ago
(If you know) how does that work? Are audio, video, and images tokenized the same way as text and then fed in as a stream? Or is the training objective different from "predict the next token"? If the former, do you think the "stream of tokens" approach has limitations? Or is that essentially how humans work? (I think of our input as many-dimensional, but maybe it is compressed into a stream of tokens somewhere in our perception layer.)
https://g.co/gemini/share/f64c3358d9fa