Hacker News
by timmg 392 days ago
(If you know) how does that work?

Are the audio/video/images tokenized the same way as text and then fed in as a stream? Or is the training objective different than "predict next token"?

If the former, do you think there are limitations to "stream of tokens"? Or is that essentially how humans work? (Like I think of our input as many-dimensional. But maybe it is compressed to a stream of tokens in part of our perception layer.)

1 comment

Ask Gemini to explain how it was trained

https://g.co/gemini/share/f64c3358d9fa