Hacker News new | ask | show | jobs
by Torkel 664 days ago
Multimodal models are not limited by semantic understanding though, right?

They are given photos, video, audio.

1 comments

For them, it is all a stream of tokens, roughly speaking.