by heyjamesknight
236 days ago
The multimodal architectures I’ve seen still use text as the layer between modalities, and the image embedding and text embedding are kept completely separate. That’s not like your brain, where single neurons are used in all sorts of things. Yes, they can generate images from images, but that doesn’t mean you’ll get anything meaningful without human instruction on top. Yes, serialized one-dimensional strings can encode anything. But that’s just the message content. If I wrote down my genetic sequence on a piece of paper and dropped it in a bottle in the sea, I wouldn’t need to worry about accidentally fathering any children.
|
Anything in the universe can be encoded this way. Every possible form, whether visual, auditory, physical, or abstract, can be represented as a series of numbers or symbols. With enough data, an LLM can be trained on any of it. LLMs are universal because their architecture doesn’t depend on the nature of the data, only on the consistency of patterns within it. The so-called semantic encoding is simply the internal coordinate system the model builds to organize and decode meaning from those encodings. It is not limited to language; it is a general representation of structure and relationship.
And the genome-in-a-bottle example actually supports this. The DNA string does encode a living organism; it just needs the right decoding environment. LLMs serve that role for their training domains. With the right bridge, like a diffusion model or a VAE, a text latent can unfold into an image distribution that’s statistically consistent with real light data.
So the meaning isn’t in the words. It’s in the shape of the data.
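To make the "anything can be encoded as a series of numbers" point concrete, here is a minimal sketch using only the Python standard library. The toy data (a 2x2 grayscale image, four PCM audio samples) and the helper name `to_tokens` are my own illustrative assumptions, not anything a real model uses.

```python
import struct


def to_tokens(data: bytes) -> list[int]:
    """Map raw bytes to integer tokens in the range 0..255 (hypothetical helper)."""
    return list(data)


# Three unrelated modalities, all made-up toy data.
text = "hello".encode("utf-8")
image = bytes([0, 128, 255, 64])               # 2x2 grayscale image, row-major
audio = struct.pack("<4h", 0, 1000, -1000, 0)  # four little-endian 16-bit PCM samples

sequences = {
    "text": to_tokens(text),
    "image": to_tokens(image),
    "audio": to_tokens(audio),
}

# Every modality ends up as the same kind of object: a 1-D list of ints
# drawn from one shared 256-symbol vocabulary.
for name, tokens in sequences.items():
    print(name, tokens)
```

All three come out as flat integer sequences over the same vocabulary, so a sequence model could in principle consume any of them. This only shows the shared input format; whether the patterns learned from it amount to meaning is the part being argued about.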