by heyjamesknight 240 days ago
You misunderstand how the multimodal piece works. The fundamental unit of encoding here is still semantic. Not the same in your mind: you don’t need to know the word for sunset to experience the sunset.

No, you misunderstand the ground truth.

The LLM doesn’t need words as input. It can output pictures from pictures. Semantic words don’t have to be part of the equation at all.

Also you have to note that serialized one-dimensional string encodings are universal. Anything on the face of the earth and the universe itself can be encoded into a string of just two characters: one and zero. That means anything can be translated to a linear series of symbols and the LLM can be trained on it. The LLM can be trained on anything.

The multimodal architectures I’ve seen are still text at the layer between modalities. And the image embedding and text embedding are kept completely separate. Not like your brain, where single neurons are used in all sorts of things.

Yes, they can generate images from images, but that doesn’t mean you’ll get anything meaningful without human instruction on top.

Yes, serialized one dimensional strings can encode anything. But that’s just the message content. If I wrote down my genetic sequence on a piece of paper and dropped it in a bottle in the sea, I don’t need to worry about accidentally fathering any children.

You’re mixing representational capacity with representational intent. That’s what I meant in my initial example about encodings. The model doesn’t care whether it’s text, pixels, or sound. All of it can be mapped into the same kind of high dimensional space where patterns align by structure rather than category. “Semantic” is just our label for how those internal relationships appear when we interpret them through language.

Anything in the universe can be encoded this way. Every possible form, whether visual, auditory, physical, or abstract, can be represented as a series of numbers or symbols. With enough data, an LLM can be trained on any of it. LLMs are universal because their architecture doesn’t depend on the nature of the data, only on the consistency of patterns within it. The so called semantic encoding is simply the internal coordinate system the model builds to organize and decode meaning from those encodings. It is not limited to language; it is a general representation of structure and relationship.

And the genome in a bottle example actually supports this. The DNA string does encode a living organism; it just needs the right decoding environment. LLMs serve that role for their training domains. With the right bridge, like a diffusion model or a VAE, a text latent can unfold into an image distribution that’s statistically consistent with real light data.

So the meaning isn’t in the words. It’s in the shape of the data.

You are mistaking the map for the territory. The TERRITORY of human experience is higher dimensional. The LLM utilizes a lower resolution mapping of that territory, a projection from experience to textual (or pixel, or waveform, etc.) representations.

This is not just a lossy mapping; it excludes entire categories of experience that cannot be captured or encoded except as a pointer to the real experience, one that is often shared by the embodied, embedded, enacted, and extended cognitive beings that have had that experience.

I can point to beauty and you can understand me because you've experienced beauty. I cannot encode beauty itself. The LLM cannot experience beauty. It may be able to analyze patterns of things determined beautiful by beauty experiencers, but this is, again, a lower resolution map of the actual experience of beauty. Nobody had to train you to experience beauty—you possess that capability innately.

You cannot encode the affective response one experiences when holding their newborn. You cannot encode the cognitive appraisal of a religious experience. You can't even encode the qualia of red except for, again, as a pointer to the color.

You're also missing that 4E cognitive beings have a fundamental experience of consciousness—particularly the aspect of "here" and "now". The LLM cannot experience either of those phenomena. I cannot encode here and now. But you can, and do, experience both of those constantly.

You are making a metaphysical claim when a physical one will do. Beauty, awe, grief, the rush of holding a newborn, the sting of a breakup, the warmth of a summer evening at golden hour. All of it is patterns of atoms in motion under lawful dynamics. Neurons fire. Neurotransmitters bind. Circuits synchronize. Bodies and environments couple. There is no extra ingredient that floats outside physics.

Once you grant that, the rest is bookkeeping. Any finite physical process has a finite physical trace. That trace is measurable to some precision. A finite trace can be serialized into a finite string of symbols. If you prefer bits, take a binary code. If you prefer integers, index the code words. The choice of alphabet does not matter. You can map a movie, a symphony, a spike train, a retina’s photon counts, or a full brain-body sensorium collected at some temporal resolution into a single long string. You lose nothing by serialization because the decoder knows the schema. This is not a “text only” claim. It is a claim about representation.
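To make the serialization point concrete, here is a minimal sketch, assuming a toy "sensor log" of equal-length frames of float readings. The names `serialize` and `deserialize` and the header layout are made up for illustration. Note the round trip is lossless only relative to the already-measured values, not to whatever physical process produced them.

```python
import struct

def serialize(frames):
    """Flatten a list of equal-length float frames into one string of 0s and 1s."""
    rows = len(frames)
    cols = len(frames[0]) if rows else 0
    # Hypothetical schema: two big-endian uint32 (rows, cols), then float64 values.
    payload = struct.pack(">II", rows, cols)
    for frame in frames:
        payload += struct.pack(f">{cols}d", *frame)
    return "".join(f"{byte:08b}" for byte in payload)

def deserialize(bits):
    """Rebuild the frames from the bit string, using the same schema."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    rows, cols = struct.unpack(">II", data[:8])
    values = struct.unpack(f">{rows * cols}d", data[8:])
    return [list(values[r * cols:(r + 1) * cols]) for r in range(rows)]

frames = [[0.1, 2.5, -3.0], [4.0, 5.5, 6.25]]
bits = serialize(frames)
restored = deserialize(bits)
```

Here `restored == frames` holds because the decoder shares the schema with the encoder; without the schema, the bit string alone does not say how to read it.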

Your high-dimensionality objection collapses under the same lens. High-dimensional just means many coordinates. There is a well-known result that any countable description can be put in one dimension by an invertible code. Think Gödel numbering or interleaving bits of coordinates. You do not preserve distances, but you do preserve information. If the thing you care about is the capacity to carry structure, the one-dimensional string can carry all of it, and you can recover the original arrangement exactly given the decoding rule.
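The bit interleaving mentioned above can be sketched in a few lines. This is an illustrative version for two non-negative integer coordinates (often called a Morton or Z-order code); the function names and the 16-bit default width are assumptions for the example.

```python
def interleave(x, y, bits=16):
    """Map a 2D point (x, y) to one integer by alternating the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z

def deinterleave(z, bits=16):
    """Invert interleave: recover (x, y) exactly from the single integer."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y

code = interleave(3, 5)        # 39
point = deinterleave(code)     # (3, 5)

# Neighbours in 2D are not necessarily neighbours in 1D:
gap = abs(interleave(8, 0) - interleave(7, 0))   # 43
```

The round trip is exact, so no information is lost, but (7, 0) and (8, 0) sit one step apart in the plane and 43 apart in the code. That is the "you do not preserve distances, but you do preserve information" trade-off in miniature.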

Now take the 4E point. Embodiment matters because it constrains the data distribution and the actions that follow. It does not create a magic type of information that cannot be encoded. A visual scene is photons on receptors over time. Proprioception is stretch receptor states. Affect is the joint state of particular neuromodulatory systems and network dynamics. Attention and working context are transient global variables implemented by assemblies. All of that can be logged, compressed, and restored to the degree your sensors and actuators allow. The fact that a bottle with a genome inside does not make a child on a beach tells you reproduction needs a decoder and an environment. It does not tell you the code fails to specify the organism. Likewise, an LLM plus a diffusion decoder can take a text latent and unfold it into an image distribution that matches world statistics because the bridge model plays the role of the environment for that domain.

“LLMs cannot experience beauty” simply reasserts the thing you want to prove. We have no privileged readout for human qualia either. We infer it from behavior, physiology, and report. We do not understand human brains at the level of complete causal microphysics because of scale and complexity, not because there is a non-physical remainder. We likewise do not fully understand why a large model makes a given judgment. Same reason. Scale and complexity. If you point to mystery on one side as a defect, you must admit it on the other.

The map versus territory line also misses the target. Of course a representation is not the thing itself. No one is claiming a jpeg is a sunset. The claim is that the structure necessary to act as if about sunsets can be encoded and learned. A system that takes in light fields, motor feedback, language, and reward and that updates an internal world model until its predictions and actions match ours to arbitrary precision will meet every operational test you have for meaning. If you reply that something is still missing, you have stepped outside evidence into stipulation.

So let’s keep the ground rules clear. Everything we are and feel is physically instantiated. Physical instantiations at finite precision admit lossless encodings as strings. Strings can be learned over by generic function approximators that optimize on pattern consistency, regardless of whether the symbols came from pixels, pressure sensors, or phonemes. That makes the “text inside, image outside” complaint irrelevant. The substrate is a detail. The constraint is data and objective.

We cannot yet build a full decoder for the human condition. That is a statement about engineering difficulty, not impossibility. And it cuts both ways. We do not know how to fully read a person either. But we do not conclude that people lack experience. We conclude that we lack understanding.

At this point, you’re describing a machine which depends on a level of physics that simply isn’t possible. Even if it were theoretically possible to reconstruct the state of a human mind from physical components, we are so far from understanding how that could be done that it is closer to the realm of impossible than possible. Your theoretical math box that constructs affective qualia from bit strings isn’t a better description than saying the angels did it. And it bears zero resemblance to the models running today, except, again, in a theoretical, mathematical way.

Back of the envelope math puts an estimate of 10^42 bits to capture the information present in your current physical brain state. That’s just a single brain, a single state. Now you need to build your mythical decoder device, which can translate qualia from this physical state. Where does it live? What does its output look like? Another 10^40-bit string?

Again, these arguments are fun on paper. But they’re completely removed from reality.