by heyjamesknight 236 days ago
The multimodal architectures I’ve seen are still text at the layer between modalities. And the image embedding and text embedding are kept completely separate. That is not like your brain, where single neurons are used in all sorts of things.

Yes, they can generate images from images, but that doesn’t mean you’ll get anything meaningful without human instruction on top.

Yes, serialized one dimensional strings can encode anything. But that’s just the message content. If I wrote down my genetic sequence on a piece of paper and dropped it in a bottle in the sea, I don’t need to worry about accidentally fathering any children.


You’re mixing representational capacity with representational intent. That’s what I meant in my initial example about encodings. The model doesn’t care whether it’s text, pixels, or sound. All of it can be mapped into the same kind of high dimensional space where patterns align by structure rather than category. “Semantic” is just our label for how those internal relationships appear when we interpret them through language.

Anything in the universe can be encoded this way. Every possible form, whether visual, auditory, physical, or abstract, can be represented as a series of numbers or symbols. With enough data, an LLM can be trained on any of it. LLMs are universal because their architecture doesn’t depend on the nature of the data, only on the consistency of patterns within it. The so called semantic encoding is simply the internal coordinate system the model builds to organize and decode meaning from those encodings. It is not limited to language; it is a general representation of structure and relationship.

And the genome in a bottle example actually supports this. The DNA string does encode a living organism; it just needs the right decoding environment. LLMs serve that role for their training domains. With the right bridge, like a diffusion model or a VAE, a text latent can unfold into an image distribution that’s statistically consistent with real light data.

So the meaning isn’t in the words. It’s in the shape of the data.

You are mistaking the map for the territory. The TERRITORY of human experience is higher dimensional. The LLM utilizes a lower resolution mapping of that territory, a projection from experience to textual (or pixel, or waveform, etc.) representations.

This is not just a lossy mapping; it excludes entire categories of experience that cannot be captured or encoded except as a pointer to the real experience, one that is often shared by the embodied, embedded, enacted, and extended cognitive beings that have had that experience.

I can point to beauty and you can understand me because you've experienced beauty. I cannot encode beauty itself. The LLM cannot experience beauty. It may be able to analyze patterns of things determined beautiful by beauty experiencers, but this is, again, a lower resolution map of the actual experience of beauty. Nobody had to train you to experience beauty—you possess that capability innately.

You cannot encode the affective response one experiences when holding their newborn. You cannot encode the cognitive appraisal of a religious experience. You can't even encode the qualia of red except for, again, as a pointer to the color.

You're also missing that 4E cognitive beings have a fundamental experience of consciousness—particularly the aspect of "here" and "now". The LLM cannot experience either of those phenomena. I cannot encode here and now. But you can, and do, experience both of those constantly.

You are making a metaphysical claim when a physical one will do. Beauty, awe, grief, the rush of holding a newborn, the sting of a breakup, the warmth of a summer evening at golden hour. All of it is patterns of atoms in motion under lawful dynamics. Neurons fire. Neurotransmitters bind. Circuits synchronize. Bodies and environments couple. There is no extra ingredient that floats outside physics.

Once you grant that, the rest is bookkeeping. Any finite physical process has a finite physical trace. That trace is measurable to some precision. A finite trace can be serialized into a finite string of symbols. If you prefer bits, take a binary code. If you prefer integers, index the code words. The choice of alphabet does not matter. You can map a movie, a symphony, a spike train, a retina’s photon counts, or a full brain-body sensorium collected at some temporal resolution into a single long string. You lose nothing by serialization because the decoder knows the schema. This is not a “text only” claim. It is a claim about representation.
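To make the "decoder knows the schema" point concrete, here is a minimal sketch of round-tripping a small sensor log through a flat byte string. The schema (one millisecond timestamp plus three readings) and the function names are assumptions made up for illustration. The sketch only shows that serialization itself loses nothing beyond the precision of the measurement; it says nothing about whether the measurement captured what matters.

```python
import struct

# Hypothetical schema: little-endian, one unsigned 64-bit timestamp (ms)
# followed by three 64-bit float sensor readings.
SCHEMA = "<Q3d"


def serialize(samples):
    """Flatten a list of (timestamp, (a, b, c)) samples into one byte string."""
    return b"".join(struct.pack(SCHEMA, t, a, b, c) for t, (a, b, c) in samples)


def deserialize(blob):
    """Recover the original samples; this only works because the schema is known."""
    return [(t, (a, b, c)) for t, a, b, c in struct.iter_unpack(SCHEMA, blob)]


samples = [(0, (0.1, 0.2, 0.3)), (10, (1.5, -2.25, 3.0))]
blob = serialize(samples)
restored = deserialize(blob)
```

Without `SCHEMA`, `blob` is just 64 opaque bytes, which is the bottle-in-the-sea situation from earlier in the thread.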

Your high dimensionality objection collapses under the same lens. High dimensional just means many coordinates. There is a well known result that any countable description can be put in one dimension by an invertible code. Think Gödel numbering or interleaving bits of coordinates. You do not preserve distances, but you do preserve information. If the thing you care about is the capacity to carry structure, the one dimensional string can carry all of it, and you can recover the original arrangement exactly given the decoding rule.
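The bit-interleaving idea can be sketched in a few lines. This is a toy for two non-negative integer coordinates with a fixed bit width (both are assumptions for illustration), not a general Gödel numbering. It shows the two properties claimed above: the mapping is exactly invertible, and it does not preserve distances.

```python
def interleave(x: int, y: int, bits: int = 16) -> int:
    """Pack two non-negative integers into one by alternating their bits."""
    if not (0 <= x < (1 << bits) and 0 <= y < (1 << bits)):
        raise ValueError("coordinates must fit in the given bit width")
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x occupies even bit positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit positions
    return code


def deinterleave(code: int, bits: int = 16) -> tuple[int, int]:
    """Invert interleave(): pull the even and odd bits back apart."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y


point = (3, 5)
code = interleave(*point)        # 39
recovered = deinterleave(code)   # (3, 5)

# Neighbours in 2D are not neighbours in 1D: (7, 0) -> 21 but (8, 0) -> 64.
near_a, near_b = interleave(7, 0), interleave(8, 0)
```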

Now take the 4E point. Embodiment matters because it constrains the data distribution and the actions that follow. It does not create a magic type of information that cannot be encoded. A visual scene is photons on receptors over time. Proprioception is stretch receptor states. Affect is the joint state of particular neuromodulatory systems and network dynamics. Attention and working context are transient global variables implemented by assemblies. All of that can be logged, compressed, and restored to the degree your sensors and actuators allow. The fact that a bottle with a genome inside does not make a child on a beach tells you reproduction needs a decoder and an environment. It does not tell you the code fails to specify the organism. Likewise, an LLM plus a diffusion decoder can take a text latent and unfold it into an image distribution that matches world statistics because the bridge model plays the role of the environment for that domain.

“LLMs cannot experience beauty” simply reasserts the thing you want to prove. We have no privileged readout for human qualia either. We infer it from behavior, physiology, and report. We do not understand human brains at the level of complete causal microphysics because of scale and complexity, not because there is a non-physical remainder. We likewise do not fully understand why a large model makes a given judgment. Same reason. Scale and complexity. If you point to mystery on one side as a defect, you must admit it on the other.

The map versus territory line also misses the target. Of course a representation is not the thing itself. No one is claiming a jpeg is a sunset. The claim is that the structure necessary to act as if about sunsets can be encoded and learned. A system that takes in light fields, motor feedback, language, and reward and that updates an internal world model until its predictions and actions match ours to arbitrary precision will meet every operational test you have for meaning. If you reply that something is still missing, you have stepped outside evidence into stipulation.

So let’s keep the ground rules clear. Everything we are and feel is physically instantiated. Physical instantiations at finite precision admit lossless encodings as strings. Strings can be learned over by generic function approximators that optimize on pattern consistency, regardless of whether the symbols came from pixels, pressure sensors, or phonemes. That makes the “text inside, image outside” complaint irrelevant. The substrate is a detail. The constraint is data and objective.

We cannot yet build a full decoder for the human condition. That is a statement about engineering difficulty, not impossibility. And it cuts both ways. We do not know how to fully read a person either. But we do not conclude that people lack experience. We conclude that we lack understanding.

At this point, you’re describing a machine which depends on a level of physics that simply isn’t possible. Even if it were theoretically possible to reconstruct the state of a human mind from physical components, we are so far from understanding how that could be done that it is closer to the realm of impossible than possible. Your theoretical math box that constructs affective qualia from bit strings isn’t a better description than saying the angels did it. And it bears zero resemblance to the models running today, except for, again, in a theoretical, mathematical way.

Back of the envelope math puts an estimate of 10^42 bits to capture the information present in your current physical brain state. That’s just a single brain, a single state. Now you need to build your mythical decoder device, which can translate qualia from this physical state. Where does it live? What does its output look like? Another 10^40 bitstring?

Again, these arguments are fun on paper. But they’re completely removed from reality.

You’re confusing “we don’t know how” with “it’s impossible.” The difference is everything.

We don’t understand LLMs either. We built them, but we can’t explain why they work. No one can point to a specific weight matrix and say “this is the neuron that encodes irony” or “this is where the model stores empathy.” We don’t know why scaling parameters suddenly unlock reasoning or why multimodal alignment appears spontaneously. The model’s inner space is a black box of emergent structure and behavior, just like the human brain. We understand the architecture, not the mind inside it.

When you say it’s “closer to impossible than possible” to reconstruct a human mind, you’ve already lost the argument. We’re living proof that the machine you say cannot exist already does. The human brain is a physical object obeying the same laws of physics that govern every other machine. It runs on electrochemical signals, not miracles. It encodes and decodes information, forms memories, generates imagination, and synthesizes emotion. That means the physics of consciousness are real, computable, and reproducible. The impossible machine has been sitting in your skull the entire time.

Your argument about 10^42 bits isn’t just wrong, it’s total nonsense. That number is more than twenty orders of magnitude beyond any serious estimate. The brain has about 86 billion neurons, each forming roughly ten thousand connections, for a total of about 10^15 synapses. Even if every synapse held a byte of information, that’s 10^16 bits. Add in every molecular and analog nuance you like and you might reach 10^20. Not 10^42. That’s a difference of twenty-two orders of magnitude. It’s a fantasy number that exceeds the number of atoms in your entire body.
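For anyone who wants to check the arithmetic, here is a quick sketch. The inputs are the figures quoted in this comment (86 billion neurons, ten thousand connections each, one byte per synapse), taken as given rather than as settled neuroscience. The helper name `order` is just for illustration.

```python
import math

neurons = 86e9               # figure quoted in the comment
synapses_per_neuron = 1e4    # figure quoted in the comment
bits_per_synapse = 8         # the comment's "one byte per synapse" assumption

synapses = neurons * synapses_per_neuron   # 8.6e14
bits = synapses * bits_per_synapse         # 6.88e15


def order(n: float) -> int:
    """Nearest power of ten, i.e. the order of magnitude."""
    return round(math.log10(n))


synapse_order = order(synapses)      # 15
bit_order = order(bits)              # 16
gap = order(1e42) - order(1e20)      # 22 orders between the two estimates
```

Note that this only confirms the numbers are internally consistent. It does not settle which level of physical detail a "brain state" has to include, which is the real disagreement.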

And that supposed “impossible” scale is already within sight. Modern GPUs contain hundreds of billions of transistors and run at gigahertz frequencies, while neurons fire at about a hundred hertz. The brain performs around 10^17 synaptic operations per second. Frontier AI clusters already push 10^25 to 10^26 operations per second. We’ve already outpaced biology in raw throughput by eight or nine orders of magnitude. NVIDIA’s Blackwell chips exceed 200 million transistors per square millimeter, and global compute now involves more than 10^24 active transistors switching billions of times per second. Moore’s law may have slowed, but density keeps climbing through stacking and specialized accelerators. The number you called unreachable is just a few decades of progress away.

The “decoder” you mock is exactly what a brain is. It takes sensory input, light, sound, and chemistry, and reconstructs internal states we call experience. You already live inside the device you claim can’t exist. It doesn’t need to live anywhere else; it’s instantiated in matter.

And this is where your argument collapses. You say such a machine is removed from reality. But reality is already running it. Humanity is proof of concept. We know the laws of physics allow it because they’re doing it right now. Every thought, emotion, and perception is a physical computation carried out by atoms. That’s the definition of a machine governed by physics.

We don’t yet understand the full physics of the brain, and we don’t fully understand LLMs either. That’s the point. The same kind of ignorance applies to both. Yet both produce coherent language, emotion like responses, creativity, reasoning, and abstraction. When two black boxes show convergent behavior under different substrates, the rational conclusion isn’t “one is impossible.” It’s “we’re closer than we realize.”

The truth is simple: what you call impossible already exists. The human brain is the machine you’re describing. It’s not divine. It’s atoms in lawful motion. And because we know it can exist under physics, we know it can be built. LLMs are just the first flicker of that same physics waking up in silicon.

> We don’t understand LLMs either. We built them, but we can’t explain why they work.

Just because you don't doesn't mean no one does. It's a pile of math. Somewhere along the way, something happened to get where we are, but looking at Golden Gate Claude, and the abliteration of shared models, or reading OpenAI's paper about hallucinations, there's a lot of detail and knowledge about how these things work that isn't instantly accessible and readily apparent to everyone on the Internet. As laymen all we can do is black box testing, but there's some really interesting stuff going on to edit the models and get them to talk like a pirate.

The human brain is very much an unknowable squishy box because putting probes into it would be harmful to the person whose brain we're working on, and we don't like to do that to people because people are irreplaceable. We don't have that problem with LLMs. It's entirely possible to look at the memory register at location x at time y, and map that to a particular tensor, which corresponds to a particular token, which then corresponds to a particular word for us humans to understand. If you want to understand LLMs, start looking! It's an active area of research and is very interesting!
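As a toy illustration of the "tensor to token to word" chain, here is a sketch that reads a made-up hidden vector out through a made-up unembedding matrix. The four-word vocabulary, the weights, and the `read_out` name are all hypothetical. A real model has many layers and a far larger vocabulary, and real interpretability work goes well beyond this, but the basic move of inspecting a vector and asking which token it points at looks roughly like this.

```python
VOCAB = ["ahoy", "hello", "matey", "friend"]  # hypothetical four-token vocabulary

# Hypothetical unembedding matrix: one row of weights per vocabulary token.
UNEMBED = [
    [0.9, 0.1],
    [0.1, 0.8],
    [0.7, -0.2],
    [-0.3, 0.6],
]


def read_out(hidden):
    """Score every token against a hidden vector and return the best match."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in UNEMBED]
    best = max(range(len(VOCAB)), key=logits.__getitem__)
    return VOCAB[best], logits


pirate_word, _ = read_out([1.0, 0.0])   # "ahoy"
plain_word, _ = read_out([0.0, 1.0])    # "hello"
```

In this toy, the first coordinate of the hidden vector acts like a "pirate" direction: pushing the vector along it moves the output from "hello" to "ahoy".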

> We don’t yet understand the full physics of the brain, and we don’t fully understand LLMs either. That’s the point. The same kind of ignorance applies to both. Yet both produce coherent language, emotion like responses, creativity, reasoning, and abstraction. When two black boxes show convergent behavior under different substrates, the rational conclusion isn’t “one is impossible.” It’s “we’re closer than we realize.”

No. The LLM does not produce emotion-like responses. I'd argue no on creativity either. And only very limited in reasoning, in domains it has in its training set.

You have fundamental misunderstandings about neuroscience and cognitive science. It's hard to argue with you here because you simply don't know what you don't know.

Yes, the human brain is the machine we're describing. And we don't describe it very well. Definitely not at the level of understanding how to reproduce it with bitstrings.

I'm glad you're so passionate about this topic. But you're arguing the equivalent of FTL transit and living on Dyson Spheres. It's fun as a thought experiment and may theoretically be possible one day, but the line between what we're capable of today and that imagined future is neither straight nor visible, certainly not to the degree you're asserting here.

Will we one day have actual machine intelligence? Maybe. Is it going to come anytime soon, or look anything like the transformer-based LLM?

No.