by WorldMaker 44 days ago
Embeddings are still mostly just vectors into n-dimensional K-means clusters. It isn't "knowing" two things are related and here's the evidence, it is guessing two things are statistically likely to be related, based on trained patterns, and running with it without evidence.

It has no "semantic understanding" as we would define it. It's just increasingly good at winning cluster lotteries because we've increased the amount of training data to incredible heights.
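A minimal sketch of what "statistically likely to be related" can look like in practice, assuming relatedness is scored by cosine similarity between embedding vectors (one common choice, not something stated above). The vectors are invented for illustration and do not come from any trained model, and `cosine_similarity` is a hypothetical helper:

```python
import math

# Toy 3-dimensional "embeddings". The numbers are made up for illustration;
# real embeddings have hundreds or thousands of learned dimensions.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "carburetor": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
cat_carb = cosine_similarity(toy_embeddings["cat"], toy_embeddings["carburetor"])

# The score only says the vectors point in similar directions;
# it carries no explanation of *why* the two words are related.
print(f"cat~dog: {cat_dog:.3f}, cat~carburetor: {cat_carb:.3f}")
```

With these toy numbers, "cat" scores closer to "dog" than to "carburetor", but the score itself is just geometry: it offers no evidence for the relationship beyond proximity.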


Can you explain how you "know" two things are related? If I ask you the similarities between a cat and a dog, is your answer based solely on an understanding of their genetic phylogeny and how those genes express traits?

Grouping vectors in concept space is exactly how you create semantic understanding. The proof is in how good they are at creating semantically valid text. The fact that it took massive amounts of data is irrelevant. That just shows how much knowledge is encoded in all our language. It takes humans a ton of training to know things too.

> is exactly how

We don't know that. It seems like great hubris to declare we know how the human brain works. You are asking me to explain how we know things and then telling me we've already figured it out in the same breath, and that's hilarious.

It doesn't take massive amounts of language data to train a baby human. It is almost entirely just: "Look. Here's a cat. Can you say cat? Cats go meow." "Over here, your aunt has a dog. Dogs go woof."

There's generally a flood of non-linguistic contextual data in such moments (sights, smells, sounds, movements, touch), but that only further underscores how different LLM training is from anything we'd consider human learning. Our memories aren't just "conceptual spaces of linguistic topics"; they are complex sensory maps where a smell can remind you of the first dog you ever met. There is so much of our human knowledge that is not and never has been encoded in most of our languages.

The fact that LLMs take massive amounts of linguistic data is relevant, because it shows how far we still have to go: we are barely scratching the surface of how the human brain seems to work. (Of which, again, we know only the barest details. Anyone who tells you they know 100% of how the human brain operates tends to be a snake oil salesman.)

We do mostly know how the brain works at this level of detail, and it is akin to Principal Component Analysis. There are only so many ways it could work, unless you believe in dualism. My question was rhetorical. All you've described with the other stuff is a "multi-modal" model (and ignoring all of the "biological pre-training" that took place through millennia of evolution). The interesting (and perhaps surprising to some people) thing is how well pure text training can compensate for the lack of other senses.

Cool attempt at an ANY% speedrun of biology and philosophy, bro. Maybe next time shoot for a higher percentage score?

"Neural Networks" are the Omegaverse of Computing and we are all poorer for it. I could elaborate, but I'm exhausted and depressed right now. The map is not the territory. The broken analogy is almost never the real thing. A stopped clock is right thousands of times per year if you just keep collecting as much data as you can.