| HN Mirror

by benlivengood 58 days ago

I don't think the grokking paper is a great argument for the difference between weights and meat. E.g. https://en.wikipedia.org/wiki/Cortical_Labs learning to play Pong.

The tokenizer is, at best, a sensory mechanism, as evidenced by 1) the random generation of the tokenization scheme, and 2) the fact that vastly different tokenization schemes produce virtually identical behavior. It'd be like if Noah Webster threw a bunch of movable type into a bucket (breaking some words in half) and then drew randomly to make the first English dictionary.

EDIT: I was too cavalier with the comparison of tokenizer to sensory modality. My ultimate point is that direct byte-to-token transformers can achieve similar overall performance, which to me makes a weights-to-meat comparison pretty straightforward. But the particular tokenizer in use certainly has a large impact on both efficiency and accuracy on specific problems (e.g. digit representation).
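
A rough sketch of the contrast, for illustration only: `byte_tokens` treats raw UTF-8 bytes as token ids, while `greedy_tokens` does longest-match lookup against a made-up vocabulary. Both function names and the toy vocabulary are hypothetical and do not correspond to any real tokenizer; the point is only that the same digits can land in different pieces under a subword scheme.

```python
def byte_tokens(text):
    """Byte-level scheme: token ids are the UTF-8 bytes (vocabulary of 256)."""
    return list(text.encode("utf-8"))


def greedy_tokens(text, vocab):
    """Toy subword scheme: greedy longest match against `vocab`,
    falling back to single characters when nothing longer matches."""
    max_len = max((len(piece) for piece in vocab), default=1)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                pieces.append(piece)
                i += size
                break
    return pieces


# Hypothetical vocabulary, chosen only to show uneven digit chunking.
toy_vocab = {"123", "45", "12"}

print(byte_tokens("12345"))               # one id per digit
print(greedy_tokens("12345", toy_vocab))  # digits grouped as "123" + "45"
print(greedy_tokens("2345", toy_vocab))   # same digits, different grouping
```

With bytes, every digit always gets its own id; with the toy vocabulary, the grouping of a digit depends on its neighbors.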

2 comments

noosphr 58 days ago

I'm kind of stunned that someone is using my work to tell me I'm wrong. I wrote the code for the dish brain pong and encoding information was a huge part of what that experiment was about.

So when I say that the grok paper and the pong paper fundamentally agree I have some idea of what I'm talking about.

anon84873628 58 days ago

If you're going to claim the tokenizer is a dictionary then it doesn't really matter what paper you wrote code for.

benlivengood 58 days ago

I might have misunderstood the point you are making. I read the original article as "weights are like meat", and so I'm confused by what you consider fractally wrong.

noosphr 58 days ago

The point is that when the rules the model learns are simple enough, they stop being spread out over all the layers and become as easily interpretable as any expert system.

It's just that the rules we feed into the model are extremely poorly defined, so we end up with a soup of disjoint rules smeared all across the weights.

This isn't a feature of the models. It's a feature of the training set.

Being shocked that you can store rules in floating point numbers is the same as being shocked that you can store rules in integers. It's been nearly a century since Goedel Numbering was invented; we should be used to it by now.
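
For anyone who hasn't seen it, a minimal sketch of the idea, assuming the classic prime-power encoding (the sequence s1, s2, s3, ... becomes 2^s1 * 3^s2 * 5^s3 * ...). The function names are hypothetical and the code is an illustration, not taken from any of the papers discussed here.

```python
def first_primes(n):
    """Return the first n primes by trial division (fine for short sequences)."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def godel_encode(symbols):
    """Pack a sequence of positive integers into a single integer."""
    if any(s < 1 for s in symbols):
        raise ValueError("symbols must be positive so no prime is skipped")
    number = 1
    for p, s in zip(first_primes(len(symbols)), symbols):
        number *= p ** s
    return number


def godel_decode(number):
    """Recover the sequence by reading off prime exponents in order."""
    symbols = []
    candidate = 2
    while number > 1:
        exponent = 0
        # Smaller factors are already divided out, so any candidate
        # that still divides `number` must be prime.
        while number % candidate == 0:
            number //= candidate
            exponent += 1
        if exponent:
            symbols.append(exponent)
        candidate += 1
    return symbols


code = godel_encode([3, 1, 2])   # 2**3 * 3**1 * 5**2
print(code, godel_decode(code))
```

The whole sequence lives in one integer and comes back out unchanged, which is all the "rules stored in numbers" claim needs.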

simonh 58 days ago

Right, but all of that is still in the weights. The point of the article/joke isn’t literally that there is no grammar, it’s that there is no grammar separate from the weights. It’s all in the weights. And yes, it’s absurd. It’s a joke, but a thought-provoking one.

throwaway173738 58 days ago

So basically there are rules, we just can’t articulate them and so we can’t decode them from the weights. The Goedel Numbering metaphor is pretty appealing to me. You can represent any finite series of real numbers with a series of computations performed on some other finite series of real numbers. We just happen to be using matrices because the math is easy to parallelize. The trick is to realize that when you know the sequence you have and the sequence you want, you can work out the computation that maps one to the other. If you constrain the computation to matrix multiplication only, you arrive at the scheme we have.
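
A minimal sketch of that last step, under a strong simplifying assumption: a single matrix and a single input/output pair, not a trained network. Given a nonzero input `x` and a target `y`, the rank-one matrix W = y x^T / (x . x) satisfies W x = y. The helper names `rank_one_map` and `matvec` are hypothetical.

```python
def rank_one_map(x, y):
    """Build a matrix W (list of rows) such that W applied to x gives y.

    Uses W = y x^T / (x . x), which needs at least one nonzero entry in x.
    """
    norm_sq = sum(v * v for v in x)
    if norm_sq == 0:
        raise ValueError("x must contain at least one nonzero entry")
    return [[yi * xj / norm_sq for xj in x] for yi in y]


def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


have = [1.0, 2.0]          # the sequence you have
want = [3.0, -1.0, 0.5]    # the sequence you want
W = rank_one_map(have, want)
print(matvec(W, have))     # approximately equal to `want`
```

Note the guard: an empty or all-zero input cannot be mapped to a nonzero target by matrix multiplication alone.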

teiferer 58 days ago

> You can represent any finite series of real numbers with a series of computations performed on some other finite series of real numbers.

That statement caught my eye. It's either trivially true or quite clearly wrong, depending on how you mean it.

In the literal meaning it's true. Given any finite set of real numbers, I can easily produce a different set (like taking the original set and adding a number which wasn't in there like one plus the largest or so) from which you can trivially produce the original set computationally.

But if you mean you give me both sets then that can't be true. For example if you give me a single real number as set A and the empty set as set B then I can't create a program which generates set A from set B. Your real number in set A could encode anything.

skydhash 58 days ago

> For example if you give me a single real number as set A and the empty set as set B then I can't create a program which generates set A from set B. Your real number in set A could encode anything.

And that’s why in computation theory, the set of symbols is the union of the input and output. As set B is a subset of set A, the set that governs any program from B to A has set A as its domain.

throwaway173738 58 days ago

Sorry I’m not a mathematician but just grug brain and try to make number speak from memory.

ufocia 58 days ago

Hubris much? I don't see a necessary contradiction in using someone's work to disprove another aspect of that same person's work.

js2 58 days ago

https://news.ycombinator.com/item?id=35079

anon84873628 58 days ago

Comparing the tokenizer to sensory processing is a great analogy. That's exactly what your visual cortex and the initial layers of the language center are doing: decoding the visual representation of text into the internal neural representation.

It's a learned mapping from one representation to another, not some semantic lookup against an exogenous source.
