by noosphr 22 days ago

It's not often I see something that's fractally wrong but here we are.

There is a dictionary, it's called the tokenizer.

There are grammar rules, they are just very weak because the structure of human language is generally quite weak. When presented with languages which have strong consistent grammars the weights are very easily interpretable as a grammar: https://arxiv.org/abs/2201.02177

The point of the original short story is that the computational substrate doesn't matter when you have Turing completeness. This one seems to think that you don't need structure and interpretability just because you change substrates.

11 comments

phire 21 days ago

The tokeniser is not a dictionary. It doesn't provide definitions, or give the LLM any kind of mapping at all.

At best, it's a wordlist. It gives the LLM some idea of what humans consider to be common words. But it doesn't tell the LLM anything at all about those words. And it's not even comprehensive, many words map to multiple tokens. Nor is it exclusively words, some of those tokens are punctuation, or modifiers, or control tokens. On multimodal LLMs, some of the tokens actually represent image and audio data.

The LLM doesn't get informed about any of this up front, it has to learn what every single token means from context.

You are technically right that it's something in an LLM that's not weights, but it's not that structured. And really it's only there so the LLM can interact with the outside world.
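To make the "wordlist, not dictionary" point concrete, here is a minimal sketch of a tokenizer as a plain string-to-integer table with greedy longest-match lookup. The vocabulary and the names `VOCAB`, `encode`, and `decode` are hypothetical and not taken from any real tokenizer; real ones are far larger and built differently.

```python
# Toy vocabulary: hypothetical entries, not taken from any real tokenizer.
# Note there are no definitions anywhere, only strings and their positions.
VOCAB = ["<eos>", " ", ",", "token", "iser", "the", "word", "list", "s"]
CONTROL = {"<eos>"}  # control tokens are never matched against input text
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text):
    """Greedy longest-match encoding of text into integer token IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        candidates = [
            t for t in VOCAB if t not in CONTROL and text.startswith(t, pos)
        ]
        if not candidates:
            raise ValueError(f"no token matches at position {pos}")
        match = max(candidates, key=len)
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def decode(ids):
    """Map token IDs back to text by concatenating their strings."""
    return "".join(VOCAB[i] for i in ids)


# One word can map to several tokens, and the IDs say nothing about meaning.
print(encode("the tokeniser"))  # [5, 1, 3, 4]
print(encode("wordlists"))      # [6, 7, 8]
```

Everything the model knows about what ID 3 or ID 4 means has to come from training, since the table itself carries no semantics.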

> There are grammar rules

There is no dedicated "grammar rule" structure in the LLM or the tokeniser. It has to learn them all from context, they get encoded as part of the 80 layers of weights.

ozgung 21 days ago

I see people give too much importance to specific engineering design choices of the current generation of LLMs. The tokenizer is not an absolutely essential part of the system. It’s just an adapter for text input/output. It can be eliminated completely and the model can use bytes directly.

I think the short story captures this well. Weights (connections) are the essential and philosophically important part. They do the thinking, memory, singing etc.
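The "use bytes directly" option mentioned above can be sketched as follows. This is only an illustration, assuming the model's input IDs are simply the UTF-8 byte values (a fixed vocabulary of 256); the function names are hypothetical.

```python
def bytes_encode(text):
    """Turn text into a list of integers in 0..255 (its UTF-8 bytes)."""
    return list(text.encode("utf-8"))


def bytes_decode(ids):
    """Turn a list of byte values back into text."""
    return bytes(ids).decode("utf-8")


ids = bytes_encode("héllo")
print(ids)                # [104, 195, 169, 108, 108, 111]
print(bytes_decode(ids))  # héllo
```

No learned vocabulary is needed, but the sequences get longer and a non-ASCII character such as "é" is spread over more than one input position.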

yencabulator 21 days ago

A tokenizer is roughly and approximately Huffman-coding sequences of input (bytes of English etc) into shorter sequences (list of tokens), as a performance optimization.

As you said, it's not in any way intrinsic to the LLM, though it may be a very necessary optimization on today's hardware.
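For a rough feel of the compression view, here is a minimal sketch of byte-pair encoding (BPE), one widely used way of building a token vocabulary. It is not Huffman coding and not necessarily what any specific model uses; the function names and the sample text are assumptions for illustration.

```python
from collections import Counter


def most_common_pair(seq):
    """Return the most frequent adjacent pair in seq, or None if too short."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(seq, pair, new_id):
    """Replace each non-overlapping occurrence of pair with new_id."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Start from raw bytes and repeatedly merge the most frequent pair."""
    seq = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_common_pair(seq)
        if pair is None:
            break
        new_id = 256 + step  # IDs 0..255 are reserved for single bytes
        merges[pair] = new_id
        seq = merge(seq, pair, new_id)
    return seq, merges


text = "the cat and the hat and the bat"
seq, merges = train_bpe(text, 10)
print(len(text.encode("utf-8")), "bytes ->", len(seq), "tokens")
```

The merge table is fixed once it has been built, so applying it to new input is a deterministic procedure; the only "learning" is counting pair frequencies in the training text.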

phire 21 days ago

I wouldn't use the word necessary.

IMO, we are probably talking about a 6x slowdown (for typical English). You would need to be absolutely stupid not to implement some kind of optimisation along these lines.

Slower and maybe a little dumber, but it would work.

kgwgk 21 days ago

Not sure about “dumber” - it may be better than SOTA models at identifying which days of the week contain the letter “d”.

phire 21 days ago

True, it would be better at some tasks.

My thinking is that for most tasks, a byte-orientated LLM still needs something like the wide "single activation per word" formatting that the tokeniser mostly provides. And it will likely waste its first and last few layers implementing a replacement tokeniser (and would probably do a much better job at it). It would also need to decode and encode unicode at the same time.

My estimate is that it might lose about 10% of its weights to these new tasks. Your 80B parameter model becomes as smart as a 72B parameter model: measurably dumber, but not drastically so.

teiferer 21 days ago

> The point of the original short story is that the computational substrate doesn't matter when you have Turing completeness.

That is your takeaway from the 1991 story?

famouswaffles 21 days ago

>There are grammar rules, they are just very weak because the structure of human language is generally quite weak. When presented with languages which have strong consistent grammars the weights are very easily interpretable as a grammar: https://arxiv.org/abs/2201.02177

That paper did not train the models on 'a language with strong consistent grammars'. Mathematical operation tables are not a language. Grammar itself is a post-hoc rationalization and there's no evidence LLMs follow 'grammar rules' any more than the brain follows grammar rules. Of course, that's not to say transformers can't learn simple rules if the dataset calls for it.

danans 21 days ago

> Mathematical Operation tables are not a language.

Not a natural language, but they are certainly a language as in a symbolic representation of information.

Antibabelic 21 days ago

A language is a set of sentences.

A sentence is a finite sequence of symbols drawn from an alphabet.

In this sense, mathematical operation tables are absolutely a language. As are natural languages.

famouswaffles 21 days ago

>A language is a set of sentences. A sentence is a finite sequence of symbols drawn from an alphabet.

A language is a structured system of communication used to express arbitrary ideas between multiple parties. Math operation tables do not, and cannot, do that on their own.

That distinction matters here because we are talking about what properties the model is expected to learn. English and operation tables are fundamentally different objects, so it is not surprising that a model learns different kinds of structure from them.

dpark 21 days ago

A tokenizer is not a dictionary any more than an alphabet is a dictionary.

noosphr 21 days ago

The Chinese alphabet is very much a dictionary. All the major tokenizers are far larger.

dpark 21 days ago

That doesn’t make any sense. An alphabet is a list of valid characters. A dictionary is not just a list. Even in a language like Chinese where individual characters carry meaning, a dictionary tells you what that meaning is. It’s not just a list of characters.

Or to echo the article, the dictionary is made out of weights.

simonh 21 days ago

A list of words isn’t a dictionary. What a dictionary adds over a list of words is all the relationships between the words needed to interpret them and use them, and all of that is in the weights.

JdeBP 21 days ago

We should tell the Unix people that they've been giving /usr/share/dict the wrong name for over three decades. (-:

yencabulator 21 days ago

I mean, they did, and we have, and we've also stopped doing that.

https://en.wikipedia.org/wiki/Words_(Unix)

JdeBP 21 days ago

We should start telling them again, then. (-:

In the current versions of FreeBSD, NetBSD, DragonFlyBSD, Illumos, and Debian, it is still /usr/share/dict .

* https://cgit.freebsd.org/src/tree/share/dict/

* https://cvsweb.netbsd.org/bsdweb.cgi/src/share/dict/

* https://gitweb.dragonflybsd.org/?p=dragonfly.git;a=tree;f=sh...

* https://cvsweb.openbsd.org/src/share/dict

* https://refspecs.linuxfoundation.org/FHS_3.0/fhs/ch04s11.htm...

* https://packages.debian.org/sid/all/wbritish/filelist

Amusingly for https://en.wikipedia.org/wiki/Special:Diff/325776830 , the last place to use /usr/dict (Debian, which changed it in 1998; Berkeley having changed it in Net/2 in 1991) stopped doing so years before Wikipedia was invented.

canjobear 21 days ago

A mapping of Chinese characters to integers (like a tokenizer) would not be a dictionary. You’d also need definitions. At best it’s an index to a hypothetical dictionary.

maxbond 21 days ago

It's beside the point and so I only note it out of interest, but the Chinese writing system doesn't use an alphabet (or a syllabary like Japanese kana); it's a logography.

glitchc 21 days ago

> fractally wrong

fractally or factually? You mean wrong on so many levels you need a fractal to capture them? If so, what if you could use a neural network instead?

wavemode 21 days ago

https://rationalwiki.org/wiki/Fractal_wrongness

Windchaser 20 days ago

If it’s fractally wrong, you should be able to summarize the wrongness with one simple equation that captures and reproduces the wrongness at each level

benlivengood 22 days ago

I don't think the grokking paper is a great argument for the difference between weights and meat. E.g. https://en.wikipedia.org/wiki/Cortical_Labs learning to play Pong.

The tokenizer is, at best, a sensory mechanism as evidenced by 1) the random generation of the tokenization scheme, and 2) vastly different tokenization schemes produce virtually identical behavior. It'd be like if Noah Webster threw a bunch of movable type into a bucket (breaking some words in half) and then drew randomly to make the first English dictionary.

EDIT: I was too cavalier with the comparison of tokenizer to sensory modality; my ultimate point is that direct byte-to-token transformers can achieve similar overall performance, which to me makes a weights-to-meat comparison pretty straightforward, but the particular tokenizer in use certainly has a large impact on both efficiency and accuracy on specific problems (e.g. digit representation).

noosphr 22 days ago

I'm kind of stunned that someone is using my work to tell me I'm wrong. I wrote the code for the dish brain pong and encoding information was a huge part of what that experiment was about.

So when I say that the grok paper and the pong paper fundamentally agree I have some idea of what I'm talking about.

anon84873628 21 days ago

If you're going to claim the tokenizer is a dictionary then it doesn't really matter what paper you wrote code for.

benlivengood 21 days ago

I might have misunderstood the point you are making. I read the original article as "weights are like meat", and so I'm confused by what you consider fractally wrong.

noosphr 21 days ago

The point is that when the rules the model learns are simple enough they stop being spread out over all the layers and become as easily interpretable as any expert system.

It's just that the rules we feed in the model are extremely poorly defined and we end up with the soup of disjoint rules smeared all across the weights.

This isn't a feature of the models. It's a feature of the training set.

Being shocked that you can store rules in floating point numbers is the same as being shocked you can store rules in integers. It's been a century since Goedel Numbering was invented, we should be used to it by now.
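As a reminder of how Gödel numbering stores a structured object in a single integer, here is a minimal sketch using the classic prime-power scheme. The function names and the sample symbol codes are made up for illustration; the only assumption is that each symbol has already been assigned a positive integer code.

```python
def primes():
    """Yield 2, 3, 5, 7, ... by trial division (fine for short sequences)."""
    found = []
    n = 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1


def godel_encode(codes):
    """Encode positive integers c1, c2, ... as 2**c1 * 3**c2 * 5**c3 * ..."""
    n = 1
    for p, c in zip(primes(), codes):
        if c < 1:
            raise ValueError("symbol codes must be positive integers")
        n *= p ** c
    return n


def godel_decode(n):
    """Recover the sequence of codes by reading off prime exponents."""
    if n < 1:
        raise ValueError("not a valid encoding")
    codes = []
    for p in primes():
        if n == 1:
            break
        c = 0
        while n % p == 0:
            n //= p
            c += 1
        if c == 0:
            raise ValueError("not a valid encoding")
        codes.append(c)
    return codes


# Hypothetical symbol codes for a three-symbol "rule".
rule = [3, 1, 2]
number = godel_encode(rule)
print(number)                # 600, i.e. 2**3 * 3**1 * 5**2
print(godel_decode(number))  # [3, 1, 2]
```

The integer 600 "contains" the rule only relative to the decoding procedure, which is the same sense in which a pile of floating point weights can contain rules.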

simonh 21 days ago

Right, but all of that is still in the weights. The point of the article/joke isn’t literally that there is no grammar, it’s that there is no grammar separate from the weights. It’s all in the weights. And yes, it’s absurd. It’s a joke, but a thought provoking one.

throwaway173738 21 days ago

So basically there are rules, we just can’t articulate them and so we can’t decode them from the weights. The Goedel Numbering metaphor is pretty appealing to me. You can represent any finite series of real numbers with a series of computations performed on some other finite series of real numbers. We just happen to be using matrices because the math is easy to parallelize. The trick is to realize that when you know the sequence you have and the sequence you want then you can compute the calculations. If you constrain the calculations to only matrix multiplication then you arrive at the scheme we have.

teiferer 21 days ago

> You can represent any finite series of real numbers with a series of computations performed on some other finite series of real numbers.

That statement caught my eye. It's either trivially true or quite clearly wrong, depending on how you mean it.

In the literal meaning it's true. Given any finite set of real numbers, I can easily produce a different set (like taking the original set and adding a number which wasn't in there like one plus the largest or so) from which you can trivially produce the original set computationally.

But if you mean you give me both sets then that can't be true. For example if you give me a single real number as set A and the empty set as set B then I can't create a program which generates set A from set B. Your real number in set A could encode anything.

ufocia 21 days ago

Hubris much? I don't see a necessary contradiction in using someone's work to disprove another aspect of that same person's work.

js2 21 days ago

https://news.ycombinator.com/item?id=35079

anon84873628 21 days ago

Comparing the tokenizer to sensory processing is a great analogy. That's exactly what your visual cortex and initial layers of the language center are doing: decoding visual representation of text into the internal neural representation.

It's a learned mapping from one representation to another, not some semantic lookup against an exogenous source.

throw310822 22 days ago

> There are grammar rules

And they're made out of weights.

noosphr 21 days ago

As opposed to integers in normal programming.

The 'magic' in weights is that the rules are spread through the whole model and you can't point to one place which encodes them.

The grokking paper shows that this stops being the case with enough training data and enough compute.

throw310822 21 days ago

Integers in normal programming represent data or instructions; instructions are hand coded, have rigidly defined semantics, are not differentiable and have no redundancy.

> The 'magic' in weights is that the rules are spread through the whole model ... The grokking paper shows that this stops being the case with enough training data and enough compute.

I don't understand what you mean to say. That weights are not magic? That weights are not weights? NNs are made up of weights, which are learned and not coded. The fact that they do learn world models (grammar rules in your example), and that these models' weights tend to roughly concentrate by function and level of representation is perfectly logical but even more amazing. (Notice that much of the dismissive attitude towards LLMs depicts them as pure syntactic manipulators without the ability to develop world models- the exact opposite of what you point out).

noosphr 21 days ago

>Integers in normal programming represent data or instructions; instructions are hand coded, have rigidly defined semantics, are not differentiable and have no redundancy.

I can, and have, written programs using an evolutionary algorithm that then run on bare metal. None of the things you list are true for those programs, yet other than being computationally more expensive to train they work just as well as neural networks.

>I don't understand what you mean to say

The diffuseness of weights across the whole model isn't an innate feature of deep learning models. It is a feature of sparse training data and little compute.

throw310822 21 days ago

You're nitpicking against the line:

"The weights make the words. Are you understanding me? We opened it up. There's no dictionary in there, no grammar rules, no little man. Just weights. Eighty layers of numbers getting multiplied together."

In this context "there's no grammar rules" means "no separately hand-coded grammar rules". Everything is made up of weights, and the fact that weights that end up encoding for grammar rules tend to concentrate in particular locations (without being self-contained- there is no hard boundary) rather than uniformly diffused through the model is irrelevant to the matter. It seems you're arguing against a diffuseness requirement that is not in the text.

maxbond 21 days ago

The story is not about how they function, it's about how we relate to them.

suddenlybananas 21 days ago

The structure of human language is hardly weak!

phito 21 days ago

Also there's a brain, the GPU

anon84873628 21 days ago

Not at all. A brain is interesting because it is the computer, memory, and weights all in one. A GPU is just the calculator.

You can't move your mind to any other brain, but weights can run on any GPU.

bfung 21 days ago

And you know what the tokenizer is made of?

Weights.

jrahmy 21 days ago

A tokenizer is a deterministic string-matching program, it's not made out of weights in the same sense as a neural network itself.

bfung 21 days ago

How does one choose what sequence of bytes constitutes a token?

davrosthedalek 21 days ago

But it could be. It's just less efficient.

jrahmy 21 days ago

I don't see how. You could ask a neural network to do the tokenization I suppose, but in doing so you'd have to convert the prompt into tokens via the same deterministic process the network was trained on, essentially just moving the exact same process up one layer.