
by clementneo 1215 days ago

Co-author here! I'm kind of surprised that this made it to the top of HN! This was a project in which Joseph and I tried to reverse-engineer the mechanism by which GPT-2 predicts the word 'an'.

It's crazy that large language models work so well just by being trained as a next-word-prediction model over a large amount of text data. We know how image models learn to extract the features of an image through convolution[1], but how and what LLMs learn exactly remains a black box. When we dig deeper into the mechanisms that drive LLMs, we might get closer to understanding why they work so well in some senses, and why they could be catastrophic in other cases (see: the past month of search-based developments).

I find trying to understand and reverse-engineer LLMs to be a personally exciting endeavour. As LLMs get better in the near future, I sure hope our understanding of them can keep up as well!

[1] https://distill.pub/2020/circuits/zoom-in/

6 comments

HarHarVeryFunny 1214 days ago

I wonder if you could comment on this (related to the question of how far ahead these "LLM"s are planning).

This is Wharton professor Ethan Mollick playing with the new Bing chat, which seems considerably more advanced than ChatGPT (based on GPT-4 perhaps?).

Here he asks it to write something using Kurt Vonnegut's rules of writing.

https://twitter.com/emollick/status/1626084142239649792

It seems hard to explain how Bing/GPT could have generated the Vonnegut-inspired cake story, having ingested the rules, without planning the whole thing before generating the first word.

It seems there's an awful lot more going on internally in these models than a mere word by word autoregressive generation. It seems the prompt (in this case including Vonnegut's rules) is ingested and creates a complex internal state that is then responsible for the coherency and content of the output. The fact that it necessarily has to generate the output one word at a time seems to be a bit misleading in terms of understanding when the actual "output prediction" takes place.

gbasin 1214 days ago

There is "long range" dependence, it's just only on the prompt: the conversation with the user and the hidden header (e.g. "Answer as ChatGPT, an intelligent AI, state your reasons, be succinct, etc."). That ends up being enough.

HarHarVeryFunny 1214 days ago

Sure, but the point being discussed is that despite the word by word output, the output does not appear to be "chosen" on a word by word basis. OP investigated the case where the word "an" anticipates the following word ("an apple" vs "a pear").
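
One way to picture the "an apple" vs "a pear" case is as a sum over the nouns the model might say next. The sketch below is only a toy illustration with made-up probabilities; `noun_belief`, `article_for`, and `article_distribution` are hypothetical names, and nothing here is taken from GPT-2 or from the OP's analysis.

```python
# Toy illustration (assumed numbers, not measured from any model): if the
# internal state already favours certain upcoming nouns, the probability of
# the article is simply a sum over those nouns.

# Hypothetical belief over the noun that will follow the article.
noun_belief = {"apple": 0.6, "orange": 0.1, "pear": 0.2, "banana": 0.1}

VOWELS = "aeiou"


def article_for(noun):
    """Crude spelling-based rule; real English depends on pronunciation."""
    return "an" if noun[0].lower() in VOWELS else "a"


def article_distribution(belief):
    """Marginalise a belief over nouns into a distribution over 'a' / 'an'."""
    dist = {"a": 0.0, "an": 0.0}
    for noun, p in belief.items():
        dist[article_for(noun)] += p
    return dist


article_probs = article_distribution(noun_belief)
print(article_probs)
```

With these assumed numbers, most of the probability mass lands on "an" before any noun has been emitted, which is the sense in which a word-by-word sampler can still "anticipate" the next word.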

sharemywin 1214 days ago

I see 2 options:

1. we don't know what they (the coding layer between Bing and GPT) look up and store as a prompt, aka working memory.

2. it can do the equivalent of receiving its own prompt silently.

I've seen that with code it outputs the steps for the code, then writes the code.

so there's some kind of plan and execute going on. maybe it can do that in-model somehow

hackinthebochs 1214 days ago

>so there's some kind of plan and execute going on. maybe it can do that in-model somehow

The simple answer is that the internal state that picks the next token is stable over iterations so that the model can follow a consistent plan over multiple token outputs. Then as the plan "unfolds" in the output tokens, these tokens help stabilize the plan further, thus creating consistency over long generations.

psychphysic 1214 days ago

It's chosen by the n-gram, and randomly so; that does suggest it is completing the text a word at a time.

HarHarVeryFunny 1214 days ago

Did you check the Vonnegut writing rules example I posted at the top of this thread? In particular, look at Bing/GPT's explanation of how its cake story matches up to Vonnegut's rules. It's hard to imagine how it could have come up with such a coherent story, checking all the rules, if it was only conceiving of its continuing story on a word by word basis. It's not as if sentence #1 matches rule number 1, sentence 2 matches rule number 2, etc. It seems there had to be some holistic composition for it to do that.

Note too that despite the output being sampled from a distribution based on a "randomness" temperature, there are many cases where what it is trying to say so much constrains the output that certain words/synonyms/concepts are all but forced.
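
To make the temperature point concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are invented for illustration (`constrained` and `open_ended` are hypothetical), and this is the textbook formula rather than the sampling code of any particular model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits where the context strongly favours the first token.
constrained = [10.0, 2.0, 1.0]
# Hypothetical logits where several continuations are about equally plausible.
open_ended = [2.0, 1.9, 1.8]

# Probability of the top token at a fairly high temperature.
p_constrained_hot = softmax_with_temperature(constrained, 1.5)[0]
p_open_hot = softmax_with_temperature(open_ended, 1.5)[0]

for t in (0.7, 1.0, 1.5):
    print(t,
          round(softmax_with_temperature(constrained, t)[0], 4),
          round(softmax_with_temperature(open_ended, t)[0], 4))
```

In the constrained case the top token keeps nearly all of the probability even at the higher temperature, so the random sampling has very little room to change what gets said.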

theGnuMe 1214 days ago

Kurt Vonnegut is a conditional sub space of the embedding vectors.

hackinthebochs 1214 days ago

It's easy to see that it's not just doing one token at a time but is anticipating future tokens. Consider the context of a Q&A. The response might start with any of a number of words; exactly which word depends on what comes after. But if it randomly chooses the wrong word, it will either be forced to complete the wrong answer, or be backed into a corner and engage in circumlocutions to course-correct. This doesn't happen in practice for recent big models.

mungoman2 1215 days ago

Convolution is part of the network design though. Would a fully connected network learn to convolve? Or would it turn out that convolution is not necessary?

nerdponx 1215 days ago

The interesting part here isn't the convolution itself, it's how convolutional layers turn out to act like "filters" or "detectors" for individual features. This is explained very well in the distill.pub article linked by GP.

We know the architecture of LLMs because we created it, but we don't yet have the same level of understanding about them, or the same quality of analytical tools for reasoning about them.

xmcqdpt2 1214 days ago

They do, and in fact it's relatively straightforward to show empirically on e.g. MNIST. The problem is that you need a much, much larger network in the FCN case, and thus way more data and way more data augmentation, to get a good result that isn't overfit to hell.

In the case of CNNs, the reason it works is that an image of an object X is still an image of object X if the X is shifted left or right. Object identity is translationally invariant. CNNs are basically the simplest way to encode translational invariance.
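
A small 1D sketch of the shift property, in plain Python. Strictly speaking, the convolution itself is shift-equivariant (shift the input and the output shifts with it), and it is a pooling step on top that gives invariance. The kernel, the signal, and the helper names `conv1d_valid` and `shift_right` are all made up for illustration.

```python
def conv1d_valid(signal, kernel):
    """Cross-correlation (what deep learning calls convolution), no padding."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def shift_right(signal, n):
    """Shift by n positions, padding with zeros on the left."""
    return [0] * n + signal[:len(signal) - n]


kernel = [1, -1]                  # hypothetical edge detector
x = [0, 0, 3, 3, 0, 0, 0, 0]      # a small "object" on a blank background
x_shifted = shift_right(x, 2)     # the same object, moved two steps right

y = conv1d_valid(x, kernel)
y_shifted = conv1d_valid(x_shifted, kernel)

# Equivariance: convolving the shifted input equals shifting the output.
print(y_shifted == shift_right(y, 2))
# Invariance after global max pooling: the strongest response is unchanged.
print(max(y) == max(y_shifted))
```

The same two weights are reused at every position, which is why the detector does not need to be relearned for each location of the object.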

candiodari 1214 days ago

> CNNs are basically the simplest way to encode translational invariance

That's the geometric deep learning theory, isn't it? Do you know if there's a list somewhere of exactly which invariances can be encoded in which ways? Like an overview?

redox99 1215 days ago

Yes it would, or at least a similar operation.

The point of using a CNN instead of an FCN is that you force it to learn in a certain way that prevents overfitting. But given a sufficient dataset and proper data augmentation, you would expect an FCN to be able to identify objects regardless of translation. It's just that a CNN would train more easily and perform better, with a smaller network (an FCN doing convolutions would be very wasteful).
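
To illustrate the "wasteful" part: a convolution can be written as a fully connected layer whose weight matrix is mostly zeros with the kernel repeated along the diagonal. This is a rough sketch on a tiny 1D example; the sizes, values, and helper names (`conv_as_dense_matrix`, `matvec`) are assumptions for illustration only.

```python
def conv_as_dense_matrix(kernel, input_len):
    """Build the dense weight matrix that reproduces a no-padding 1D convolution."""
    k = len(kernel)
    out_len = input_len - k + 1
    matrix = [[0.0] * input_len for _ in range(out_len)]
    for row in range(out_len):
        for j in range(k):
            matrix[row][row + j] = kernel[j]  # same kernel, shifted per row
    return matrix


def matvec(matrix, vec):
    """Plain matrix-vector product, i.e. a bias-free fully connected layer."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


kernel = [1.0, 0.0, -1.0]
signal = [2.0, 5.0, 1.0, 4.0, 3.0, 7.0]

dense = conv_as_dense_matrix(kernel, len(signal))
dense_out = matvec(dense, signal)
direct_out = [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
              for i in range(len(signal) - len(kernel) + 1)]

dense_params = len(dense) * len(dense[0])  # weights a fully connected layer stores
conv_params = len(kernel)                  # weights the convolution shares

print(dense_out == direct_out, dense_params, conv_params)
```

Even at this toy size the dense layer holds 24 weights to express what the convolution does with 3, and an FCN would also have to discover that repeated structure from data instead of getting it for free.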

That's why traditionally you would pick your architecture to help it learn in a certain way (images=cnn, text=rnn/lstm/gru). But the nice thing about transformers is that they are more general.

ly3xqhl8g9 1215 days ago

Could a "type system" for neural weights be developed? Given a self-driving system, to be able to statically check that the neurons have the "Person" type, the "Don't Run Over Person" type, and so forth. What happens if you "transplant" the weights for ' an' to another network, some kind of transfer learning but componentized, does it still predict as accurately? If neural networks could be assembled from "types" it would be much easier to trust them.

simonh 1214 days ago

The way an LLM decides which word to use next is by evaluating the weightings of all the preceding words with every candidate word to calculate a probability for each of them. So if it selects ‘an’ as the next word, it’s because the weighting connecting ‘an’ to all the preceding words, and their orders in the text and relationships with each other predicted it should have a high probability of occurring.

So you can’t extract the weightings for ‘an’ discretely because those weightings encode its connection with all the other words and combinations and sequences or clusters of words it might ever be used with, including their weightings with other preceding words, and their relationships, etc, etc.

ly3xqhl8g9 1214 days ago

Right, but if there is such a thing as the very plastically named "Jennifer Aniston neuron" [1], and furthermore, group equivariant deep learning [2], maybe there is a way in which you can isolate a certain concept/"type", such as Person, Car, and so forth; perhaps not even isolate, but rehydrate the context of where the concept takes place: as a brain does in various word plays, as in Who's on First [3], etc.

Come to think of it, when someone teaches me a new concept, the principle of mass conservation, for instance, in some sense they are transferring their embedding into my brain, and further on I will relate to mass conservation through what that person taught me. The transfer is a very lossy process, sure, but a transfer with reintegration nonetheless. Perhaps "mortal computation" [4] is a requirement.

[1] https://en.wikipedia.org/wiki/Grandmother_cell

[2] https://www.youtube.com/playlist?list=PL8FnQMH2k7jzPrxqdYufo...

[3] https://www.youtube.com/watch?v=kTcRRaXV-fg

[4] Geoffrey Hinton, The Forward-Forward Algorithm: Some Preliminary Investigations, chapter 8, https://www.cs.toronto.edu/~hinton/FFA13.pdf

simonh 1214 days ago

> Right, but if there is such a thing as the very plastically named "Jennifer Aniston neuron"

Firstly, even if there is such a cell that only fires for one face, or perhaps also the person’s name, it doesn’t mean there aren’t other cells that fire for that person, or for people in general including that person. Without those as well, that neuron’s responses might not mean anything to the rest of the brain. It’s a thought experiment but never really demonstrated.

Also, even if this is true in the very strongest sense: say there is one neuron that uniquely and discretely fires in response to thinking about that one person. What defines a neuron isn’t just its internal behaviour. It’s also the pattern of inputs that influence it, and the pattern of outputs it sends out. It’s the connections and dependencies on the weightings and signals and responses from all the cells it’s connected to. Including the specific unique ways all those neurons are connected, or not connected, to all the other cells in the brain. It’s all the specifics of that connectedness that make the behaviour of that neuron meaningful.

If you took that neuron and implanted it into another brain, you’d need to hook it up to the neurons in that brain such that it gets exactly the same stimuli, in the same order, with the same strength, every time it needs to fire. The same applies to its output, all the neurons it’s connected to would have to interpret its firing behaviour in the exact same way the other neurons in the original brain did. But there’s no guarantee any of those connected mechanisms work or are physically connected in the same way, or even a vaguely similar or compatible way in the new brain.

fennecfoxy 1214 days ago

Well, given the more organic nature of machine learning and what it's trying to achieve I wouldn't be surprised if that same neuron also triggered to some degree for "Jennifer and Stefan" ahaha.

jerpint 1215 days ago

Do you think it would ever be possible to “maximize” a neuron with certain sentences? What’s so different from the gradient ascent techniques used with convolutions?

bilsbie 1214 days ago

Neat work! I’m still confused how it knows to use “an” if it hasn’t chosen the word after it yet?

sharemywin 1214 days ago

you might find this paper interesting:

https://arxiv.org/abs/2202.05262

Locating and Editing Factual Associations in GPT

dpaleka 1214 days ago

That paper (ROME) was the most famous paper in the field last year :)

See also new interesting developments breaking the connection between "Locating" and "Editing":

https://arxiv.org/abs/2301.04213

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models