

by behnamoh 988 days ago

I'm not sure it's even helpful to talk about the transformer at all[0]. Modern LLMs differ from the original transformer architecture in many details. The elephant in the room is that they're not even full transformers: they use only the decoder part.

[0]: One reason: Never once did I need to know the transformer architecture in order to be able to use these models (prompt engineering, chaining, working with local models, etc.).

I'd argue that knowing concepts such as RoPE, Mirostat, monkeypatching, etc. is much more important than knowing how transformer models work.

> I tend to stick with the higher level explanation that they can predict the next word (or next sentence) based on their training text,

I think the same way, but I think it reduces LLMs to "black boxes": many other models can also predict the next token based on probabilities. I think we need something that at least captures the general mechanism by which LLMs predict the next token.
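
To illustrate the "many other models" point, here is a minimal sketch of a bigram counter that also predicts the next token from probabilities. It is a toy, not how any LLM works; the corpus and the function names (`train_bigram`, `next_token_probs`) are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the follower counts for `prev` into a probability distribution."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # word never seen in training: the model has nothing to say
    return {word: n / total for word, n in followers.items()}


# Hypothetical training text, chosen only for illustration.
corpus = "the cat sat on the mat and the cat ate"
model = train_bigram(corpus)
probs = next_token_probs(model, "the")
print(probs)  # "cat" follows "the" twice, "mat" once
```

Both this and an LLM fit the description "predicts the next token from its training text", which is why that description alone says little about the mechanism.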

2 comments

jordanpg 988 days ago

It's adorable to see how many intro writings on the subject throw up the famous "Attention Is All You Need" diagram as if that explains anything whatsoever.


ulber 988 days ago

>I think we need something that at least captures the general mechanism by which LLMs predict the next token.

This bit varies a lot since the capabilities involved in prediction depend on the data. If the text is a math book and the prompt is "... three plus five apples is a total of " the crucial capability is arithmetic (plus of course NL capabilities). On the other hand, if you're completing a post from /r/relationship_advice the capabilities involved will be (vaguely) maintaining literary tone, theory of mind, psychology, etc. Within a text the capabilities needed will also vary a lot: you might need theory of mind at some crucial inflection points, but most of the time it's already clear what is going to be said and the model just has to get the wording right.

So, my take would be to really think hard about the data to understand how predictions might be made.


cowb0yl0gic 988 days ago

The way LLMs process text is a lot like a typical human (top-of-the-head answer vs. actually analyzing the question in depth). This is a recurring pattern with the current LLMs I've interacted with (e.g., A: "x" Q: "but what about y?" A: "oh, you're right, now that I [actually] think about it, really z"). Example (Anthropic's Claude):

    Q: if one basket contains 5 apples and another basket contains 3 plums, and another basket contains 2 walnuts how many fruits are there combined?
    A: The first basket contains *5 apples*. The second basket contains *3 plums*. The third basket contains *2 walnuts*. Adding up the numbers of fruits in each basket, there are a total of *5 + 3 + 2 = 10 fruits combined*.
    Q: is a walnut a fruit?
    A: You're right, a walnut is not technically a fruit. A walnut is a type of tree nut. To correctly answer the original question:
    The first basket contains 5 apples. The second basket contains 3 plums. While the third basket contains 2 walnuts, walnuts are not fruits. So the total number of fruits combined is 5 apples + 3 plums = 8 fruits.


PeterisP 987 days ago

An interesting fundamental limitation of transformer LLMs is that their architecture doesn't let the model choose to think a bit more or to review its answer: the quantity and structure of the computations it performs are fixed by the architecture, without any conditionals. But when you prompt it again, the model gets to run a second iteration with the result of the first iteration as input data, and that can enable it to do more.

Of course, to generalize this, we'd want a structure where the model itself can decide to continue working for one more iteration, and which also lets it retain some temporary work-in-progress notes (currently it can see only what the previous iteration output as part of the answer).
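
A rough sketch of that fixed-compute loop, under heavy assumptions: `toy_forward` is a made-up stand-in for a forward pass (its arithmetic is arbitrary), and the only thing it is meant to show is that every generated token passes through the same number of layer steps, while the only state carried between passes is the tokens already emitted.

```python
def toy_forward(tokens, n_layers=4):
    """Stand-in for one forward pass: a fixed number of 'layers', no conditionals.

    Returns (next_token, layer_steps). The arithmetic is arbitrary; only the
    fixed loop length matters for this illustration.
    """
    state = sum(tokens)
    steps = 0
    for _ in range(n_layers):  # same depth for every token, easy or hard
        state = (state * 31 + 7) % 101
        steps += 1
    return state, steps


def generate(prompt, n_new_tokens):
    """Autoregressive loop: each output token is appended and fed back in."""
    tokens = list(prompt)
    total_steps = 0
    for _ in range(n_new_tokens):
        nxt, steps = toy_forward(tokens)
        tokens.append(nxt)  # the emitted token is the only "note" kept
        total_steps += steps
    return tokens, total_steps


short, short_cost = generate([1, 2, 3], n_new_tokens=2)
longer, longer_cost = generate([1, 2, 3], n_new_tokens=10)
print(short_cost, longer_cost)  # 8 layer steps vs. 40
```

In this toy, the only way to spend more computation is to emit more tokens, which matches the observation that a second round of prompting lets the model do more.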


simonw 987 days ago

There's an interesting new paper about this problem: https://arxiv.org/abs/2310.02226

"Think before you speak: Training Language Models With Pause Tokens"

The basic idea is to train and run the LLM with "pause" tokens appended to the input. A pause token outputs nothing, but gives the model a chance to perform another round of operations on the way to the answer.
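
A minimal sketch of the inference side of that idea, assuming (as the paper's abstract describes) that pause tokens are appended to the input prefix and output extraction is delayed until after the last one. The token name `<pause>`, the helper names, and the toy `step_fn` are all hypothetical; a real model would need to be trained with the pause token for this to help.

```python
PAUSE = "<pause>"  # hypothetical token name, for illustration only


def with_pauses(prompt_tokens, n_pauses):
    """Append n_pauses pause tokens to the prompt."""
    return list(prompt_tokens) + [PAUSE] * n_pauses


def generate_after_pauses(step_fn, prompt_tokens, n_pauses, n_answer_tokens):
    """Start collecting output only after the last pause token.

    step_fn(context) -> next token. The pause positions are part of the
    context the model processes, but nothing is read out at those positions.
    """
    context = with_pauses(prompt_tokens, n_pauses)
    answer = []
    for _ in range(n_answer_tokens):
        token = step_fn(context)
        context.append(token)
        answer.append(token)
    return answer, context


def toy_step(context):
    """Toy stand-in for a model: names the token after the context length."""
    return f"tok{len(context)}"


answer, context = generate_after_pauses(toy_step, ["Q", ":"], n_pauses=3, n_answer_tokens=2)
print(answer)   # ['tok5', 'tok6']
print(context)  # prompt, three pauses, then the two answer tokens
```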


simonw 988 days ago

"three plus five apples is a total of..." is a really interesting example, because it doesn't actually require arithmetic at all.

A language model trained on enough text will be able to complete this just based on having encountered the pattern "three plus five SOMETHING is a total of..." enough times in its training data.

This becomes even more apparent when you work with smaller models, such as the 7B-class models that can run on a laptop. They can often solve small arithmetic problems like this while having no chance at all of working with larger numbers that they have never encountered in their training data.

I really like using those smaller models as tools to better understand how this technology works.
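
As a deliberately crude sketch of "pattern without arithmetic": the toy below only memorizes completions it has seen and does no addition at all. Real models generalize far more than a lookup table does, so this is an exaggeration of the point rather than a model of it; the training sentences and function names are invented for the example.

```python
import re

# Matches "<a> plus <b> <noun> is a total of <c>" and captures a, b, c.
SEEN = re.compile(r"(\w+) plus (\w+) \w+ is a total of (\w+)")
# Same pattern without the answer, for prompts to be completed.
PROMPT = re.compile(r"(\w+) plus (\w+) \w+ is a total of")


def train(sentences):
    """Memorize the completion for every (a, b) pair seen in training."""
    memory = {}
    for sentence in sentences:
        match = SEEN.search(sentence)
        if match:
            a, b, total = match.groups()
            memory[(a, b)] = total
    return memory


def complete(memory, prompt):
    """Return the memorized completion, or None for an unseen pair."""
    match = PROMPT.search(prompt)
    if not match:
        return None
    return memory.get((match.group(1), match.group(2)))


training_text = [
    "three plus five apples is a total of eight",
    "two plus two pears is a total of four",
]
memory = train(training_text)

seen = complete(memory, "three plus five oranges is a total of")
unseen = complete(memory, "forty plus sixty apples is a total of")
print(seen, unseen)  # eight None
```

The lookup answers correctly for a different noun ("oranges") because only the number words matter to the pattern, yet it has nothing to offer for numbers it never saw.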
