by simonw
988 days ago
I don't think it's particularly helpful to dive into transformer models, positional encodings and self-attention at the very start of an introduction to LLMs. Understanding how those work does little to help explain what LLMs can do and how you can use them. I tend to stick with the higher level explanation that they can predict the next word (or next sentence) based on their training text, and then emphasize that while this sounds pretty limited it's actually capable of doing all sorts of impressive things once you scale it up enough.
[0]: One reason: Never once did I need to know the transformer architecture in order to be able to use these models (prompt engineering, chaining, working with local models, etc.).
I'd argue that knowing concepts such as RoPE, Mirostat, monkeypatching, etc. is much more crucial than knowing how transformer models work.
> I tend to stick with the higher level explanation that they can predict the next word (or next sentence) based on their training text,
I think the same way, but I think it reduces LLMs to "black boxes": many other models can also predict next tokens based on probabilities. I think we need an explanation that at least captures the general mechanism by which LLMs predict the next token.
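To make the "many other models can also predict next tokens" point concrete, here is a minimal sketch of a bigram model, which is not an LLM at all but still fits the "predicts the next word based on its training text" description. The function names (`train_bigram`, `next_word_probs`) and the toy corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the follow-up counts for `word` into probabilities."""
    followers = counts.get(word.lower())
    if not followers:
        return {}  # word never seen with a successor in training
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

# Toy training text (hypothetical, purely for illustration).
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(probs)
```

This only ever looks at the single previous word, so the one-line explanation describes it just as well as it describes an LLM, which is why that explanation alone does not say much about the mechanism.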