by behnamoh
988 days ago
I'm not sure it's even helpful to talk about the transformer at all[0]. Modern LLMs differ from the pure transformer architecture in many details. The elephant in the room is that they're not even full transformers: they use only the decoder part.

[0]: One reason: never once did I need to know the transformer architecture in order to use these models (prompt engineering, chaining, working with local models, etc.). I'd argue that knowing concepts such as RoPE, Mirostat, monkeypatching, etc. is much more crucial than knowing how transformer models work.

> I tend to stick with the higher level explanation that they can predict the next word (or next sentence) based on their training text,

I think the same way, but I think it reduces LLMs to "black boxes": many other models can also predict the next token based on probabilities. I think we need something that at least captures the general mechanism by which LLMs predict the next token.