by polotics 560 days ago
You have described an RNN, I think. Don't attention heads add something that you could compare to a rough-and-ready understanding?

Auto-regressive LLMs do this, as I understand it, though they differ in whether they feed the combined input and output[1] through the whole net, like GPT-2 and friends, or just through the decoder[2]. I described the former, and I should have made that clear.

In either case you can "prime it" as was suggested.
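
To make the loop concrete, here is a rough sketch of what I mean by feeding the combined input and output back in. The `toy_next_token` function and the ten-token vocabulary are made up for illustration; a real model would run attention over the whole sequence at that point. Priming is just whatever prefix you start the sequence with.

```python
VOCAB_SIZE = 10  # hypothetical toy vocabulary of token ids 0..9


def toy_next_token(tokens):
    """Hypothetical stand-in for a full forward pass over the whole sequence.

    A real LLM would run attention over every token in `tokens`; this toy
    just derives the next id from the sum so the example stays runnable.
    """
    return sum(tokens) % VOCAB_SIZE


def generate(prompt, n_new):
    """Append n_new tokens, re-feeding the prompt plus all output so far."""
    tokens = list(prompt)  # the prompt is the "priming" prefix
    for _ in range(n_new):
        # The only feedback path is the growing token sequence itself.
        tokens.append(toy_next_token(tokens))
    return tokens


primed = generate([1, 2, 3], n_new=3)
unprimed = generate([0], n_new=3)
print(primed, unprimed)
```

The two calls differ only in the prefix, and every later token depends on it, which is all "priming" amounts to in this picture.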

A regular RNN has more feedback[3], with each layer feeding its own hidden state back into itself at the next time step, as I understand it.
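
A rough sketch of that per-layer feedback, as I understand it. The scalar weights `w_x` and `w_h` and the two-layer stack are arbitrary choices for illustration, not taken from any real model; the point is only that each layer keeps a hidden state that it reads again on the next step.

```python
import math


def rnn_step(x, h, w_x=0.5, w_h=0.8):
    """One recurrent cell: the new state depends on the input and the old state."""
    return math.tanh(w_x * x + w_h * h)


def run_two_layer_rnn(inputs):
    """Run a toy two-layer RNN over a sequence of scalar inputs."""
    h1, h2 = 0.0, 0.0  # one hidden state per layer, carried across time steps
    outputs = []
    for x in inputs:
        h1 = rnn_step(x, h1)   # layer 1 reads the input and its own previous state
        h2 = rnn_step(h1, h2)  # layer 2 reads layer 1's output and its own previous state
        outputs.append(h2)
    return outputs


outputs = run_two_layer_rnn([1.0, 1.0, 1.0])
print(outputs)
```

Feeding the same input three times gives three different outputs, because the hidden states carry information forward from earlier steps.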

Happy to be corrected though.

[1]: https://jalammar.github.io/illustrated-gpt2/#one-difference-...

[2]: https://medium.com/@ikim1994914/understanding-the-modern-llm...

[3]: https://karpathy.github.io/2015/05/21/rnn-effectiveness/