AFAIK you're describing something akin to a recurrent neural network, and the problem with that is that it doesn't parallelize well on modern hardware. It also suffers from vanishing gradients.
I had the same thought as the comment you're responding to.
Recurrent neural networks do badly when the recurrence chain is 100 steps long or more. And at one token per step, you need chains that long just to process a single paragraph.
But if you use an RNN around a Transformer-based LLM, then you're adding +4K or +8K tokens per recurrence step, not +1.
E.g.: GPT-4 32K would need just 4 RNN steps to reach 128K tokens!
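To put rough numbers on that, here's a back-of-the-envelope sketch. It only does the step-count arithmetic, not any actual model; `recurrence_steps` is a made-up helper, and I'm assuming K = 1024 and that each recurrence step consumes one full chunk of new tokens.

```python
def recurrence_steps(total_tokens: int, tokens_per_step: int) -> int:
    """Number of recurrence steps needed to cover total_tokens (hypothetical helper)."""
    # Integer ceiling division: a partially filled last chunk still costs one step.
    return (total_tokens + tokens_per_step - 1) // tokens_per_step


K = 1024  # assumption: "K" means 1024 tokens
TARGET = 128 * K

# Classic token-at-a-time RNN vs. an RNN wrapped around a chunk-processing LLM.
for label, chunk in [("1 token", 1), ("4K", 4 * K), ("8K", 8 * K), ("32K", 32 * K)]:
    print(f"{label:>7} per step -> {recurrence_steps(TARGET, chunk)} steps")
```

So the chain length drops from 131072 steps at one token per step to 32, 16, or 4 steps with 4K, 8K, or 32K chunks, all well under the ~100-step range where recurrence starts to hurt.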