by thesz
6 days ago
> doesn't that mean we're approaching optimality?
No. Transformers are Markov chains [1]. Somewhere on this fascinating site [2] I read that stateful models have an advantage. The author gave an example: a state machine with two states, A and B. In state A it either stays in A (output 0) or moves to B (output 1), with equal probability; in state B it always moves to A and always outputs 1. For this machine, a predictor with just one bit of memory can make the optimal prediction that ones always come in pairs, whereas a finite-order Markov chain on the outputs will only approximate this prediction and never reach optimality.

[1] https://arxiv.org/abs/2410.02724
[2] https://bactra.org/
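
For anyone who wants to poke at this, here is a rough sketch of the two-state machine, not taken from either link. It compares average log loss (bits per symbol) of a predictor that tracks one bit of state against the best order-k Markov model fitted to the same sample. The function names, the sample size, the seed, and the choice of log loss as the metric are all my own assumptions. I use log loss because with plain right/wrong guessing the gap does not show up: in state A every guess is a coin flip anyway.

```python
import math
import random
from collections import defaultdict

def generate(n, seed=0):
    """Emit n outputs from the two-state machine, starting in state A."""
    rng = random.Random(seed)
    in_b, out = False, []
    for _ in range(n):
        if in_b:
            out.append(1)        # B -> A, always outputs 1
            in_b = False
        elif rng.random() < 0.5:
            out.append(0)        # A -> A, outputs 0
        else:
            out.append(1)        # A -> B, outputs 1
            in_b = True
    return out

def stateful_log_loss(seq):
    """One bit of memory: are we in B? Assumes the sequence starts in A."""
    in_b, bits = False, 0.0
    for x in seq:
        p_one = 1.0 if in_b else 0.5          # in B the next output is surely 1
        p = p_one if x == 1 else 1.0 - p_one
        bits += -math.log2(p)
        in_b = (x == 1 and not in_b)          # a 1 seen in A moves us to B
    return bits / len(seq)

def markov_log_loss(seq, k):
    """In-sample log loss of the best order-k Markov model,
    i.e. the empirical conditional entropy given the last k outputs."""
    counts = defaultdict(lambda: [0, 0])
    for i in range(k, len(seq)):
        counts[tuple(seq[i - k:i])][seq[i]] += 1
    bits, total = 0.0, 0
    for c0, c1 in counts.values():
        n = c0 + c1
        for c in (c0, c1):
            if c:                              # zero counts contribute nothing
                bits += -c * math.log2(c / n)
        total += n
    return bits / total

if __name__ == "__main__":
    seq = generate(200_000)
    print("stateful:", round(stateful_log_loss(seq), 4))
    for k in (1, 2, 4, 8):
        print(f"order-{k} Markov:", round(markov_log_loss(seq, k), 4))
```

The stateful predictor should land near 2/3 bit per symbol, since the machine spends about two thirds of its time in A, where each output costs one bit, and the rest in B, where the output is free. The Markov numbers should creep down toward that as k grows but stay above it, because a window of all ones never tells you whether you are at the start or the middle of a pair.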