Hacker News
by Straw 840 days ago
It's distinct, but not very: it's an EMA without assuming uniform time. The stability of an EMA has nothing to do with integrators in control theory, and neither do these models.
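
To make "EMA without assuming uniform time" concrete, here is a rough sketch of one common way to write it. The per-step decay `exp(-dt / tau)`, the name `irregular_ema`, and the `tau` parameter are my own illustrative assumptions, not the parameterization any particular model uses.

```python
import math

def irregular_ema(samples, tau):
    """Exponential moving average over (timestamp, value) pairs.

    Assumption for illustration: the decay for each step is
    exp(-dt / tau), where dt is the gap since the previous sample.
    """
    state = None
    prev_t = None
    out = []
    for t, x in samples:
        if state is None:
            # The first sample initializes the average.
            state = x
        else:
            decay = math.exp(-(t - prev_t) / tau)
            # Convex combination: the decay lies in (0, 1] for dt >= 0,
            # so the state stays between the old state and the new sample.
            state = decay * state + (1.0 - decay) * x
        prev_t = t
        out.append(state)
    return out

# Unevenly spaced samples: a longer gap pulls the average
# further toward the new value.
print(irregular_ema([(0.0, 0.0), (0.1, 1.0), (5.0, 1.0)], tau=1.0))
```

The update is always a weighted average of the old state and the new sample, so the output cannot leave the range of the inputs. That is where the stability comes from in this sketch.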

These models aren't really RNNs: they have only a linear gate, which cannot depend on previous tokens at this layer, so they can't update their state in a way that depends very much on the current state.
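
A toy scalar comparison of what I mean, with everything here (`gated_linear_step`, `tanh_rnn_step`, the sigmoid gate, the weights) made up for illustration rather than taken from any real architecture. When the gate is a function of the current input only, the update is affine in the previous state, so its sensitivity to that state is the same wherever the state happens to be. A tanh RNN step does not have that property.

```python
import math

def gated_linear_step(h, x, w_gate=1.0):
    # Hypothetical gate: computed from the current input x only, never from h.
    a = 1.0 / (1.0 + math.exp(-w_gate * x))
    # Affine in h: the coefficient on h does not depend on h.
    return a * h + (1.0 - a) * x

def tanh_rnn_step(h, x, w_h=1.0, w_x=1.0):
    # Classic nonlinear recurrence: the update depends on h through tanh.
    return math.tanh(w_h * h + w_x * x)

def sensitivity(step, h, x, eps=1e-6):
    # Central finite difference of the next state with respect to h.
    return (step(h + eps, x) - step(h - eps, x)) / (2.0 * eps)

for h in (0.0, 5.0):
    print(h,
          sensitivity(gated_linear_step, h, 0.3),
          sensitivity(tanh_rnn_step, h, 0.3))
```

In the gated linear case the two sensitivities match (both equal the gate value), while the tanh step's sensitivity collapses toward zero once the state is large.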