Hacker News new | ask | show | jobs
by namibj 918 days ago
Yeah, real world is messy. Also, the contribution/influence of ancient state in softmax is something the controller can learn, especially with a task-suitable position encoding. Though I'd not be surprised if what's IIUC called polynomial attention (essentially truncated Taylor series "FIR", just truncated later than the traditional linear convolutional time-series filter) where you do bounded-exponent non-linear (but IIUC still FFT-based, or at least, similar) response rather than infinite-exponent softmax, turns out to be more suitable.

And beyond that, a hierarchical controller: exploit tight feedback loop with a small controller, supervised, controlled, and managed by the big one that has some inference latency and would like to be batched somewhat (e.g., think a casual transformer trained to predict more than just one token into the future).