| HN Mirror



	by gautamcgoel 923 days ago
	Author here. Thanks for this thoughtful comment.

	Regarding your first point: yes, this is sloppy writing on my part. The Kalman Filter is always mean-square optimal among all linear estimators, but as you say, it is only optimal among all causal estimators when the disturbances are Gaussian. Nice catch - I will clarify this point in a future version of the paper.

	Regarding your second point: yes, when H = 1 we just recover the standard Kalman Filter, and yes, when H grows large the estimate gets worse and worse, in the sense that the softmax nonlinearity includes more and more irrelevant data from the past in the estimate. The point is that in real-world problems, which are usually messy and nonlinear, we probably want H - the so-called context length - to be large, because then we can take advantage of information we collected in the past to help improve decisions in the present. It just so happens that in the special case when the system is linear, this is more harmful than helpful.

	Here is one way to think about our result: imagine you have a Transformer which takes as input K-dimensional embeddings and context length H. You want to use this Transformer for filtering in some dynamical system. The most basic question you could ask is: if the system is linear, can you do Kalman Filtering? In other words, in the easy, linear scenario, can you match the optimal algorithm? If the answer is no, I see no reason to see why you should expect it to work in harder, nonlinear settings. We show that the answer is yes, when the system you want to filter in has roughly sqrt(K) states, and you design the embeddings appropriately. Hopefully this preliminary result will lead to a better understanding of how deep learning can improve control in the hard, nonlinear scenario.
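To make the role of H concrete, here is a toy sketch of a softmax-weighted average over the last H items. It is not the construction from the paper: the function names, the scores, and the values are all made up for illustration. It only shows the generic behaviour described above, namely that with H = 1 the softmax weight collapses to 1 on the latest item, while a larger H lets older items leak into the estimate.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def windowed_estimate(values, scores, H):
    """Softmax-weighted average of the last H values (hypothetical helper).

    `values` stand in for per-step quantities and `scores` for their
    attention logits; both are illustrative, not taken from the paper.
    """
    weights = softmax(scores[-H:])
    return sum(w * v for w, v in zip(weights, values[-H:]))

values = [1.0, 2.0, 5.0]   # oldest to newest
scores = [0.0, 0.0, 3.0]   # the newest item gets the largest logit

latest_only = windowed_estimate(values, scores, H=1)   # weight 1 on the newest value
with_history = windowed_estimate(values, scores, H=3)  # older values get nonzero weight
```

With H = 1 the result is exactly the newest value. With H = 3 the older values pull the average away from it, however small their weights are.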

3 comments

donquichotte 923 days ago

> I see no reason to see why you should expect it to work in harder, nonlinear settings

I'm not so sure about this, maybe this is where the ML approach could outperform (in terms of estimation accuracy, not compute time) the traditional EKF and UKF approaches, by learning the nonlinear system dynamics?

This sounds very hand-wavy, and it is, because of my lack of understanding. For me it is just not immediately clear that if an optimal algorithm for the linear case cannot be matched or outperformed, that is also necessarily the case for nonlinear dynamics.

EDIT: And as mentioned above, the KF is optimal if certain conditions hold, e.g. additive, zero-mean, Gaussian noise on state dynamics and observation. In reality, you may have a multiplicative noise component, a non-zero mean, or fancy noise distributions, and it would be interesting to see if these can be learned.
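For reference, the setting in which those conditions hold can be sketched as a scalar Kalman filter. This is a minimal illustration under an assumed model, x[t+1] = a*x[t] + w and y[t] = c*x[t] + v with additive, zero-mean Gaussian w and v. The parameter values and function names are hypothetical and not taken from the paper.

```python
import random

def kalman_step(x_est, p_est, y, a, c, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    Assumed model: x[t+1] = a*x[t] + w,  y[t] = c*x[t] + v,
    with w ~ N(0, q) and v ~ N(0, r), both additive and zero-mean.
    """
    # Predict: push the estimate and its variance through the dynamics.
    x_pred = a * x_est
    p_pred = a * p_est * a + q
    # Update: blend prediction and measurement using the Kalman gain.
    k = p_pred * c / (c * p_pred * c + r)
    x_new = x_pred + k * (y - c * x_pred)
    p_new = (1.0 - k * c) * p_pred
    return x_new, p_new

def simulate(steps=2000, a=0.95, c=1.0, q=0.1, r=1.0, seed=0):
    """Run the filter on simulated data; return (filter MSE, raw-measurement MSE)."""
    rng = random.Random(seed)
    x, x_est, p_est = 0.0, 0.0, 1.0
    err_filter = err_raw = 0.0
    for _ in range(steps):
        x = a * x + rng.gauss(0.0, q ** 0.5)   # true state with process noise
        y = c * x + rng.gauss(0.0, r ** 0.5)   # noisy observation
        x_est, p_est = kalman_step(x_est, p_est, y, a, c, q, r)
        err_filter += (x_est - x) ** 2
        err_raw += (y / c - x) ** 2
    return err_filter / steps, err_raw / steps

mse_filter, mse_raw = simulate()
```

Swapping the `rng.gauss` calls for multiplicative or non-zero-mean noise is the kind of mismatch the EDIT above describes; the update equations no longer match the data-generating process in that case.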

namibj 923 days ago

Yeah, real world is messy. Also, the contribution/influence of ancient state in softmax is something the controller can learn, especially with a task-suitable position encoding. Though I'd not be surprised if what's IIUC called polynomial attention (essentially truncated Taylor series "FIR", just truncated later than the traditional linear convolutional time-series filter) where you do bounded-exponent non-linear (but IIUC still FFT-based, or at least, similar) response rather than infinite-exponent softmax, turns out to be more suitable.
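One possible reading of the "truncated Taylor series" idea, sketched here purely as an assumption: replace exp in the softmax with its degree-2 Taylor polynomial, 1 + x + x^2/2, and normalise as usual. The names `taylor_exp` and `poly_weights` are made up for this illustration and do not refer to any particular library or paper. Only even degrees are allowed because the even-degree truncations of exp stay positive for every real input, so the weights remain valid.

```python
import math

def taylor_exp(x, degree=2):
    """Taylor polynomial of exp(x) around 0, truncated at `degree`."""
    return sum(x ** k / math.factorial(k) for k in range(degree + 1))

def poly_weights(scores, degree=2):
    """Normalised weights using the truncated polynomial instead of exp."""
    if degree % 2 != 0:
        raise ValueError("use an even degree so every weight stays positive")
    vals = [taylor_exp(s, degree) for s in scores]
    total = sum(vals)
    return [v / total for v in vals]

def softmax_weights(scores):
    """Ordinary softmax, for comparison."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.0, 1.0, 4.0]            # hypothetical attention logits
poly = poly_weights(scores)         # bounded-exponent weighting
soft = softmax_weights(scores)      # "infinite-exponent" weighting
```

On these logits the polynomial version concentrates less mass on the largest score than softmax does, since the polynomial grows more slowly than exp for large positive inputs.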

And beyond that, a hierarchical controller: exploit a tight feedback loop with a small controller, supervised, controlled, and managed by the big one that has some inference latency and would like to be batched somewhat (e.g., think a causal transformer trained to predict more than just one token into the future).

brosco 923 days ago

Thanks for clarifying the motivation, that makes a lot of sense.

johntiger1 923 days ago

You also have the AISTATS paper prep page at the very bottom :)
