by gautamcgoel
923 days ago
Author here. Thanks for this thoughtful comment. Regarding your first point: yes, this is sloppy writing on my part. The Kalman Filter is always mean-square optimal among all linear estimators, but as you say, it is only optimal among all causal estimators when the disturbances are Gaussian. Nice catch - I will clarify this point in a future version of the paper.
Regarding your second point: yes, when H = 1 we just recover the standard Kalman Filter, and yes, as H grows large the estimate gets worse and worse, in the sense that the softmax nonlinearity includes more and more irrelevant data from the past in the estimate. The point is that in real-world problems, which are usually messy and nonlinear, we probably want H - the so-called context length - to be large, because then we can take advantage of information we collected in the past to help improve decisions in the present. It just so happens that in the special case when the system is linear, this is more harmful than helpful.
Here is one way to think about our result: imagine you have a Transformer which takes as input K-dimensional embeddings and has context length H. You want to use this Transformer for filtering in some dynamical system. The most basic question you could ask is: if the system is linear, can you do Kalman Filtering? In other words, in the easy, linear scenario, can you match the optimal algorithm? If the answer is no, I see no reason why you should expect it to work in harder, nonlinear settings. We show that the answer is yes, when the system you want to filter has roughly sqrt(K) states and you design the embeddings appropriately. Hopefully this preliminary result will lead to a better understanding of how deep learning can improve control in the hard, nonlinear scenario.
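For anyone following along who wants the baseline spelled out, here is a minimal sketch of one predict/update step of the textbook Kalman Filter for a linear system. This is not the construction from the paper, only the standard recursion that the Transformer is being compared against. The names `kalman_step`, `A`, `C`, `Q`, `R` and the toy position/velocity model are my own assumptions for illustration.

```python
import numpy as np

def kalman_step(x_hat, P, y, A, C, Q, R):
    """One step of the textbook Kalman Filter (illustrative notation).

    Assumed model: x_next = A x + w,  y = C x + v,
    with w ~ N(0, Q) and v ~ N(0, R).
    """
    # Predict: push the estimate and its covariance through the dynamics.
    x_pred = A @ x_hat
    P_pred = A @ P @ A.T + Q
    # Update: blend the prediction with the new measurement y.
    S = C @ P_pred @ C.T + R              # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (y - C @ x_pred)
    P_new = (np.eye(len(x_hat)) - K @ C) @ P_pred
    return x_new, P_new

def run_demo(steps=200, seed=0):
    """Filter a toy position/velocity system (hypothetical parameters)."""
    rng = np.random.default_rng(seed)
    A = np.array([[1.0, 1.0], [0.0, 1.0]])
    C = np.array([[1.0, 0.0]])            # only position is measured
    Q = 0.01 * np.eye(2)
    R = np.array([[1.0]])
    x = np.zeros(2)
    x_hat, P = np.zeros(2), np.eye(2)
    sq_err_filter, sq_err_raw = [], []
    for _ in range(steps):
        x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
        y = C @ x + rng.normal(0.0, 1.0, size=1)
        x_hat, P = kalman_step(x_hat, P, y, A, C, Q, R)
        sq_err_filter.append((x_hat[0] - x[0]) ** 2)
        sq_err_raw.append((y[0] - x[0]) ** 2)
    # Mean squared position error: filtered estimate vs. raw measurement.
    return float(np.mean(sq_err_filter)), float(np.mean(sq_err_raw))
```

Calling `run_demo()` returns the mean squared position error of the filtered estimate and of the raw measurements, which gives a rough sense of what "matching the optimal algorithm" means in the linear Gaussian case.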
I'm not so sure about this. Maybe this is where the ML approach could outperform the traditional EKF and UKF approaches (in terms of estimation accuracy, not compute time), by learning the nonlinear system dynamics?
This sounds very hand-wavy, and it is, because of my lack of understanding. For me it is just not immediately clear that if an optimal algorithm for the linear case cannot be matched or outperformed, that is also necessarily the case for nonlinear dynamics.
EDIT: And as mentioned above, the KF is optimal only if certain conditions hold, e.g. additive, zero-mean, Gaussian noise on the state dynamics and observations. In reality, you may have a multiplicative noise component, a non-zero mean, or fancier noise distributions, and it would be interesting to see if these can be learned.