Observer Optimization

Now that we have some tools to study noise processes and their effects on state observers, it is time to return to the variance propagation equations, to see what they can tell us about the observer gains that will give the best tracking of the system state.

As always when discussing optimizations... optimizing what, exactly? In this case, the goal is a Wiener optimum: one that minimizes the sum-of-squares error in the state estimates. Equivalently, it minimizes the variance of the state estimation error, given the information observable in the input and output sequences.

"Completing the Square" lemma

We are going to need another peculiar little formula, to be used in just a moment.

Do you remember the completing the square technique from secondary school algebra for factoring polynomials? There is a similar kind of formulation for matrices. It takes an expression where certain terms appear both in first-order and second-order forms, and allows them to be recombined so that those terms exist only in second-order form.

For compatible matrix terms α and β:

- α β^{T} - β α^{T} + α α^{T} \Leftrightarrow [α - β] \cdot [α - β]^{T} - β β^{T}

Proof: expand the right side to sum-of-products form. Observe that some of the terms cancel, and what remains is the desired result.

[α - β] \cdot [α - β]^{T} - β β^{T} = (α α^{T} - α β^{T} - β α^{T} + β β^{T}) - β β^{T}

= α α^{T} - α β^{T} - β α^{T}
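For readers who prefer a numerical sanity check to an algebraic one, the lemma can be sketched in a few lines of Python. The matrices `alpha` and `beta` below are arbitrary illustrative values, and the list-of-rows helpers are assumptions made only to keep the sketch free of external libraries.

```python
# Tiny list-of-rows matrix helpers, to keep the sketch dependency-free.
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Arbitrary illustrative values; any compatible matrices would do.
alpha = [[1.0, 2.0], [3.0, 4.0]]
beta = [[0.5, -1.0], [2.0, 0.25]]

# Left side of the lemma: -alpha beta^T - beta alpha^T + alpha alpha^T
left = sub(sub(matmul(alpha, transpose(alpha)), matmul(alpha, transpose(beta))),
           matmul(beta, transpose(alpha)))

# Right side of the lemma: [alpha - beta][alpha - beta]^T - beta beta^T
diff = sub(alpha, beta)
right = sub(matmul(diff, transpose(diff)), matmul(beta, transpose(beta)))

# Largest entry-wise difference between the two sides.
max_gap = max(abs(x - y) for rl, rr in zip(left, right) for x, y in zip(rl, rr))
print("largest entry-wise gap:", max_gap)
```

A gap at the level of floating-point rounding is what the lemma predicts; a single numerical case is of course no substitute for the proof above.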

Variance Propagation in an Observer

In previous examinations of the propagation of random variations through the observer equations, we reached the following point.

e_{i + 1} = (I - K C) A e_{i} + (- K) v_{i} + (I - K C) w_{i}

where:

e is the error in the observer state estimates
w is a variable representing noise disturbances coupled into the state
v is a variable representing noise disturbances affecting observations
A is the system state transition matrix 
C is the observation matrix relating system state to outputs
K is a matrix of observer gains for the state corrections
B is the system input coupling matrix, omitted here because the known input terms cancel out of the error equation

Covariance matrices were defined for the random vector terms.

P \Leftrightarrow E (e e^{T})

Q \Leftrightarrow E (w w^{T})

R \Leftrightarrow E (v v^{T})

Using variance properties for additivity and transformations, we can derive from the state transition equations above the corresponding variance propagation equations.

P_{i + 1} = (I - K C) A P_{i} A^{T} {(I - K C)}^{T} + K R K^{T} + (I - K C) Q {(I - K C)}^{T}

The value of the observer gain matrix K is under our control. The variance propagation equation says what will happen as a consequence of that choice, propagating over time to produce the next state variance P.
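As a rough illustration of "what will happen as a consequence of that choice", the variance propagation equation can be iterated numerically. Everything concrete in this sketch is an assumption made for the example: the two-state system, the noise covariances, the hand-picked gain `K`, and the helper names.

```python
# Tiny list-of-rows matrix helpers, to keep the sketch dependency-free.
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def propagate(P, K, A, C, Q, R):
    """One step: (I-KC) A P A^T (I-KC)^T + K R K^T + (I-KC) Q (I-KC)^T."""
    M = sub(identity(len(A)), matmul(K, C))          # I - K C
    MT = transpose(M)
    state_term = matmul(matmul(matmul(M, A), matmul(P, transpose(A))), MT)
    meas_term = matmul(matmul(K, R), transpose(K))   # observation noise
    dist_term = matmul(matmul(M, Q), MT)             # state disturbance noise
    return add(add(state_term, meas_term), dist_term)

# Hypothetical two-state system with a single measured output.
A = [[1.0, 1.0], [0.0, 1.0]]
C = [[1.0, 0.0]]
Q = [[0.01, 0.0], [0.0, 0.01]]
R = [[0.25]]
K = [[0.5], [0.1]]      # hand-picked gain, not optimized in any way

P = identity(2)         # assumed initial error variance
for _ in range(200):
    P = propagate(P, K, A, C, Q, R)
print(P)
```

With this particular gain the iteration settles to a steady variance. A different choice of `K` settles somewhere else (or not at all), which is exactly why the choice of gain matters.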

What we would like to do is choose the value of gain matrix K in such a way that the theoretically minimum amount of noise propagates to the next time step.

Restructuring the equations

First, make things messier by expanding the terms of the variance propagation equation.

P_{i + 1} = A P_{i} A^{T} - K C A P_{i} A^{T} - A P_{i} A^{T} C^{T} K^{T} + K C A P_{i} A^{T} C^{T} K^{T}

+ Q - K C Q - Q C^{T} K^{T} + K C Q C^{T} K^{T} + K R K^{T}

P_{i + 1} = (A P_{i} A^{T} + Q) - K C (A P_{i} A^{T} + Q) - (A P_{i} A^{T} + Q) C^{T} K^{T}

+ K C (A P_{i} A^{T} + Q) C^{T} K^{T} + K R K^{T}

Now define two simplifying notations.

W \Leftrightarrow (A P_{i} A^{T} + Q)

X \Leftrightarrow (C W C^{T} + R)

After introducing these notations, things do not look nearly so bad. The variance propagation equations can be rewritten as follows.

P_{i + 1} = W - K C W - W C^{T} K^{T} + K X K^{T}

But we're in a little trouble here. We have an expression that is quadratic in matrix terms, a Riccati Equation.^[1] We can describe the leftmost term as independent of gain matrix K, the rightmost term as quadratic in K, and the two remaining terms in the middle as linear in K. We are now going to apply the "complete the square" lemma.

Since X is a variance expression, it is symmetric, and it is also positive definite whenever the observation noise variance R is. Those are the properties needed to factor it into lower and upper triangular factors using a Cholesky decomposition.

X \Leftrightarrow (y y^{T})
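To make the factorization concrete, here is a minimal sketch of a textbook Cholesky routine in plain Python. The 2×2 matrix `X` is a made-up example and the function name `cholesky` is an assumption of this sketch; in practice a numerical library routine would normally be used instead.

```python
import math

def cholesky(X):
    """Lower-triangular y with y y^T = X, for symmetric positive definite X."""
    n = len(X)
    y = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            # Contribution of the columns already computed.
            s = sum(y[r][k] * y[c][k] for k in range(c))
            if r == c:
                y[r][c] = math.sqrt(X[r][r] - s)      # diagonal entry
            else:
                y[r][c] = (X[r][c] - s) / y[c][c]     # below-diagonal entry
    return y

# Made-up symmetric positive definite matrix standing in for X.
X = [[4.0, 2.0], [2.0, 5.0]]
y = cholesky(X)

# Multiply the factor by its own transpose to recover X.
n = len(X)
rebuilt = [[sum(y[r][k] * y[c][k] for k in range(n)) for c in range(n)]
           for r in range(n)]
print(y)        # [[2.0, 0.0], [1.0, 2.0]]
print(rebuilt)  # [[4.0, 2.0], [2.0, 5.0]]
```

The lower triangular factor plays the role of y, and its transpose is the upper triangular factor.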

By picking the α term appropriately, it is very easy to make this term of the lemma correspond to the "quadratic" term in the reconstituted variance expression. Then, we can accordingly pick the β term so that the lemma matches the two "linear" terms in the expression above.

α \Leftrightarrow K y

β \Leftrightarrow W C^{T} {(y^{T})}^{- 1}

α β^{T} = (K y) (y^{- 1} C W) = K C W
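The other three products in the lemma work out the same way. The worked steps below use only the definitions of α, β, and X given here, plus the fact that W is symmetric (it is a variance expression, so W^{T} = W).

```latex
\begin{aligned}
\beta \alpha^{T} &= W C^{T} (y^{T})^{-1} \, y^{T} K^{T} = W C^{T} K^{T} \\
\alpha \alpha^{T} &= K y \, y^{T} K^{T} = K X K^{T} \\
\beta \beta^{T} &= W C^{T} (y^{T})^{-1} \, y^{-1} C W
                 = W C^{T} (y y^{T})^{-1} C W
                 = W C^{T} X^{-1} C W
\end{aligned}
```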

The fully expanded form of the "complete the square lemma" with these two substitutions is then as follows.

- W C^{T} X^{- 1} C W + [K y - W C^{T} {(y^{T})}^{- 1}] \cdot [K y - W C^{T} {(y^{T})}^{- 1}]^{T} = - K C W - W C^{T} K^{T} + K y y^{T} K^{T}

Making a massive substitution of this "complete the square" equivalence into the previous variance propagation expression yields the following.

P_{i + 1} = W - W C^{T} X^{- 1} C W + [K y - W C^{T} {(y^{T})}^{- 1}] \cdot [K y - W C^{T} {(y^{T})}^{- 1}]^{T}

Whew! This expression is a little bit more complicated than what we had originally, but it has two special features: an additional "independent of K" term, and no terms "linear in K." The serious work is done.

Now the path to an optimum is clear. There is nothing that adjustments to K can do to improve the terms that are independent of K. There are no linear in K terms to worry about. The squared term is a matrix multiplied by its own transpose, so the variances on its diagonal can never be negative. If we can find a value of K that makes the squared dependency part go to zero, that is the best possible. Picking the following, and substituting this back into the variance propagation expression, it is clear that the desired result is achieved.

K = W C^{T} {(y^{T})}^{- 1} y^{- 1}

K = W C^{T} X^{- 1}
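The claim that this gain is the best possible can be probed numerically. The sketch below assumes a hypothetical two-state system with a single output, so that X is 1×1 and its inverse is an ordinary division. It computes the gain from the formula above and compares the total variance (the trace of the next P) against slightly nudged gains. All numbers and helper names are illustrative assumptions.

```python
# Tiny list-of-rows matrix helpers, to keep the sketch dependency-free.
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

def next_variance(K, W, C, X):
    """W - K C W - W C^T K^T + K X K^T, valid for any gain K (W symmetric)."""
    KCW = matmul(matmul(K, C), W)
    KXK = matmul(matmul(K, X), transpose(K))
    return add(sub(sub(W, KCW), transpose(KCW)), KXK)

# Hypothetical system and noise variances.
A = [[1.0, 1.0], [0.0, 1.0]]
C = [[1.0, 0.0]]
Q = [[0.01, 0.0], [0.0, 0.01]]
R = [[0.25]]
P = [[1.0, 0.0], [0.0, 1.0]]

W = add(matmul(matmul(A, P), transpose(A)), Q)
X = add(matmul(matmul(C, W), transpose(C)), R)    # 1x1 for a single output

# K = W C^T X^{-1}; with a 1x1 X the inverse is a plain division.
K_opt = [[entry / X[0][0] for entry in row] for row in matmul(W, transpose(C))]
P_opt = next_variance(K_opt, W, C, X)
best = trace(P_opt)

# Nudge each gain entry up and down and record the resulting total variance.
nudged = []
for r in range(len(K_opt)):
    for step in (-0.1, 0.1):
        K_try = [row[:] for row in K_opt]
        K_try[r][0] += step
        nudged.append(trace(next_variance(K_try, W, C, X)))
print("optimal:", best, "best nudged:", min(nudged))
```

Every nudged gain yields a larger total variance than `K_opt` in this example, as the completed-square form predicts.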

If you use this gain for your observer, you have the Wiener-optimal observer — you have a Kalman Filter.

We are not going to take this further. To fully explore the possibilities, you need a clear and comprehensive study of Kalman Filters.^[2] Some surprising simplifications happen when terms that were once very complicated go to zero. Use those formulas carefully. They might apply ONLY under the conditions that Kalman optimal observer gains are used. For various practical reasons, you might choose to operate using gains that are sub-optimal according to pure Kalman theory because it is to your advantage to do so. Or there might be errors in your model, and correcting less aggressively is one approach to being cautious regarding this possible hazard.
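One such simplification can be sketched directly. When the squared term is zero, the next variance reduces to W - W C^{T} X^{-1} C W, which can be written as (I - K C) W. The example below, using made-up values for W, C, and R, compares that short form with the full expression for the optimal gain and for a deliberately halved gain. The names `full_form` and `short_form` are hypothetical.

```python
# Tiny list-of-rows matrix helpers, to keep the sketch dependency-free.
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]

def full_form(K, W, C, X):
    """W - K C W - W C^T K^T + K X K^T: holds for any gain K."""
    KCW = matmul(matmul(K, C), W)
    return add(sub(sub(W, KCW), transpose(KCW)),
               matmul(matmul(K, X), transpose(K)))

def short_form(K, W, C):
    """(I - K C) W: derived under the assumption that K is the optimal gain."""
    return matmul(sub(identity(len(W)), matmul(K, C)), W)

def gap(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Made-up values for illustration only.
W = [[2.01, 1.0], [1.0, 1.01]]
C = [[1.0, 0.0]]
R = [[0.25]]
X = add(matmul(matmul(C, W), transpose(C)), R)    # 1x1 for a single output

K_opt = [[entry / X[0][0] for entry in row] for row in matmul(W, transpose(C))]
K_half = [[0.5 * entry for entry in row] for row in K_opt]   # sub-optimal on purpose

gap_opt = gap(full_form(K_opt, W, C, X), short_form(K_opt, W, C))
gap_half = gap(full_form(K_half, W, C, X), short_form(K_half, W, C))
print("optimal gain gap:", gap_opt, "halved gain gap:", gap_half)
```

The two forms agree to rounding error for the optimal gain and disagree noticeably for the halved gain, so with a sub-optimal gain the full expression is the one to trust.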

[1] The term Riccati Equations originally came from the study of a family of differential equations by Jacopo Riccati (1676–1754), see the Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4 at a university technical library near you. Since those original studies, the meaning has been expanded to include the very similar "algebraic" quadratic equations, as seen here.

[2] And enjoy your lifetime of adventure.