Fading Memory in Least Squares Problems
Developing Models for Kalman Filters
In Linear Least Squares updating methods, each added constraint contributes new information to the previous Normal Equations matrix. The question is: where does the information that was previously there go?
Information accumulation in Normal Equations
If you have been following this series closely, you know the answer already. It goes nowhere. It is all there. Each sequential update adds to the Normal Equations matrix, in the sense of making the data set larger, but nothing is removed.
Let's think about the consequences of this. Look again at the basic sequential updating formula for the Normal Equations system.
$$\left({m_{n+1}}^{T}\,m_{n+1}+{M_{n}}^{T}\,M_{n}\right)\cdot\left(p_{n}+\Delta p\right)\;=\;{m_{n+1}}^{T}\cdot y_{n+1}+{M_{n}}^{T}\cdot Y_{n}$$
You can see that on the left side, the Normal Equations matrix starts from what it was previously, and something new is added. That something new has the form m^{T}·m, a rank-one matrix. The rank-one matrix has squared terms on its main diagonal; thus, the main diagonal terms of the M^{T}M matrix can grow but never decrease. Magnitudes on the right side grow similarly, as each update adds a new m^{T}·y term to the accumulated M^{T}·Y product.
Suppose that you have done this for 10 million update steps. Roughly speaking, you can expect the magnitude of the Normal Equations matrix to be about 10 million times larger than the magnitude of the next new rank-one update to be included. At some point, the effect of each new update becomes very much like the proverbial drop in the bucket.
Equivalent contribution of past updates
This is a little informal, but you can think of each new constraint added to the Normal Equations system as contributing one unit of influence to the outcome of the Least Squares fitting process. After 10 million constraints, the Normal Equations matrix should embody roughly 10 million units of influence.
Suppose that a row-scaling operation, as discussed in the previous installment, is selected so that every diagonal term of the weighting matrix has the value 1/1000. An equivalent effect is achieved by multiplying the entire matrix by the scalar constant 1/1000. We know that row scaling can shift the "relative importance" of the constraint conditions,^{[1]} but since the "importance" of every constraint is reduced equally, the balance is the same, and the best-fit solution does not change.
But now consider what happens when a new update is added to the scaled constraint system. The prior updates have that factor 1/1000 lower combined influence, but the new update does not. Consequently, the new update has 1000 times more impact on the calculated result than it would have without scaling the previous matrix terms.
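A small numerical sketch can make both effects concrete. The data below are entirely invented for illustration: uniform scaling of the Normal Equations system leaves the solution unchanged, but a new, unscaled constraint then moves the solution far more than it would against the unscaled history.

```python
# Demonstration: uniformly scaling the Normal Equations changes nothing,
# but magnifies the impact of the next (unscaled) rank-one update.
# All data here are invented for the example.
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(200, 3))                    # 200 accumulated constraint rows
Y = M @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

N = M.T @ M                                      # Normal Equations matrix
z = M.T @ Y                                      # right-hand side

p_plain = np.linalg.solve(N, z)
p_scaled = np.linalg.solve(N / 1000, z / 1000)   # uniform 1/1000 scaling
print(np.allclose(p_plain, p_scaled))            # True: the solution is unchanged

# Add one new, deliberately inconsistent constraint to each system.
m_new = rng.normal(size=3)
y_new = 50.0                                     # far from the old model's prediction
p1 = np.linalg.solve(N + np.outer(m_new, m_new), z + m_new * y_new)
p2 = np.linalg.solve(N / 1000 + np.outer(m_new, m_new), z / 1000 + m_new * y_new)

# Against the down-weighted history, the single new constraint dominates.
print(np.linalg.norm(p1 - p_plain), np.linalg.norm(p2 - p_plain))
```

The second printed distance is orders of magnitude larger than the first: the 1/1000 scaling has quietly handed the new constraint roughly 1000 times the relative influence.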
This kind of behavior, treating new updates as more relevant than past ones, is exactly what is needed for the special case of a model that must adapt its behavior to current conditions.
Fading memory updates
We don't have to wait for a large number of updates to apply the scaling adjustments in a large lump, as in the previous example. We can apply the scaling incrementally along the way, allowing the effects to compound over time.
Suppose for example that the goal is to maintain normal equations in a form that looks like the combined influence of approximately 10000 updates. Let's try the strategy of applying a scaling factor of 9999/10000 to every term in the previous Normal Equations matrix and to every term of the previous right-hand side vector before applying the next update.
Consider the cumulative effects.
- The first update has seen no scaling, so the Normal Equations embody one full "unit of influence."
- On the second update, the influence of that first term is reduced by the "fading" factor w = 9999/10000, and then a new constraint is added in. So the effective number of "units of influence" contributing to the Normal Equations system becomes 1.0 + 9999/10000 after two updates.
- After the third update, the previous two contributions each receive another fading factor of 9999/10000, so the cumulative influence for all three constraints is 1.0 + (9999/10000) + (9999/10000)^{2}.
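The bookkeeping above is easy to verify in a few lines. This loop applies the fading factor to the accumulated influence before each new constraint contributes its one unit, reproducing the partial sums 1, 1 + w, 1 + w^{2} of the geometric series:

```python
# Cumulative "units of influence" under incremental fading,
# with fading factor w = 9999/10000.
w = 9999 / 10000
influence = 0.0
history = []
for k in range(1, 4):
    influence = w * influence + 1.0   # fade the past, then add one new unit
    history.append(influence)
print(history)   # the partial sums 1, 1 + w, 1 + w + w**2
```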
Continue this for a very large number of updates, for all practical purposes "approaching an infinite number." We can get a really good idea of what the combined influence is going to be by studying a geometric series:
$$\text{for } 0 < w < 1: \qquad \lim_{k \to \infty}\left(1 + w + w^{2} + \cdots + w^{k}\right) \;=\; \frac{1}{1-w}$$
Applying this to the case of w = 9999/10000, we get the encouraging result that the "units of influence" number settles to the equivalent of 10000 updates.
With each update, the effect of a past constraint gets smaller and smaller, in a manner that decays like an exponential function. After 10000 updates with weighting factor w = 9999/10000, the influence of the initial constraint is reduced to about 37% of what it was originally.

(9999/10000)^{10000} = 0.36786

The relationship to the natural base of the logarithms is no accident.

e^{-1} = 0.36788
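These limiting values are quick to check numerically:

```python
# Numerical check of the geometric-series limit and the exponential
# connection for the fading factor w = 9999/10000.
import math

w = 9999 / 10000
print(1 / (1 - w))       # limiting "units of influence": about 10000
print(w ** 10_000)       # surviving influence of the first constraint, near e**-1
print(math.exp(-1))      # the exponential reference value
print(w ** 50_000)       # near e**-5, effectively negligible
```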
Based on the behavior of exponential functions, we can project that after 50000 updates with a weighting factor w = 9999/10000, the influence of the first constraint has attenuated to roughly e^{-5} ≈ 0.0067, which is negligible.
The folks in the statistical modeling field call this kind of process exponential smoothing. The folks in adaptive systems call it fading memory. Since exponential decay is what is expected from the response of a theoretical linear first-order dynamic system, general systems theorists call this a first-order lag.
Adaptive Least-Squares Solution with Fading Memory
Because the fading memory formulation is a relatively important one, let's formalize the details. These results all derive from substituting the weighted terms W·M and W·y into the basic Linear Least Squares problem and tracking the corresponding results in the Normal Equations and in the updating formulas. The following notations are used:
- K ↔ the number of equivalent update contributions to preserve
- w ↔ the derived "forgetting factor weight" (K-1)/K
- W_{n} ↔ the n × n square weighting matrix, with all zero terms off-diagonal, and the terms 1, w, w^{2}, w^{3}, etc., on the diagonal, counting backward from the most recent constraint row (so the newest constraint keeps full weight). For an unweighted system, W ↔ I, an identity matrix.
- n ↔ the number of constraint updates included
- m_{n} ↔ the row vector of new independent constraint equation data introduced at step n
- M_{n} ↔ the non-square matrix of constraint equations for the Least Squares solution to attempt to satisfy, with one column for each parameter to be determined and n rows, one for each constraint
- N_{n} ↔ the square matrix M^{T}W^{2}M constructed from the Least Squares matrix M after inclusion of n constraints and application of the weighting matrix W
- y_{n} ↔ the new dependent (observed output) value introduced at step n
- Y_{n} ↔ the vector of dependent values as accumulated through step n
- z_{n} ↔ the vector of weighted dependent values as accumulated through step n
- p ↔ the column vector of parameter values to be calculated
The following equations summarize the Linear Least Squares problem and related updating schemes after applying "fading memory" weighting.
- Weighted Least Squares best fit problem.
$$W \cdot M_{n} \cdot p \;=\; W \cdot Y_{n}$$
- Weighted Normal Equations.
$${M_{n}}^{T} \cdot W^{2} \cdot M_{n} \cdot p \;=\; {M_{n}}^{T} \cdot W^{2} \cdot Y_{n}$$ $$N_{n} \cdot p \;=\; z_{n}$$
- Weighted Normal Equations solution.
$$p \;=\; {\left(N_{n}\right)}^{-1} \cdot z_{n}$$
- Weighted Normal Equations updating formulas.
$$N_{n+1} \;=\; {M_{n+1}}^{T} \cdot W^{2} \cdot M_{n+1} \;=\; {m_{n+1}}^{T}\,m_{n+1} \;+\; w^{2} \cdot N_{n}$$
$$z_{n+1} \;=\; {M_{n+1}}^{T} \cdot W^{2} \cdot Y_{n+1} \;=\; {m_{n+1}}^{T}\,y_{n+1} \;+\; w^{2} \cdot z_{n}$$
Note that in the updating formulas the factor applied to the previous N_{n} and z_{n} is the scalar w^{2}: the newest constraint row carries weight 1, and every older row picks up one more factor of w, which appears squared in the Normal Equations.
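The updating formulas translate directly into code. Here is a minimal sketch of a fading-memory least-squares tracker; the two-parameter line model, the drifting "true" slope, and the choice K = 100 are all invented for illustration, and only the update rule itself comes from the formulas above.

```python
# Fading-memory least-squares sketch, using the updating formulas
#   N <- m^T m + w^2 N    and    z <- m^T y + w^2 z.
# The drifting example model and K = 100 are invented for illustration.
import numpy as np

K = 100                     # equivalent number of updates to preserve
w = (K - 1) / K             # forgetting factor weight
fade = w ** 2               # scalar applied to N and z at each update

N = np.zeros((2, 2))        # Normal Equations matrix M^T W^2 M
z = np.zeros(2)             # weighted right-hand side vector

rng = np.random.default_rng(7)
for n in range(2000):
    slope = 1.0 + 0.001 * n                 # slowly drifting true parameter
    x = rng.uniform(-1.0, 1.0)
    m = np.array([1.0, x])                  # new constraint row [intercept, slope]
    y = 0.5 + slope * x + 0.05 * rng.normal()

    N = np.outer(m, m) + fade * N           # N_{n+1} = m^T m + w^2 N_n
    z = m * y + fade * z                    # z_{n+1} = m^T y + w^2 z_n

p = np.linalg.solve(N, z)
print(p)    # tracks the recent model, slope near 3.0, not the long-run average 2.0
```

With a memory equivalent of roughly 50 effective updates (1/(1 - w^{2})), the fitted slope follows the drifting parameter instead of averaging over the whole history, which is exactly the adaptive behavior the fading-memory weighting was designed to produce.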
Weighted updating is important if model parameter values are expected to "drift" gradually over time. On the other hand, it is a bad idea to "forget" older information if it is just as relevant as newer information. In fact, if you have a long enough history of relevant older information, maybe further updating is pointless.
This consideration will be revisited in the context of Kalman Filtering.
Footnotes:
[1] See the previous article in this series.