Fading memory

In Linear Least Squares updating methods, each added constraint contributes new information to the previous Normal Equations matrix. The question is: where does the information that was previously there go?

Information accumulation in Normal Equations

If you have been following this series closely, you know the answer already. It goes nowhere. It is all there. Each sequential update adds to the Normal Equations matrix, in the sense of making the data set larger, but nothing is removed.

Let's think about the consequences of this. Look again at the basic sequential updating formula for the Normal Equations system.

({m_{n + 1}}^{T} m_{n + 1} + {M_{n}}^{T} M_{n}) \cdot (p_{n} + Δ p) = ({m_{n + 1}}^{T} \cdot y_{n + 1} + {M_{n}}^{T} \cdot Y_{n})

You can see that on the left side, the Normal Equations matrix starts from what it was previously, and something new is added. That something new has the form m^T·m, a rank-one matrix. The rank-one matrix will have squared terms on its main diagonal; thus, the main diagonal terms in the M^TM matrix can grow but not decrease. Magnitudes on the right side grow in proportion to the M^T matrix.

Suppose that you have done this for 10 million update steps. Roughly speaking, you can expect the magnitude of the Normal Equation matrix to be about 10 million time larger than the magnitude of the next new rank-one update to be included. At some point, the effect of each new update is very much like the proverbial drop-in-the-bucket.

Equivalent contribution of past updates

This is a little informal, but you can think of each new constraint added to the Normal Equations system as having one unit of influence on the outcome of the Least Squares fitting process. After 10 million constraints, the Normal Equation matrix should embody roughly 10 million units of influence.

Suppose that a row-scaling operation, as discussed in the previous installment, is selected so that the every diagonal term of the weighting matrix has the value 1/1000. An equivalent effect is achieved by multiplying the entire matrix times scalar constant 1/1000. We know that row scaling can shift the "relative importance" of the constraint conditions,^[1] but since the "importance" of every constraint is reduced equally, the balance is the same, so the best-fit solution does not change.

But now consider what happens when a new update is added to the scaled constraint system. The prior updates have that factor 1/1000 lower combined influence, but the new update does not. Consequently, the new update has 1000 times more impact on the calculated result than it would have without scaling the previous matrix terms.

This kind of behavior — treating new updates as more relevant than the past ones — is needed for the special case of a model that is able to adapt is behavior to current conditions.

Fading memory updates

We don't have to wait for a large number of updates to apply the scaling adjustments in a large lump, as in the previous example. We can apply the scaling incrementally along the way, allowing the effects to compound over time.

Suppose for example that the goal is to maintain normal equations in a form that looks like the combined influence of approximately 10000 updates. Let's try the strategy of applying a scaling factor of 9999/10000 to every term in the previous Normal Equations matrix and to every term of the previous right-hand side vector before applying the next update.

Consider the cumulative effects.

The first update has seen no scaling, and the Normal Equations embody one full "unit of influence."
On the next update, the influence of that term will be reduced by the "fading" factor of w = 9999/10000. But then a new constraint is added in. So the effective number of "units of influence" contributing to the Normal Equations system becomes 1.0 + 9999/1000 after two updates.
After the next update, the previous two updates both receive another "fading factor" 9999/10000, so the cumulative influence after updates for all three constraints is 1.00 + (9999/10000) + (9999/10000)².

Continue this for a very large number of updates — for all practical purpose, "approaching an infinite number." We can get a really good idea of what the combined influence is going to be from studying a geometric series:

   for  0 < w < 1.0,    lim _{k → ∞}  (1 + w + w² + ...)  →   1 / (1-w)

Applying this to the case of w = 9999/10000, we get the encouraging result that the "units of influence" number settles to the equivalent of 10000 updates.

With each update, the effect of a past constraint gets smaller and smaller, in a manner that decays like an exponential function. After 10000 updates with weighting factor w = 9999/10000, the influence of the initial constraint reduces to about 1/3 of what it was originally.

   (9999/10000)¹⁰⁰⁰⁰ = 0.36786

The relationship to the natural base of the logarithms is no accident.

   e^-1 = 0.36788

Based on the behavior of exponential functions, we can project that after 50000 updates with a weighting factor w = 9999/10000, the influence of the first constraint has attenuated to the point that it is negligible.

The folks in the statistical modeling field call this kind of process exponential filtering. The folks in adaptive systems call it fading memory. Since exponential decays are what is expected from a theoretical linear first-order dynamic system response, the general systems theorists call this a first-order lag.

Adaptive Least-Squares Solution with Fading Memory

Because the fading memory formulation is a relatively important one, let's formalize the details. These results all derive from substituting weighted equation terms W M and W y into the basic Linear Least Squares problem and tracking the corresponding results in the Normal Equations and in the updating formulas. The following notations are used:

K ↔ number of equivalent update contributions to preserve
w ↔ the derived "forgetting factor weight" (K-1)/K
W_n ↔ n x n square weighting factor matrix consisting of all zero terms off-diagonal, and the terms 1, w, w², w³, etc., in the diagonal terms for successive rows. For an unweighted system, W ↔ I, an identity matrix.
n ↔ the number of constraint updates included
m_n ↔ the row vector of new independent constraint equation data introduced at step n.
M_n ↔ the non-square matrix of constraint equations for the Least Squares solution to attempt to satisfy, with one column for each parameter to be determined, and n rows, one for each constraint.
N_n ↔ square matrix M^TM constructed from the Least Squares matrix M after inclusion of n constraints and applying weighting matrix W.
y_n ↔ the new dependent (observed output) value introduced at step n.
Y_n ↔ the vector of accumulated dependent values as accumulated through step n.
z_n ↔ the vector of weighted dependent values as accumulated through step n.
p ↔ the column vector of parameter values to be calculated.

The following equations summarize the Linear Least Squares problem and related updating schemes after applying "fading memory" weighting.

Weighted Least Squares best fit problem.
$W \cdot M_{n} \cdot p = W \cdot y_{n}$
Weighted Normal Equations.
${M_{n}}^{T} \cdot W^{2} \cdot M_{n} \cdot p = {M_{n}}^{T} \cdot W^{2} \cdot y_{n}$ $N_{n} \cdot p = z_{n}$
Weighted Normal Equations solution.
$p = {(N_{n})}^{- 1} \cdot z_{n}$
Weighted Normal Equations updating formulas.
$N_{n + 1} = {M_{n + 1}}^{T} \cdot W^{2} \cdot M_{n + 1} = {m_{n + 1}}^{T} m_{n + 1} + W^{2} \cdot N_{n}$

z_{n + 1} = {M_{n + 1}}^{T} \cdot W^{2} \cdot y_{n + 1} = {m_{n + 1}}^{T} y_{n + 1} + W^{2} \cdot z_{n}

Weighted updating is important if model parameter values are expected to "drift" gradually over time. On the other hand, it is a bad idea to "forget" older information if it is just as relevant as newer information. In fact, if you have a long enough history of relevant older information, maybe further updating is pointless.

This consideration will be revisited in the context of Kalman Filtering.

Footnotes:

[1] See the previous article in this series.

Fading Memory in Least Squares Problems

Information accumulation in Normal Equations

Equivalent contribution of past updates

Fading memory updates

Adaptive Least-Squares Solution with Fading Memory