6.5.3. Mathematical derivation for PCA

Geometrically, when finding the best-fit line for the swarm of points, our objective was to minimize the error, i.e., to make the residual distances from each point to the best-fit line as small as possible. Minimizing these residuals is mathematically equivalent to maximizing the variance of the scores, \(\mathbf{t}_a\).

We briefly review here what that means. Let \(\mathbf{x}'_i\) be a row from our data, so \(\mathbf{x}'_i\) is a \(1 \times K\) vector. We defined the score value for this observation as the distance from the origin, measured along the direction vector \(\mathbf{p}_1\), to the point where \(\mathbf{x}_i\) projects perpendicularly onto \(\mathbf{p}_1\). This is illustrated below, where the score value for observation \(\mathbf{x}_i\) is \(t_{i,1}\).

[Figure: component-along-a-vector — the perpendicular projection of observation \(\mathbf{x}_i\) onto the direction vector \(\mathbf{p}_1\), giving the score value \(t_{i,1}\)]

Recall from geometry that the cosine of an angle in a right-angled triangle is the ratio of the adjacent side to the hypotenuse. But the cosine of an angle is also used in linear algebra to define the dot-product. Mathematically:

\[\begin{split}\cos \theta = \dfrac{\text{adjacent length}}{\text{hypotenuse}} = \dfrac{t_{i,1}}{\| \mathbf{x}_i\|} \qquad &\text{and also} \qquad \cos \theta = \dfrac{\mathbf{x}'_i \mathbf{p}_1}{\|\mathbf{x}_i\| \|\mathbf{p}_1\|} \\ \dfrac{t_{i,1}}{\| \mathbf{x}_i\|} &= \dfrac{\mathbf{x}'_i \mathbf{p}_1}{\|\mathbf{x}_i\| \|\mathbf{p}_1\|} \\ t_{i,1} &= \mathbf{x}'_i \mathbf{p}_1 \\ (1 \times 1) &= (1 \times K)(K \times 1)\end{split}\]

where \(\| \cdot \|\) indicates the length of the enclosed vector, and the length of the direction vector \(\mathbf{p}_1\) is 1.0, by definition.
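As a small numerical illustration of this projection (a hedged sketch in NumPy, using made-up values for \(\mathbf{x}'_i\) and \(\mathbf{p}_1\) that do not come from any data set in this book), the score is simply the dot product of the row with the unit-length direction vector:

```python
import numpy as np

# Hypothetical row of data, x'_i, with K = 3 variables
x_i = np.array([2.0, -1.0, 0.5])

# A candidate direction vector; scale it so that ||p_1|| = 1.0
p_1 = np.array([0.8, 0.6, 0.0])
p_1 = p_1 / np.linalg.norm(p_1)

# Score for observation i along the first component: t_{i,1} = x'_i p_1
t_i1 = x_i @ p_1
print(t_i1)   # a single number, since (1 x K)(K x 1) = (1 x 1)
```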

Note that \(t_{i,1} = \mathbf{x}'_i \mathbf{p}_1\) represents a linear combination:

\[t_{i,1} = x_{i,1} p_{1,1} + x_{i,2} p_{2,1} + \ldots + x_{i,k} p_{k,1} + \ldots + x_{i,K} p_{K,1}\]

So \(t_{i,1}\) is the score value for the \(i^\text{th}\) observation along the first component, and it is a linear combination of the \(i^\text{th}\) row of data, \(\mathbf{x}'_i\), and the direction vector, \(\mathbf{p}_1\). Notice that there are \(K\) terms in the linear combination: each of the \(K\) variables contributes to the overall score.
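Using the same made-up numbers as in the sketch above, the explicit \(K\)-term sum gives exactly the same value as the dot product:

```python
import numpy as np

# Same hypothetical numbers as before: a 1 x K row and a unit-length direction
x_i = np.array([2.0, -1.0, 0.5])
p_1 = np.array([0.8, 0.6, 0.0])

# Written out term by term: x_{i,1} p_{1,1} + x_{i,2} p_{2,1} + ... + x_{i,K} p_{K,1}
t_i1_sum = sum(x_i[k] * p_1[k] for k in range(x_i.size))

print(np.isclose(t_i1_sum, x_i @ p_1))   # True: the sum and the dot product agree
```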

We can calculate the second score value for the \(i^\text{th}\) observation in a similar way:

\[t_{i,2} = x_{i,1} p_{1,2} + x_{i,2} p_{2,2} + \ldots + x_{i,k} p_{k,2} + \ldots + x_{i,K} p_{K,2}\]

And so on, for the third and subsequent components. In matrix form, we can write compactly for the \(i^\text{th}\) observation that:

\[\begin{split}\mathbf{t}'_i &= \mathbf{x}'_i \mathbf{P} \\ (1 \times A) &= (1 \times K)(K \times A)\end{split}\]

which calculates all \(A\) score values for that observation in one go. This is exactly what we derived earlier in the example with the 4 thermometers in the room.
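A minimal sketch of this step, assuming a small, made-up \(K \times A\) loadings matrix \(\mathbf{P}\) whose \(A = 2\) columns each have unit length:

```python
import numpy as np

# Hypothetical row of data (1 x K, with K = 3 variables)
x_i = np.array([2.0, -1.0, 0.5])

# Hypothetical loadings matrix P (K x A, with A = 2); each column has length 1.0
P = np.array([[0.8, 0.0],
              [0.6, 0.0],
              [0.0, 1.0]])

# All A score values for this observation in one go: t'_i = x'_i P
t_i = x_i @ P
print(t_i)   # a 1 x A row of scores: [t_{i,1}, t_{i,2}]
```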

Finally, for an entire matrix of data, \(\mathbf{X}\), we can calculate all scores, for all observations:

\[\begin{split}\mathbf{T} &= \mathbf{X} \mathbf{P} \\ (N \times A) &= (N \times K)(K \times A)\end{split} \tag{1}\]
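As a final hedged sketch tying the pieces together: for a small random data matrix, the loadings can be taken as the first \(A\) right singular vectors of the mean-centered \(\mathbf{X}\) (one standard way of computing PCA loadings, not necessarily how any particular software package does it), after which all scores follow from equation (1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data matrix with N = 10 observations and K = 4 variables
X = rng.normal(size=(10, 4))
X = X - X.mean(axis=0)              # mean-center each column first

# Loadings: first A right singular vectors of X (each column has unit length)
A = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:A].T                        # K x A

# Scores for every observation at once: T = X P  (N x A)
T = X @ P
print(T.shape)                      # (10, 2)
print(T.var(axis=0, ddof=1))        # variance of each score column, largest first
```

The printed score variances come out in decreasing order, which is the "maximize the variance of the scores" property mentioned at the start of this section.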