6.5.3. Mathematical derivation for PCA

Geometrically, when finding the best-fit line for the swarm of points, our objective was to minimize the error, i.e. to make the residual distances from each point to the best-fit line as small as possible. Minimizing these residual distances is mathematically equivalent to maximizing the variance of the scores, \(\mathbf{t}_a\).

We briefly review here what that means. Let \(\mathbf{x}'_i\) be a row from our data, so \(\mathbf{x}'_i\) is a \(1 \times K\) vector. We defined the score value for this observation as the distance from the origin, along the direction vector \(\mathbf{p}_1\), to the point where the observation projects perpendicularly onto \(\mathbf{p}_1\). This is illustrated below, where the score value for observation \(\mathbf{x}_i\) is \(t_{i,1}\).


Recall from geometry that the cosine of an angle in a right-angled triangle is the ratio of the adjacent side to the hypotenuse. But the cosine of an angle is also used in linear algebra to define the dot-product. Mathematically:

\[\begin{split}\cos \theta = \dfrac{\text{adjacent length}}{\text{hypotenuse}} = \dfrac{t_{i,1}}{\| \mathbf{x}_i\|} \qquad &\text{and also} \qquad \cos \theta = \dfrac{\mathbf{x}'_i \mathbf{p}_1}{\|\mathbf{x}_i\| \|\mathbf{p}_1\|} \\ \dfrac{t_{i,1}}{\| \mathbf{x}_i\|} &= \dfrac{\mathbf{x}'_i \mathbf{p}_1}{\|\mathbf{x}_i\| \|\mathbf{p}_1\|} \\ t_{i,1} &= \mathbf{x}'_i \mathbf{p}_1 \\ (1 \times 1) &= (1 \times K)(K \times 1)\end{split}\]

where \(\| \cdot \|\) indicates the length of the enclosed vector, and the length of the direction vector, \(\mathbf{p}_1\), is 1.0, by definition.
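As a small numerical sketch of this projection (all numbers here are made up for illustration), we can verify that the dot product \(\mathbf{x}'_i \mathbf{p}_1\) gives the same score value as the cosine construction:

```python
import numpy as np

# Hypothetical centered observation x_i (a 1 x K row, K = 3 variables)
# and a direction vector p1, normalized so that ||p1|| = 1 by definition.
x_i = np.array([2.0, -1.0, 0.5])
p1 = np.array([0.6, -0.8, 0.0])
p1 = p1 / np.linalg.norm(p1)

# Score value: the perpendicular projection of x_i onto p1
t_i1 = x_i @ p1

# The same value via the cosine route: t = ||x_i|| * cos(theta)
cos_theta = (x_i @ p1) / (np.linalg.norm(x_i) * np.linalg.norm(p1))
t_check = np.linalg.norm(x_i) * cos_theta

print(t_i1, np.isclose(t_i1, t_check))
```

Both routes give the same score value, as the derivation above requires.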

Note that \(t_{i,1} = \mathbf{x}'_i \mathbf{p}_1\) represents a linear combination:

\[t_{i,1} = x_{i,1} p_{1,1} + x_{i,2} p_{2,1} + \ldots + x_{i,k} p_{k,1} + \ldots + x_{i,K} p_{K,1}\]

So \(t_{i,1}\) is the score value for the \(i^\text{th}\) observation along the first component, and is a linear combination of the \(i^\text{th}\) row of data, \(\mathbf{x}'_i\), and the direction vector \(\mathbf{p}_1\). Notice that there are \(K\) terms in the linear combination: each of the \(K\) variables contributes to the overall score.
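To make the \(K\)-term sum concrete, here is a minimal sketch (again with made-up numbers) showing that writing out the linear combination term by term gives exactly the dot product \(\mathbf{x}'_i \mathbf{p}_1\):

```python
import numpy as np

# Hypothetical row of data and first direction vector, K = 3 variables
x_i = np.array([2.0, -1.0, 0.5])
p1 = np.array([0.6, -0.8, 0.0])

# The explicit K-term linear combination: x_{i,1} p_{1,1} + ... + x_{i,K} p_{K,1}
t_explicit = sum(x_i[k] * p1[k] for k in range(len(x_i)))

# ... which is exactly the dot product x_i' p1
t_dot = x_i @ p1
print(np.isclose(t_explicit, t_dot))
```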

We can calculate the second score value for the \(i^\text{th}\) observation in a similar way:

\[t_{i,2} = x_{i,1} p_{1,2} + x_{i,2} p_{2,2} + \ldots + x_{i,k} p_{k,2} + \ldots + x_{i,K} p_{K,2}\]

And so on, for the third and subsequent components. In matrix form, we can compactly write for the \(i^\text{th}\) observation:

\[\begin{split}\mathbf{t}'_i &= \mathbf{x}'_i \mathbf{P} \\ (1 \times A) &= (1 \times K)(K \times A)\end{split}\]

which calculates all \(A\) score values for that observation in one go. This is exactly what we derived earlier in the example with the 4 thermometers in the room.
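A brief sketch of this per-observation calculation, using a hypothetical \(3 \times 2\) loadings matrix \(\mathbf{P}\) whose columns are orthonormal direction vectors (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical loadings matrix P (K x A), with K = 3 variables and
# A = 2 components; each column is a unit-length direction vector.
P = np.array([[0.6, 0.0],
              [-0.8, 0.0],
              [0.0, 1.0]])

x_i = np.array([2.0, -1.0, 0.5])   # one observation, a 1 x K row

t_i = x_i @ P                      # all A score values in one go
print(t_i.shape)                   # (A,) -> here (2,)
```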

Finally, for an entire matrix of data, \(\mathbf{X}\), we can calculate all scores, for all observations:

\[\begin{split}\mathbf{T} &= \mathbf{X} \mathbf{P} \\ (N \times A) &= (N \times K)(K \times A)\end{split} \tag{1}\]
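The full-matrix calculation can be sketched numerically as well. The random data below and the choice of obtaining \(\mathbf{P}\) from the singular value decomposition of \(\mathbf{X}\) are assumptions for this example, not part of the derivation above; the point is that a single matrix product gives all \(N \times A\) scores, and that the score variance decreases component by component, consistent with the variance-maximization objective:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))       # made-up data: N = 50, K = 4
X = X - X.mean(axis=0)             # column-center the data, as PCA assumes

# One way to obtain the loadings: the right singular vectors of X
# (columns of V) are the directions that maximize the score variance.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
A = 2
P = Vt[:A].T                       # K x A loadings matrix

T = X @ P                          # N x A scores, all observations at once
print(T.shape)                     # (50, 2)

# Score variance is largest for the first component, and decreases after
var = T.var(axis=0, ddof=1)
print(var[0] >= var[1])
```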