6.5.15. Testing the PCA model

As mentioned previously, there are three major steps to building a PCA model for engineering applications. We have already considered the first two steps in the preceding sections.

  1. Preprocessing the data

  2. Building the latent variable model

  3. Testing the model, including testing for the number of components to use

The last step, testing, interpreting and using the model, is where one will spend the most time. Preparing the data can be time-consuming the first time it is done, but in general the first two steps take less effort than the third. In this section we investigate how to determine the number of components that should be used in the model and how to use an existing latent variable model. The issue of interpreting a model has already been addressed in the sections on interpreting scores and interpreting loadings.

Using an existing PCA model

In this section we outline the process required to use an existing PCA model. What this means is that you have already calculated the model and validated its usefulness. Now you would like to use the model on a new observation, which we call \(\mathbf{x}'_{\text{new, raw}}\). The method described below can be efficiently applied to many new rows of observations by converting the row vector notation to matrix notation.

  1. Preprocess your vector of new data in the same way as you did when you built the model. For example, if you took the log transform of a certain variable, then you must do so for the corresponding entry in \(\mathbf{x}'_{\text{new, raw}}\). Also apply mean centering and scaling, using the mean centering and scaling information you calculated when you originally built the model.

  2. Call this preprocessed vector \(\mathbf{x}_{\text{new}}\); as a column vector it has size \(K \times 1\), so its transpose \(\mathbf{x}'_{\text{new}}\) is a \(1 \times K\) row vector.

  3. Calculate the location, on the model (hyper)plane, where the new observation would project. In other words, we are calculating the scores:

    \[\mathbf{t}'_\text{new} = \mathbf{x}'_{\text{new}} \mathbf{P}\]

    where \(\mathbf{P}\) is the \(K \times A\) matrix of loadings calculated when building the model, and \(\mathbf{t}'_\text{new}\) is a \(1 \times A\) vector of scores for the new observation.

  4. Calculate the residual distance off the model plane. To do this, we require the vector called \(\widehat{\mathbf{x}}'_\text{new}\), the point on the plane, a \(1 \times K\) vector:

    \[\widehat{\mathbf{x}}'_\text{new} = \mathbf{t}'_\text{new} \mathbf{P}'\]
  5. The residual vector is the difference between the actual observation and its projection onto the plane. The \(K\) individual entries inside this residual vector are also called the contributions to the error.

    \[\mathbf{e}'_\text{new} = \mathbf{x}'_{\text{new}} - \widehat{\mathbf{x}}'_\text{new}\]
  6. The residual distance is then the square root of the sum of squares of the entries in the residual vector:

    \[\text{SPE}_\text{new} = \sqrt{\mathbf{e}'_\text{new} \mathbf{e}_\text{new}}\]

    This is called the squared prediction error, SPE, even though it is more accurately a distance.

  7. Another quantity of interest is Hotelling’s \(T^2\) value for the new observation:

    \[T^2_\text{new} = \sum_{a=1}^{A}{\left(\dfrac{t_{\text{new},a}}{s_a}\right)^2}\]

    where the \(s_a\) values are the standard deviations for each component’s scores, calculated when the model was built.
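The steps above can be sketched in code. This is a minimal sketch using NumPy, with hypothetical numbers standing in for the quantities stored when the model was built (the column means, scaling, loadings \(\mathbf{P}\) and score standard deviations \(s_a\)):

```python
import numpy as np

# Hypothetical model with K = 4 variables and A = 2 components.
# In practice these quantities are stored when the model is built.
K, A = 4, 2
rng = np.random.default_rng(0)

mean_ = np.array([10.0, 5.0, 2.0, 0.5])   # column means of the raw training data
scale_ = np.array([2.0, 1.0, 0.5, 0.1])   # column standard deviations
P = np.linalg.qr(rng.standard_normal((K, A)))[0]  # K x A loadings (orthonormal columns)
s_a = np.array([2.5, 1.2])                # standard deviation of each component's scores

# Steps 1 and 2: preprocess the new raw observation (a 1 x K row vector)
x_new_raw = np.array([11.2, 4.3, 2.4, 0.6])
x_new = (x_new_raw - mean_) / scale_

# Step 3: scores, t'_new = x'_new P   (a 1 x A vector)
t_new = x_new @ P

# Step 4: projection onto the model plane, xhat'_new = t'_new P'
x_hat = t_new @ P.T

# Step 5: residual vector; its K entries are the contributions to the error
e_new = x_new - x_hat

# Step 6: residual distance off the model plane (SPE)
SPE_new = np.sqrt(e_new @ e_new)

# Step 7: Hotelling's T2 for the new observation
T2_new = np.sum((t_new / s_a) ** 2)

print("SPE:", SPE_new, "T2:", T2_new)
```

Note that the residual vector is, by construction, orthogonal to the model plane: projecting \(\mathbf{e}'_\text{new}\) onto the loadings gives zero.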

The above outline is for the case when there are no missing data in the new observation. When there are missing values present in \(\mathbf{x}'_{\text{new}}\), then we require a method to estimate the score vector, \(\mathbf{t}'_\text{new}\), in step 3. Methods for doing this are outlined and compared in the paper by Nelson, Taylor and MacGregor and the paper by Arteaga and Ferrer. After that, the remaining steps are the same, except of course that the missing values do not contribute to the residual vector and the SPE.
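One of the methods compared in those papers, often called projection to the model plane, estimates the scores by a least-squares fit using only the rows of \(\mathbf{P}\) for the observed variables. A minimal sketch, assuming the observation is already preprocessed and uses `np.nan` to mark missing entries (the function name and example numbers are hypothetical):

```python
import numpy as np

def scores_with_missing(x_new, P):
    """Estimate the scores of a preprocessed observation with missing
    entries (marked as np.nan), using the projection-to-the-model-plane
    (least-squares) approach: t = (P_obs' P_obs)^{-1} P_obs' x_obs."""
    obs = ~np.isnan(x_new)          # mask of observed variables
    P_obs = P[obs, :]               # loading rows for the observed variables
    t_new, *_ = np.linalg.lstsq(P_obs, x_new[obs], rcond=None)
    return t_new

# Hypothetical model: K = 4 variables, A = 2 components
P = np.linalg.qr(np.random.default_rng(1).standard_normal((4, 2)))[0]

x_new = np.array([0.5, np.nan, -1.2, 0.3])   # second variable is missing
t_new = scores_with_missing(x_new, P)

# The residuals and SPE then use only the observed entries
obs = ~np.isnan(x_new)
e_obs = x_new[obs] - (t_new @ P.T)[obs]
SPE_new = np.sqrt(e_obs @ e_obs)
```

When no entries are missing, this least-squares estimate reduces to the usual score calculation, \(\mathbf{t}'_\text{new} = \mathbf{x}'_{\text{new}} \mathbf{P}\), since the loading columns are orthonormal.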