# 6.5.21. PCA Exercises

Each exercise introduces a new topic or highlights some interesting aspect of PCA.

## 6.5.21.1. Room temperature data

• $$N = 144$$

• $$K = 4$$ + 1 column containing the date and time at which the 4 temperatures were recorded

• Description: Temperature measurements from 4 corners of a room

Objectives

Before even fitting the model:

1. How many latent variables do you expect to use in this model? Why?

2. What do you expect the first loading vector to look like?

Now build a PCA model using any software package.

1. How much variation was explained by the first and second latent variables? Is this result surprising, given the earlier description of the dataset?

2. Plot a time series plot (also called a line plot) of $$t_1$$. Did this match your expectations? Why/why not?

3. Plot a bar plot of the loadings for the second component. Given this bar plot, what are the characteristics of an observation with a large, positive value of $$t_2$$; and a large, negative $$t_2$$ value?

4. Now plot the time series plot for $$t_2$$. Again, does this plot match your expectations?
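Any PCA package will do for these steps. As a minimal sketch, assuming Python with scikit-learn and synthetic stand-in data (the actual temperature recordings are not reproduced here), the scores, loadings, and variance explained needed above can be obtained as follows:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the room-temperature data (the real data set is
# not included here): 144 observations of 4 strongly correlated corners.
rng = np.random.default_rng(0)
room = rng.normal(20.0, 2.0, size=(144, 1))        # overall room temperature
X = room + rng.normal(0.0, 0.3, size=(144, 4))     # 4 corners track the room

Xc = X - X.mean(axis=0)                            # mean-centre the columns
pca = PCA(n_components=2).fit(Xc)

T = pca.transform(Xc)    # score columns t1, t2 (one row per time point)
P = pca.components_.T    # loading columns p1, p2 (one row per corner)
print("Variance explained:", np.round(pca.explained_variance_ratio_, 3))
```

With one driving force (the room temperature) behind all four columns, the first component should dominate and its loading vector should weight all four corners roughly equally, with the same sign.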

Now use the concept of brushing to interrogate and learn from the model.

1. Plot a score plot of $$t_1$$ against $$t_2$$.

2. Also plot the time series plot of the raw data.

3. Select a cluster of interest in the score plot and see the brushed values in the raw data. Are these the values you expected to be highlighted?

4. Next plot the Hotelling’s $$T^2$$ line plot, as described earlier. Does the 95% limit in the Hotelling’s $$T^2$$ line plot correspond to the 95% limit in the score plot?

5. Also plot the SPE line plot. Brush the outlier in the SPE plot and find its location in the score plot.

6. Why does this point have a large SPE value?

7. Describe how a 3-D scatter plot would look with $$t_1$$ and $$t_2$$ as the $$(x,y)$$ axes, and SPE as the $$z$$-axis. What have we learned?
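If your software does not draw these limits for you, both diagnostics can be computed directly from a fitted model. A sketch in Python with scikit-learn and SciPy, on synthetic correlated data; the F-distribution form of the 95% limit for Hotelling's $$T^2$$ used below is one common choice, not the only one:

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(144, 4)) @ rng.normal(size=(4, 4))   # correlated columns
Xc = X - X.mean(axis=0)

A, N = 2, Xc.shape[0]
pca = PCA(n_components=A).fit(Xc)
T = pca.transform(Xc)

# Hotelling's T2: squared scores, each scaled by that component's variance.
T2 = np.sum(T**2 / pca.explained_variance_, axis=1)

# SPE: squared distance from each observation to its projection on the plane.
E = Xc - pca.inverse_transform(T)
SPE = np.sum(E**2, axis=1)

# One common form of the 95% limit for T2, based on the F-distribution.
T2_lim = A * (N - 1) * (N + 1) / (N * (N - A)) * stats.f.ppf(0.95, A, N - A)
print("T2 95% limit:", round(T2_lim, 2))
```

Note that $$T^2$$ is computed entirely on the model plane, while SPE is computed entirely off it, which is why the two diagnostics flag different kinds of unusual observations.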

What we learned:

• A latent variable is often a true driving force in the system under investigation.

• How to interpret a loadings vector and its corresponding score vector.

• Brushing multivariate and raw data plots to confirm our understanding of the model.

• Learned about Hotelling’s $$T^2$$, whether we plot it as a line plot, or as an ellipse on a scatter plot.

• Confirmed that the scores lie on the model plane, while the SPE is the distance from the model plane to the actual observation.

## 6.5.21.2. Food texture data set

1. Fit a PCA model.

2. Report the $$R^2$$ values for the overall model and the $$R^2$$ values for each variable, on a per-component basis for components 1, 2, and 3. Comment on what each latent variable is explaining and by how much.

3. Plot the loadings plot as a bar plot for $$p_1$$. Does this match the values given earlier? What kind of pastry would have a large positive $$t_1$$ value?

4. What feature(s) of the raw data does the second component explain? Plot sequence-ordered plots of the raw data to confirm your answer.

5. Look for any observations that are unusual. Are there any unusual scores? SPE values? Plot contribution plots for the unusual observations and interpret them.
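Not every package reports per-variable $$R^2$$ values directly. As a sketch, assuming Python with scikit-learn and a synthetic stand-in for the food-texture table, they can be computed from the residuals left after each component:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the food-texture data: 50 rows, 5 variables,
# with an underlying rank-2 structure plus a little noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 5)) \
    + 0.1 * rng.normal(size=(50, 5))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale the columns

pca = PCA(n_components=3).fit(Xs)
ss_tot = (Xs**2).sum(axis=0)          # total sum of squares per variable

# Cumulative R2 per variable after each component.
for a in (1, 2, 3):
    T = pca.transform(Xs)[:, :a]
    Xhat = T @ pca.components_[:a, :]
    r2_var = 1 - ((Xs - Xhat) ** 2).sum(axis=0) / ss_tot
    print(f"A={a}: R2 per variable =", np.round(r2_var, 2))
```

A variable whose $$R^2$$ jumps only when a given component is added is a variable that component is largely responsible for explaining.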

## 6.5.21.3. Food consumption data set

This data set has become a classic data set when learning about multivariate data analysis. It consists of

• $$N=16$$ countries in the European area

• $$K=20$$ food items

• Missing data: yes

• Description: The data table lists for each country the relative consumption of certain food items, such as tea, jam, coffee, yoghurt, and others.

1. Fit a PCA model to the data using 2 components.

2. Plot a loadings plot of $$p_1$$ against $$p_2$$. Which are the important variables in the first component? And the second component?

3. Since each column represents food consumption, how would you interpret a country with a high (positive or negative) $$t_1$$ value? Find countries that meet this criterion. Verify that this country does indeed have this interpretation (hint: use a contribution plot and examine the raw data in the table).

4. Now plot SPE after 2 components (don’t plot the default SPE, make sure it is the SPE only after two components). Use a contribution plot to interpret any interesting outliers.

5. Now add a third component and plot SPE after 3 components. What has happened to the observations you identified in the previous question? Investigate the loadings plot for the third component now (as a bar plot) and see which variables are heavily loaded in the 3rd component.

6. Also plot the $$R^2$$ values for each variable, after two components, and after 3 components. Which variables are modelled by the 3rd component? Does this match with your interpretation of the loadings bar plot in the previous question?

7. Now plot a score plot of the 3rd component against the 1st component. Generate a contribution plot for the interesting observation(s) you selected in part 4. Does this match up with your interpretation of what the 3rd component is modelling?
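The phrase "SPE after 2 components" simply means the residuals are taken after truncating the model at two components, and the per-variable contributions to SPE are the squared residuals themselves. A sketch, assuming Python with scikit-learn and a synthetic stand-in for the food-consumption table (without missing values; packages that handle missing data use different fitting algorithms):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 16 rows (countries), 20 columns (food items).
rng = np.random.default_rng(3)
X = rng.normal(size=(16, 4)) @ rng.normal(size=(4, 20))
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale

pca = PCA(n_components=3).fit(Xs)

def spe_after(A):
    """SPE after A components, plus its per-variable contributions."""
    T = pca.transform(Xs)[:, :A]
    E = Xs - T @ pca.components_[:A, :]   # residuals after A components
    return (E**2).sum(axis=1), E**2

spe2, contrib2 = spe_after(2)
spe3, _ = spe_after(3)
# Adding a component can only shrink (or keep) each observation's SPE.
print(bool(np.all(spe3 <= spe2 + 1e-12)))
```

This is why an observation that stands out in the SPE plot after 2 components can drop back into the bulk after 3: the third component now models the feature that previously sat in its residual.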

What we learned:

• Further practice of our skills in interpreting score plots and loading plots.

• How to relate contribution plots to the loadings and the $$R^2$$ values for a particular component.

## 6.5.21.4. Silicon wafer thickness

• $$N=184$$

• $$K=9$$

• Description: These are nine thickness measurements recorded from various batches of silicon wafers. One wafer is removed from each batch and the thickness of the wafer is measured at the nine locations, as shown in the illustration.

1. Build a PCA model on all the data.

2. Plot the scores for the first two components. What do you notice? Investigate the outliers, and the raw data for each of these unusual observations. What do you conclude about those observations?

3. Exclude the unusual observations and refit the model.

4. Now plot the scores plot again; do things look better? Record the $$R^2$$ and $$Q^2$$ values (from cross-validation) for the first three components. Are the $$R^2$$ and $$Q^2$$ values close to each other; what does this mean?

5. Plot a loadings plot for the first component. What is your interpretation of $$p_1$$? Given the $$R^2$$ and $$Q^2$$ values for this first component (previous question), what is your interpretation about the variability in this process?

6. And the interpretation of $$p_2$$? From a quality control perspective, if you could remove the variability due to $$p_2$$, how much of the variability would you be removing from the process?

7. Also plot the corresponding time series plot for $$t_1$$. What do you notice in the sequence of score values?

8. Repeat the above question for the second component.

9. Finally, plot both the $$t_1$$ and $$t_2$$ series overlaid on the same plot, in time-order, to see the smaller variance that $$t_2$$ explains.
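The identify-exclude-refit workflow in steps 2 and 3 can be sketched numerically. Assuming Python with scikit-learn and SciPy, with the wafer data replaced by synthetic data containing a few planted outliers; the F-distribution cutoff for $$T^2$$ used here is one common choice of exclusion rule:

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

# Synthetic stand-in: 184 wafers, 9 thickness measurements each.
rng = np.random.default_rng(4)
X = rng.normal(size=(184, 3)) @ rng.normal(size=(3, 9))
X[::60] += 25.0                      # plant a few gross outliers

Xc = X - X.mean(axis=0)
A, N = 2, Xc.shape[0]
pca = PCA(n_components=A).fit(Xc)
T2 = np.sum(pca.transform(Xc)**2 / pca.explained_variance_, axis=1)

# Flag observations beyond a 95% T2 limit, drop them, and refit.
T2_lim = A * (N - 1) * (N + 1) / (N * (N - A)) * stats.f.ppf(0.95, A, N - A)
keep = T2 < T2_lim
pca_refit = PCA(n_components=A).fit(Xc[keep])
print(f"kept {keep.sum()} of {N} observations")
```

In practice each flagged observation should be inspected in the raw data before exclusion, rather than removed mechanically.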

What we learned:

• Identifying outliers; removing them and refitting the model.

• Variability in a process can very often be interpreted. The $$R^2$$ and $$Q^2$$ values for each component show which part of the variability in the system is due to the particular phenomenon modelled by that component.

## 6.5.21.5. Process troubleshooting

Recent trends show that the yield of your company’s flagship product is declining. You are uncertain if the supplier of a key raw material is to blame, or if it is due to a change in your process conditions. You begin by investigating the raw material supplier.

The data available has:

• $$N = 24$$

• $$K = 6$$ + 1 designation of process outcome

• Description: 3 of the 6 measurements are size values for the plastic pellets, while the other 3 are the outputs from thermogravimetric analysis (TGA), differential scanning calorimetry (DSC) and thermomechanical analysis (TMA), measured in a laboratory. These 6 measurements are thought to adequately characterize the raw material. Also provided is a designation Adequate or Poor that reflects the process engineer’s opinion of the yield from that lot of materials.

Import the data, and set the Outcome variable as a secondary identifier for each observation, as shown in the illustration below. The observation’s primary identifier is its batch number.

1. Build a latent variable model for all observations and use auto-fit to determine the number of components. If your software does not have an auto-fit feature (cross-validation), then use a Pareto plot of the eigenvalues to decide on the number of components.

2. Interpret component 1, 2 and 3 separately (using the loadings bar plot).

3. Now plot the score plot for components 1 and 2, and colour code the score plot with the Outcome variable. Interpret why observations with Poor outcome are at their locations in the score plot (use a contribution plot).

4. What would be your recommendations to your manager to get more of your batches classified as Adequate rather than Poor?

5. Now build a model only on the observations marked as Adequate in the Outcome variable.

6. Re-interpret the loadings plot for $$p_1$$ and $$p_2$$. Is there a substantial difference between this new loadings plot and the previous one?
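Colour-coding and brushing are graphical operations, but the underlying question in step 3, whether the Poor lots separate from the Adequate lots in score space, can also be checked numerically. A hypothetical sketch in Python with scikit-learn, using made-up data in which the Poor lots differ in the three size measurements (the outcome is used only as a label, never as a model variable):

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up stand-in for the raw-material table: 24 lots, 6 measurements,
# with the last 8 lots labelled Poor and shifted in the 3 size columns.
rng = np.random.default_rng(5)
outcome = np.array(["Adequate"] * 16 + ["Poor"] * 8)
X = rng.normal(size=(24, 6))
X[outcome == "Poor", :3] += 3.0

Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale
T = PCA(n_components=2).fit_transform(Xs)

# A colour-coded score plot reduces to: where does each group sit on average?
for grp in ("Adequate", "Poor"):
    print(grp, np.round(T[outcome == grp].mean(axis=0), 2))
```

If the group centres sit far apart in the score plane, the contribution plot for a Poor lot will point back to the variables responsible for that separation.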

What we learned:

• How to use an indicator variable in the model to learn more from our score plot.

• How to build a data set, and bring in new observations as testing data.