Background

I didn't plan to have an assignment this week, but on reflection after the class on Friday, I feel it is important that you understand the properties of a PCA model and, more importantly, what the NIPALS algorithm is doing. You may use any software package of your choice. Suggested packages are R (free), Octave (a free, largely MATLAB-compatible alternative), Python (free) or MATLAB.

This assignment uses the food texture data again (introduced in class 2 and used in assignment 1). Recall there are 5 variables in the data table:

1. Oil: percentage oil in the pastry
2. Density: the product’s density (the higher the number, the more dense the product)
3. Crispy: a crispiness measurement, on a scale from 7 to 15, with 15 being more crispy.
4. Fracture: the angle, in degrees, through which the pastry can be slowly bent before it fractures.
5. Hardness: a sharp point is used to measure the amount of force required before breakage occurs.

Part A

These questions will help you understand what the eigen-decomposition of the covariance matrix is doing.

1. Mean center and scale the dataset to unit variance. Call this matrix $$\mathbf{X}$$. No need to show any results here.
2. Calculate and report the $$\mathbf{X}'\mathbf{X}$$ matrix. Comment on what you see (does it agree with the scatterplot matrix from the previous assignment?)
3. Calculate the eigenvectors and eigenvalues of that matrix. Report your answer from highest to lowest eigenvalue.
4. Compare the first two eigenvectors to $$\mathbf{p}_1$$ and $$\mathbf{p}_2$$ from the course software.
5. Calculate the $$R^2_a$$ quantity for each of the 5 components, $$R^2_a = \displaystyle \frac{\lambda_a}{\sum_{j=1}^{5}{\lambda_j}}$$, and compare these values to the $$R^2$$ values reported in the course software.
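
As a rough illustration of the steps in Part A, here is a minimal NumPy sketch. It assumes Python is your chosen package and uses randomly generated placeholder numbers, not the food texture data; the variable names (`raw`, `XtX`, `r2`) are my own. Treat it as one possible way to organize the calculation rather than the expected solution.

```python
import numpy as np

# Placeholder data (assumption): 50 rows x 5 correlated columns.
# These are NOT the food texture measurements; load the real table instead.
rng = np.random.default_rng(0)
latent = rng.normal(size=50)
raw = np.outer(latent, [1.0, -0.8, 0.9, -0.7, 0.6]) + 0.5 * rng.normal(size=(50, 5))

# Mean center and scale each column to unit variance (sample std, ddof=1).
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# X'X is symmetric, so eigh applies; it returns eigenvalues in ascending order.
XtX = X.T @ X
eigvals, eigvecs = np.linalg.eigh(XtX)

# Reorder from highest to lowest eigenvalue (eigenvectors are the columns).
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Fraction of the total variance captured by each component.
r2 = eigvals / eigvals.sum()

# Dividing X'X by (N - 1) gives the correlation matrix of the raw data.
print(np.round(XtX / (X.shape[0] - 1), 3))
print(np.round(r2, 3))
```

Keep in mind that an eigenvector is only defined up to its sign, so a loading vector from another package may match yours with every element negated.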

Part B

The next few questions will help you better understand the regressions that happen inside the NIPALS algorithm while-loop.
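
To make those regressions concrete, here is a minimal sketch of the NIPALS loop for a single component. It assumes a mean centered and scaled $$\mathbf{X}$$, starts from the first column (an arbitrary choice on my part), and uses placeholder data rather than the food texture table. The function name `nipals_component` and its `tol` and `max_iter` parameters are illustrative, not taken from the course software.

```python
import numpy as np

def nipals_component(X, tol=1e-10, max_iter=500):
    """Extract one PCA component from a centered, scaled X (sketch)."""
    # Starting guess for the scores: the first column of X (assumption).
    t = X[:, 0].copy()
    for _ in range(max_iter):
        # Regress every column of X onto t: one slope per column.
        p = X.T @ t / (t @ t)
        # Rescale the loading vector to unit length.
        p = p / np.linalg.norm(p)
        # Regress every row of X onto p: one slope per row.
        t_new = X @ p / (p @ p)
        # Stop once the scores no longer change between iterations.
        converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
        t = t_new
        if converged:
            break
    return t, p

# Placeholder data (assumption): NOT the food texture measurements.
rng = np.random.default_rng(0)
latent = rng.normal(size=50)
raw = np.outer(latent, [1.0, -0.8, 0.9, -0.7, 0.6]) + 0.5 * rng.normal(size=(50, 5))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

t1, p1 = nipals_component(X)
print(np.round(p1, 3))
```

Each pass through the loop is just two sets of least squares fits without an intercept: columns onto the scores, then rows onto the loadings.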

1. Calculate the $$\mathbf{t}_1$$ score vector from the $$\mathbf{p}_1$$ and the $$\mathbf{X}$$ matrix and ensure it agrees with the $$\mathbf{t}_1$$ score vector from the software. Also show that the mean of $$\mathbf{t}_1$$ is zero.

2. In a simple least squares model we regress a $$\mathbf{y}$$ vector on an $$\mathbf{x}$$ vector to calculate the intercept and slope in the following equation: $$\mathbf{y} = c + m \mathbf{x}$$.

Let the $$\mathbf{t}_1$$ score vector from the previous question correspond to your $$\mathbf{x}$$-variable and let the Oil column be your $$\mathbf{y}$$-variable.

Plot a scatter plot of these $$\mathbf{x}$$ and $$\mathbf{y}$$-variables.

3. Calculate the least squares model parameters (slope and intercept) from these data. Comment on the regression slope value. Also calculate and report the $$R^2$$ value from this regression.

4. Repeat questions 2 and 3 for the Density variable (only show the plot in your answers). Also repeat them for the Hardness variable and comment on your results.

5. Two questions to think about (no need to answer anything here). In this least-squares model:

   • what is $$\hat{\mathbf{y}}$$?
   • what do the residuals from this regression represent?

6. Regress the row corresponding to pastry B758 onto the first loading vector ($$\mathbf{x}$$ in the regression).

   1. Plot the data first.
   2. Then fit the regression slope and comment on it.

7. Repeat the previous question for pastry B694.
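
The two kinds of regression in Part B can be sketched as follows. This is a minimal NumPy example under several assumptions of mine: placeholder data instead of the food texture table, the scaled column used as $$\mathbf{y}$$, row index 7 standing in for a particular pastry, and no intercept in the row regression (as in the NIPALS score step). Plotting is left out to keep the sketch short.

```python
import numpy as np

# Placeholder data (assumption): NOT the food texture measurements.
rng = np.random.default_rng(0)
latent = rng.normal(size=50)
raw = np.outer(latent, [1.0, -0.8, 0.9, -0.7, 0.6]) + 0.5 * rng.normal(size=(50, 5))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# First loading vector: eigenvector of X'X with the largest eigenvalue
# (eigh sorts eigenvalues in ascending order, so it is the last column).
p1 = np.linalg.eigh(X.T @ X)[1][:, -1]
t1 = X @ p1

# Column regression: y = one scaled column of X, x = t1.
y = X[:, 0]
slope, intercept = np.polyfit(t1, y, deg=1)  # highest power first
y_hat = intercept + slope * t1
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(slope, p1[0], intercept, r2)

# Row regression through the origin: y = one row of X, x = p1.
row = X[7, :]  # hypothetical row index standing in for one pastry
score = (row @ p1) / (p1 @ p1)
print(score, t1[7])
```

With centered and scaled data, the printed slope should line up with the corresponding element of $$\mathbf{p}_1$$ and the row slope with that row's entry in $$\mathbf{t}_1$$, which is a useful check on your own answers.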