# Assignment 2

# Background

I didn't plan to have an assignment this week, but on reflection after the class on Friday, I feel it is important that you understand the properties of a PCA model and, more importantly, understand what the NIPALS algorithm is doing. You may use any software package of your choice. Suggested packages are R (free), Octave (a free MATLAB alternative), Python (free) or MATLAB ($$).

This assignment uses the food texture data again (introduced in class 2 and used in assignment 1). Recall there are 5 variables in the data table:

`Oil`
: percentage oil in the pastry

`Density`
: the product’s density (the higher the number, the more dense the product)

`Crispy`
: a crispiness measurement, on a scale from 7 to 15, with 15 being more crispy

`Fracture`
: the angle, in degrees, through which the pastry can be slowly bent before it fractures

`Hardness`
: a sharp point is used to measure the amount of force required before breakage occurs

# Part A

These questions will help you understand a bit better what the eigendecomposition of the covariance matrix is doing.

- Mean center and scale the dataset to unit variance. Call this matrix \(\mathbf{X}\). No need to show any results here.
- Calculate and report the \(\mathbf{X}'\mathbf{X}\) matrix. Comment on what you see (does it agree with the scatterplot matrix from the previous assignment?)
- Calculate the eigenvectors and eigenvalues of that matrix. Report your answer from highest to lowest eigenvalue.
- Compare the first two eigenvectors to \(\mathbf{p}_1\) and \(\mathbf{p}_2\) from the course software.
- Calculate the \(R^2_a\) quantity for each of the 5 components: \(R^2_a = \displaystyle \frac{\lambda_a}{\sum_{j=1}^{5}{\lambda_j}}\) and compare them to the \(R^2\) values reported in the course software.
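The steps above can be sketched in Python with NumPy (one of the suggested packages). The random matrix below is only a stand-in for the food texture data, which you would load from the course file instead:

```python
import numpy as np

# Hypothetical stand-in for the food texture table: 50 rows x 5 columns.
# In the assignment you would load the actual data here instead.
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 5))

# Mean-center and scale each column to unit variance: this is X.
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# The X'X matrix (proportional to the correlation matrix for scaled data).
XtX = X.T @ X

# Eigendecomposition of the symmetric X'X matrix;
# reorder from highest to lowest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(XtX)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# R^2 per component: each eigenvalue over the sum of all five.
R2 = eigvals / eigvals.sum()
```

The columns of `eigvecs` are the loading vectors, so the first two columns are what you compare against \(\mathbf{p}_1\) and \(\mathbf{p}_2\) from the course software (possibly up to a sign flip, since an eigenvector's sign is arbitrary).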

# Part B

The next few questions will help you better understand the regressions that happen inside the `while`-loop of the NIPALS algorithm.

Calculate the \(\mathbf{t}_1\) score vector from \(\mathbf{p}_1\) and the \(\mathbf{X}\) matrix, and ensure it agrees with the \(\mathbf{t}_1\) score vector from the software. Also show that the mean of \(\mathbf{t}_1\) is zero.
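A minimal sketch of this score calculation, again using a random matrix as a stand-in for the scaled food texture data:

```python
import numpy as np

# Stand-in for the centered and scaled food texture data.
rng = np.random.default_rng(1)
raw = rng.normal(size=(50, 5))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# p1 = eigenvector of X'X with the largest eigenvalue
# (np.linalg.eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
p1 = eigvecs[:, np.argmax(eigvals)]

# Score vector: project each row of X onto the loading direction p1.
t1 = X @ p1

# Because every column of X was mean-centered, t1 averages to zero:
# mean(t1) = mean(X, axis=0) @ p1 = 0 @ p1 = 0.
print(t1.mean())
```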

In a simple least squares model we regress a \(\mathbf{y}\) vector on an \(\mathbf{x}\) vector to calculate the intercept and slope in the following equation: \(\mathbf{y} = c + m \mathbf{x}\).
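A sketch of that simple least squares calculation, written directly from the textbook formulas (slope from the covariance over the variance, intercept from the means). The small `x` and `y` vectors here are made-up numbers; in the questions that follow, `x` would be \(\mathbf{t}_1\) and `y` a column such as `Oil`:

```python
import numpy as np

def ols(x, y):
    """Fit y = c + m*x by least squares; return intercept, slope, R^2."""
    m = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope
    c = y.mean() - m * x.mean()                          # intercept
    residuals = y - (c + m * x)
    R2 = 1.0 - residuals.var() / y.var()                 # fraction explained
    return c, m, R2

# Hypothetical illustrative data (not from the assignment's data table).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 8.8])
c, m, R2 = ols(x, y)
```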

Let the \(\mathbf{t}_1\) score vector from the previous question be your \(\mathbf{x}\)-variable and let the `Oil` column be your \(\mathbf{y}\)-variable. Plot a scatter plot of these \(\mathbf{x}\) and \(\mathbf{y}\) variables.

Calculate the least squares model parameters (slope and intercept) from these data. Comment on the regression slope value. Also calculate and report the \(R^2\) value from this regression.

Repeat the previous two questions for the `Density` variable (only show the plot in your answers). Also repeat them for the `Hardness` variable and comment on your results.

Two questions to think about (no need to answer anything here). In this least-squares model:

- what is \(\hat{\mathbf{y}}\)?
- what do the residuals from this regression represent?

Regress the row corresponding to pastry `B758` onto the first loading vector (the loading vector is \(\mathbf{x}\) in the regression).

- Plot the data first.
- Then fit the regression slope and comment on it.
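A sketch of this row-wise regression, once more on stand-in data (the first row of `X` here plays the role of a pastry such as `B758`). Regressing a row onto \(\mathbf{p}_1\) with no intercept gives a slope of \((\mathbf{x}_i'\mathbf{p}_1)/(\mathbf{p}_1'\mathbf{p}_1)\), and because the loading has unit length this slope is exactly that observation's \(t_1\) score:

```python
import numpy as np

# Stand-in for the centered and scaled data table.
rng = np.random.default_rng(2)
raw = rng.normal(size=(50, 5))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# First loading vector from the eigendecomposition of X'X.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
p1 = eigvecs[:, np.argmax(eigvals)]

# Regress one row of X onto p1 through the origin:
# slope = (row . p1) / (p1 . p1).
row = X[0, :]
slope = (row @ p1) / (p1 @ p1)

# Since p1 has unit length, this slope equals that row's t1 score.
t1 = X @ p1
```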

Repeat the previous question for pastry `B694`.

**Hand in your answers at the next class**.