# Assignment 2

<rst>
<rst-options: 'toc' = False/>

Background
==========

I didn't plan to have an assignment this week, but on reflection after the class on Friday, I feel it is important that you understand the properties of a PCA model and, more importantly, what the NIPALS algorithm is doing. You may use any software package of your choice. Suggested packages are `R <http://www.r-project.org/>`_ (free), `Octave <http://www.gnu.org/software/octave/>`_ (a free alternative to MATLAB), `Python <http://enthought.com/products/epd_free.php>`_ (free) or `MATLAB <http://mathworks.com>`_ ($$).

This assignment uses the `food texture <http://openmv.net/info/food-texture>`_ data again (introduced in class 2 and used in assignment 1). Recall there are 5 variables in the data table:

- ``Oil``: percentage oil in the pastry.
- ``Density``: the product's density (the higher the number, the more dense the product).
- ``Crispy``: a crispiness measurement, on a scale from 7 to 15, with 15 being more crispy.
- ``Fracture``: the angle, in degrees, through which the pastry can be slowly bent before it fractures.
- ``Hardness``: a sharp point is used to measure the amount of force required before breakage occurs.

Part A
======

These questions will help you understand a bit more about what the eigendecomposition of the covariance matrix is doing.

1. Mean-center and scale the dataset to unit variance. Call this matrix :math:`\mathbf{X}`. No need to show any results here.
2. Calculate and report the :math:`\mathbf{X}'\mathbf{X}` matrix. Comment on what you see: does it agree with the scatterplot matrix from the previous assignment?
3. Calculate the eigenvectors and eigenvalues of that matrix. Report your answer from highest to lowest eigenvalue.
4. Compare the first two eigenvectors to :math:`\mathbf{p}_1` and :math:`\mathbf{p}_2` from the course software.
5. Calculate the :math:`R^2_a` quantity for each of the 5 components: :math:`R^2_a = \displaystyle \frac{\lambda_a}{\sum_{a=1}^{5}{\lambda_a}}`, and compare them to the :math:`R^2` values reported in the course software.
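
A minimal sketch of these steps in Python with NumPy. The random matrix here is only a stand-in for the real food-texture table (swap in the five columns from the dataset); the variable names and seed are my own, not part of the assignment:

```python
import numpy as np

# Stand-in for the raw data table (5 variables); replace with the real
# Oil, Density, Crispy, Fracture and Hardness columns.
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 5))

# Step 1: mean-center and scale each column to unit variance -> X
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Step 2: the X'X matrix (proportional to the correlation matrix)
XtX = X.T @ X

# Step 3: eigendecomposition; eigh returns ascending eigenvalues for a
# symmetric matrix, so reverse the order to report highest-to-lowest.
eigvals, eigvecs = np.linalg.eigh(XtX)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: R^2 per component is each eigenvalue over the sum of all
# eigenvalues; the five values add up to 1.
R2 = eigvals / eigvals.sum()
print(R2)
```

When comparing eigenvectors to the software's loadings (step 4), remember the sign of an eigenvector is arbitrary: compare against :math:`\pm\mathbf{p}_a`.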

Part B
======

The next few questions will help you better understand the regressions that happen inside the NIPALS algorithm ``while``-loop.

1. Calculate the :math:`\mathbf{t_1}` score vector from :math:`\mathbf{p}_1` and the :math:`\mathbf{X}` matrix, and ensure it agrees with the :math:`\mathbf{t_1}` score vector from the software. Also show that the mean of :math:`\mathbf{t_1}` is zero.
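
The score calculation can be sketched as follows (again with a random stand-in matrix; in your answer use your own scaled :math:`\mathbf{X}` and the :math:`\mathbf{p}_1` from Part A):

```python
import numpy as np

# Stand-in data; substitute your centered and scaled X matrix.
rng = np.random.default_rng(1)
raw = rng.normal(size=(50, 5))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# p1 is the eigenvector of X'X with the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
p1 = eigvecs[:, np.argmax(eigvals)]

t1 = X @ p1       # score vector: one score per pastry
print(t1.mean())  # essentially zero: the columns of X are centered, so t1 is too
```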

2. In a simple least squares model we regress a :math:`\mathbf{y}` vector on an :math:`\mathbf{x}` vector to calculate the intercept and slope in the following equation: :math:`\mathbf{y} = c + m \mathbf{x}`.

   Let the :math:`\mathbf{t_1}` score vector from the previous question correspond to your :math:`\mathbf{x}`-variable, and let the ``Oil`` column be your :math:`\mathbf{y}`-variable.

   Plot a scatter plot of these :math:`\mathbf{x}` and :math:`\mathbf{y}` variables.

3. Calculate the least squares model parameters (slope and intercept) from these data. Comment on the value of the regression slope. Also calculate and report the :math:`R^2` value from this regression.
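
One way to get the slope, intercept and :math:`R^2` is sketched below. The synthetic ``x`` and ``y`` here are placeholders: substitute your :math:`\mathbf{t_1}` vector for ``x`` and the ``Oil`` column for ``y``:

```python
import numpy as np

# Synthetic stand-ins: x plays the role of t1, y the role of Oil.
rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 1.5 * x + rng.normal(scale=0.5, size=50)

# polyfit returns coefficients highest degree first: slope m, intercept c
m, c = np.polyfit(x, y, deg=1)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y_hat = c + m * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
R2 = 1.0 - ss_res / ss_tot
print(m, c, R2)
```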

4. Repeat questions 2 and 3 for the ``Density`` variable (only show the plot in your answers). Also repeat them for the ``Hardness`` variable and comment on your results.

5. Two questions to think about (no need to answer anything here). In this least-squares model:

   * what is :math:`\hat{\mathbf{y}}`?
   * what do the residuals from this regression represent?

6. Regress the row corresponding to pastry ``B758`` onto the first loading vector (the :math:`\mathbf{x}` in the regression).

   a. Plot the data first.
   b. Then fit the regression slope and comment on it.

7. Repeat the previous question for pastry ``B694``.
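
For these last two questions it helps to know that regressing a row of :math:`\mathbf{X}` onto :math:`\mathbf{p}_1` (a regression through the origin) gives a slope equal to that observation's score. A sketch, with ``X[0]`` as a hypothetical stand-in for the named pastry's row:

```python
import numpy as np

# Stand-in data; in your answer pick out the actual row for the pastry.
rng = np.random.default_rng(3)
raw = rng.normal(size=(50, 5))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
p1 = eigvecs[:, np.argmax(eigvals)]

row = X[0]                      # hypothetical stand-in for one pastry's row
slope = (row @ p1) / (p1 @ p1)  # least squares fit through the origin
print(slope)                    # equals row @ p1, since p1 has unit length
```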

**Hand in your answers at the next class**.

</rst>