Assignment 6 - 2011

From Statistics for Engineering
Jump to navigation Jump to search
Due date(s): 10 March 2011 (updated)
Nuvola mimetypes pdf.png (PDF) Assignment questions

<rst> <rst-options: 'toc' = False/> <rst-options: 'reset-figures' = True/>

Assignment objectives

=========

.. rubric:: Assignment objectives

  • To become more comfortable using R to fit, interpret and manipulate least squares models.
  • The questions in this assignment are typical of the exploratory/learning type questions that will be in the take-home midterm.

Question 1 [1.5]

=====

Use the mature `cheddar cheese data set <http://openmv.net/info/cheddar-cheese>`_ for this question.

  1. . Choose any :math:`x`-variable, either ``Acetic`` acid concentration (already log-transformed), ``H2S`` concentration (already log-transformed), or ``Lactic`` acid concentration (in original units) and use this to predict the ``Taste`` variable in the data set. The ``Taste`` is a subjective measurement, presumably measured by a panel of tasters.

Prove that you get the same linear model coefficients, :math:`R^2`, :math:`S_E` and confidence intervals whether or not you first mean center the :math:`x` and :math:`y` variables.

  1. . What is the level of correlation between each of the :math:`x`-variables. Also show a scatterplot matrix to learn what this level of correlation looks like visually.

* Report your correlations as a :math:`3 \times 3` matrix, where there should be 1.0's on the diagonal, and values between :math:`-1` and :math:`+1` on the off-diagonals.

  1. . Build a linear regression that uses all three :math:`x`-variables to predict :math:`y`.

- Report the slope coefficient and confidence interval for each :math:`x`-variable - Report the model's standard error. Has it decreased from the model in part 1? - Report the model's :math:`R^2` value. Has it decreased?

Question 2 [2.5]

====

In this question we will revisit the `bioreactor yield <http://openmv.net/info/bioreactor-yields>`_ data set and fit a linear model with all :math:`x`-variables to predict the yield.

  1. . Provide the interpretation for each coefficient in the model, and also comment on its confidence interval when interpreting it.
  1. . Compare the 3 slope coefficient values you just calculated, to those from the last assignment:

- :math:`\hat{y} = 102.5 - 0.69T`, where :math:`T` is tank temperature - :math:`\hat{y} = -20.3 + 0.016S`, where :math:`S` is impeller speed - :math:`\hat{y} = 54.9 - 16.7B`, where :math:`B` is 1 if baffles are present and :math:`B=0` with no baffles

Explain why your coefficients do not match.

  1. . Are the residuals from the multiple linear regression model normally distributed?
  1. . In this part we are investigating the variance-covariance matrices used to calculate the linear model.

- First center the :math:`x`-variables and the :math:`y`-variable that you used in the model.

*Note*: feel free to use MATLAB, or any other tool to answer this question. If you are using R, then you will benefit from `this page in the R tutorial <Software_tutorial/Vectors_and_matrices>`_. Also, read the help for the ``model.matrix(...)`` function to get the :math:`\mathbf{X}`-matrix. Then read the help for the ``sweep(...)`` function, or more simply use the ``scale(...)`` function to do the mean-centering.

- Show your calculated :math:`\mathbf{X}^T\mathbf{X}` and :math:`\mathbf{X}^T\mathbf{y}` variance-covariance matrices from the centered data.

- Explain why the interpretation of covariances in :math:`\mathbf{X}^T\mathbf{y}` match the results from the full MLR model you calculated in part 1 of this question.

- Calculate :math:`\mathbf{b} =\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}` and show that it agrees with the estimates that R calculated (even though R fits an intercept term, while your :math:`\mathbf{b}` does not).

  1. . What would be the predicted yield for an experiment run without baffles, at 4000 rpm impeller speed, run at a reactor temperature of 90 °C?

Question 3 [3]

==

In this question we will use the `LDPE data <http://openmv.net/info/ldpe>`_ which is data from a high-fidelity simulation of a low-density polyethylene reactor. LDPE reactors are very long, thin tubes. In this particular case the tube is divided in 2 zones, since the feed enters at the start of the tube, and some point further down the tube (start of the second zone). There is a temperature profile along the tube, with a certain maximum temperature somewhere along the length. The maximum temperature in zone 1, ``Tmax1`` is reached some fraction ``z1`` along the length; similarly in zone 2 with the ``Tmax2`` and ``z2`` variables.

We will build a linear model to predict the ``SCB`` variable, the short chain branching (per 1000 carbon atoms) which is an important quality variable for this product. Note that the last 4 rows of data are known to be from abnormal process operation, when the process started to experience a problem. However, we will pretend we didn't know that when building the model, so keep them in for now.


  1. . Use only the following subset of :math:`x`-variables: ``Tmax1``, ``Tmax2``, ``z1`` and ``z2`` and the :math:`y` variable = ``SCB``. Show the relationship between these 5 variables in a scatter plot matrix.

Use this code to get you started (make sure you understand what it is doing)::

LDPE <- read.csv('http://openmv.net/file/ldpe.csv') subdata <- data.frame(cbind(LDPE$Tmax1, LDPE$Tmax2, LDPE$z1, LDPE$z2, LDPE$SCB)) colnames(subdata) <- c("Tmax1", "Tmax2", "z1", "z2", "SCB")

Using bullet points, describe the nature of relationships between the 5 variables, and particularly the relationship to the :math:`y`-variable.

  1. . Let's start with a linear model between ``z2`` and ``SCB``. We will call this the ``z2`` model. Let's examine its residuals:

- Are the residuals normally distributed? - What is the standard error of this model? - Are there any time-based trends in the residuals (the rows in the data are already in time-order)? - Use any other relevant plots of the predicted values, the residuals, the :math:`x`-variable, as described in class, and diagnose the problem with this linear model. - What can be done to fix the problem? (You don't need to implement the fix yet).

.. Non-constant error variance

  1. . Show a plot of the hat-values (leverage) from the ``z2`` model.

- Add suitable horizontal cut-off lines to your hat-value plot. - Identify on your plot the observations that have large leverage on the model - Remove the high-leverage outliers and refit the model. Call this the ``z2.updated`` model - Show the updated hat-values and verify whether the problem has mostly gone away

*Note*: see the R tutorial on how to rebuild a model by removing points

  1. . Use the ``influenceIndexPlot(...)`` function in the ``car`` library on both the ``z2`` model and the ``z2.updated`` model. Interpret what each plot is showing for the two models. You may ignore the *Bonferroni p-values* subplot.

</rst>