Difference between revisions of "Assignment 6 - 2011"

From Statistics for Engineering
Jump to navigation Jump to search
(Created page with "To come")
 
Line 1: Line 1:
To come
{{OtherSidebar
| due_dates = 10 March 2011 ('''''updated''''')
| dates_alt_text = Due date(s)
| questions_PDF = Assignment-6-2011.pdf
| questions_text_alt = Assignment questions
| questions_link =
| solutions_PDF =
| solutions_text_alt = Assignment solutions
| solutions_link =
| other_instructions =
}}
<rst>
<rst-options: 'toc' = False/>
<rst-options: 'reset-figures' = True/>
 
Assignment objectives
=====================
 
 
 
.. rubric:: Assignment objectives
 
* To become more comfortable using R to fit, interpret and manipulate least squares models.
* The questions in this assignment are typical of the exploratory/learning type questions that will be in the take-home midterm.
 
Question 1 [1.5]
=================
 
Use the mature `cheddar cheese data set <http://openmv.net/info/cheddar-cheese>`_ for this question.
 
#. Choose any :math:`x`-variable, either ``Acetic`` acid concentration (already log-transformed), ``H2S`` concentration  (already log-transformed), or ``Lactic`` acid concentration (in original units) and use this to predict the ``Taste`` variable in the data set.  The ``Taste`` is a subjective measurement, presumably measured by a panel of tasters.
 
Prove that you get the same linear model coefficients, :math:`R^2`, :math:`S_E` and confidence intervals whether or not you first mean center the :math:`x` and :math:`y` variables.
#. What is the level of correlation between each of the :math:`x`-variables.  Also show a scatterplot matrix to learn what this level of correlation looks like visually.
 
* Report your correlations as a :math:`3 \times 3` matrix, where there should be 1.0's on the diagonal, and values between :math:`-1` and :math:`+1` on the off-diagonals.
 
#. Build a linear regression that uses all three :math:`x`-variables to predict :math:`y`.
 
- Report the slope coefficient and confidence interval for each :math:`x`-variable
- Report the model's standard error.  Has it decreased from the model in part 1?
- Report the model's :math:`R^2` value.  Has it decreased?
 
Question 2 [2.5]
================
 
In this question we will revisit the `bioreactor yield <http://openmv.net/info/bioreactor-yields>`_ data set and fit a linear model with all :math:`x`-variables to predict the yield.
 
#. Provide the interpretation for each coefficient in the model, and also comment on its confidence interval when interpreting it.
 
#. Compare the 3 slope coefficient values you just calculated, to those from the last assignment:
 
- :math:`\hat{y} = 102.5 - 0.69T`, where :math:`T` is tank temperature
- :math:`\hat{y} = -20.3 + 0.016S`, where :math:`S` is impeller speed
- :math:`\hat{y} = 54.9 - 16.7B`, where :math:`B` is 1 if baffles are present and :math:`B=0` with no baffles
Explain why your coefficients do not match.
#. Are the residuals from the multiple linear regression model normally distributed?
 
#. In this part we are investigating the variance-covariance matrices used to calculate the linear model.
 
- First center the :math:`x`-variables and the :math:`y`-variable that you used in the model.
*Note*: feel free to use MATLAB, or any other tool to answer this question.  If you are using R, then you will benefit from `this page in the R tutorial <http://connectmv.com/tutorials/r-tutorial/vectors-and-matrices/#matrix-operations>`_.  Also, read the help for the ``model.matrix(...)`` function to get the :math:`\mathbf{X}`-matrix.  Then read the help for the ``sweep(...)`` function, or more simply use the ``scale(...)`` function to do the mean-centering.
 
- Show your calculated :math:`\mathbf{X}^T\mathbf{X}` and :math:`\mathbf{X}^T\mathbf{y}` variance-covariance matrices from the centered data.
 
- Explain why the interpretation of covariances in :math:`\mathbf{X}^T\mathbf{y}` match the results from the full MLR model you calculated in part 1 of this question.
- Calculate :math:`\mathbf{b} =\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}` and show that it agrees with the estimates that R calculated (even though R fits an intercept term, while your :math:`\mathbf{b}` does not).
#. What would be the predicted yield for an experiment run without baffles, at 4000 rpm impeller speed, run at a reactor temperature of 90 °C?
 
Question 3 [3]
==============
 
In this question we will use the `LDPE data <http://openmv.net/info/ldpe>`_ which is data from a high-fidelity simulation of a low-density polyethylene reactor.  LDPE reactors are very long, thin tubes.  In this particular case the tube is divided in 2 zones, since the feed enters at the start of the tube, and some point further down the tube (start of the second zone). There is a temperature profile along the tube, with a certain maximum temperature somewhere along the length.  The maximum temperature in zone 1, ``Tmax1`` is reached some fraction ``z1`` along the length; similarly in zone 2 with the ``Tmax2`` and ``z2`` variables.
 
We will build a linear model to predict the ``SCB`` variable, the short chain branching (per 1000 carbon atoms) which is an important quality variable for this product.  Note that the last 4 rows of data are known to be from abnormal process operation, when the process started to experience a problem. However, we will pretend we didn't know that when building the model, so keep them in for now.
 
 
#. Use only the following subset of :math:`x`-variables: ``Tmax1``, ``Tmax2``, ``z1`` and ``z2`` and the :math:`y` variable = ``SCB``.  Show the relationship between these 5 variables in a scatter plot matrix.
Use this code to get you started (make sure you understand what it is doing)::
LDPE <- read.csv('http://openmv.net/file/ldpe.csv')
subdata <- data.frame(cbind(LDPE$Tmax1, LDPE$Tmax2, LDPE$z1, LDPE$z2, LDPE$SCB))
colnames(subdata) <- c("Tmax1", "Tmax2", "z1", "z2", "SCB")
Using bullet points, describe the nature of relationships between the 5 variables, and particularly the relationship to the :math:`y`-variable.
#. Let's start with a linear model between ``z2`` and ``SCB``.  We will call this the ``z2`` model.  Let's examine its residuals:
 
- Are the residuals normally distributed?
- What is the standard error of this model?
- Are there any time-based trends in the residuals (the rows in the data are already in time-order)?
- Use any other relevant plots of the predicted values, the residuals, the :math:`x`-variable, as described in class, and diagnose the problem with this linear model.
- What can be done to fix the problem? (You don't need to implement the fix yet). 
 
.. Non-constant error variance
#. Show a plot of the hat-values (leverage) from the ``z2`` model.
 
- Add suitable horizontal cut-off lines to your hat-value plot.
- Identify on your plot the observations that have large leverage on the model
- Remove the high-leverage outliers and refit the model. Call this the ``z2.updated`` model
- Show the updated hat-values and verify whether the problem has mostly gone away
*Note*: see the R tutorial on how to rebuild a model by removing points
 
#. Use the ``influenceIndexPlot(...)`` function in the ``car`` library on both the ``z2`` model and the ``z2.updated`` model.  Interpret what each plot is showing for the two models.  You may ignore the *Bonferroni p-values*  subplot.
</rst>

Revision as of 06:14, 18 September 2018

Due date(s): 10 March 2011 (updated)
Nuvola mimetypes pdf.png (PDF) Assignment questions

<rst> <rst-options: 'toc' = False/> <rst-options: 'reset-figures' = True/>

Assignment objectives

=========

.. rubric:: Assignment objectives

  • To become more comfortable using R to fit, interpret and manipulate least squares models.
  • The questions in this assignment are typical of the exploratory/learning type questions that will be in the take-home midterm.

Question 1 [1.5]

=====

Use the mature `cheddar cheese data set <http://openmv.net/info/cheddar-cheese>`_ for this question.

  1. . Choose any :math:`x`-variable, either ``Acetic`` acid concentration (already log-transformed), ``H2S`` concentration (already log-transformed), or ``Lactic`` acid concentration (in original units) and use this to predict the ``Taste`` variable in the data set. The ``Taste`` is a subjective measurement, presumably measured by a panel of tasters.

Prove that you get the same linear model coefficients, :math:`R^2`, :math:`S_E` and confidence intervals whether or not you first mean center the :math:`x` and :math:`y` variables.

  1. . What is the level of correlation between each of the :math:`x`-variables. Also show a scatterplot matrix to learn what this level of correlation looks like visually.

* Report your correlations as a :math:`3 \times 3` matrix, where there should be 1.0's on the diagonal, and values between :math:`-1` and :math:`+1` on the off-diagonals.

  1. . Build a linear regression that uses all three :math:`x`-variables to predict :math:`y`.

- Report the slope coefficient and confidence interval for each :math:`x`-variable - Report the model's standard error. Has it decreased from the model in part 1? - Report the model's :math:`R^2` value. Has it decreased?

Question 2 [2.5]

====

In this question we will revisit the `bioreactor yield <http://openmv.net/info/bioreactor-yields>`_ data set and fit a linear model with all :math:`x`-variables to predict the yield.

  1. . Provide the interpretation for each coefficient in the model, and also comment on its confidence interval when interpreting it.
  1. . Compare the 3 slope coefficient values you just calculated, to those from the last assignment:

- :math:`\hat{y} = 102.5 - 0.69T`, where :math:`T` is tank temperature - :math:`\hat{y} = -20.3 + 0.016S`, where :math:`S` is impeller speed - :math:`\hat{y} = 54.9 - 16.7B`, where :math:`B` is 1 if baffles are present and :math:`B=0` with no baffles

Explain why your coefficients do not match.

  1. . Are the residuals from the multiple linear regression model normally distributed?
  1. . In this part we are investigating the variance-covariance matrices used to calculate the linear model.

- First center the :math:`x`-variables and the :math:`y`-variable that you used in the model.

*Note*: feel free to use MATLAB, or any other tool to answer this question. If you are using R, then you will benefit from `this page in the R tutorial <http://connectmv.com/tutorials/r-tutorial/vectors-and-matrices/#matrix-operations>`_. Also, read the help for the ``model.matrix(...)`` function to get the :math:`\mathbf{X}`-matrix. Then read the help for the ``sweep(...)`` function, or more simply use the ``scale(...)`` function to do the mean-centering.

- Show your calculated :math:`\mathbf{X}^T\mathbf{X}` and :math:`\mathbf{X}^T\mathbf{y}` variance-covariance matrices from the centered data.

- Explain why the interpretation of covariances in :math:`\mathbf{X}^T\mathbf{y}` match the results from the full MLR model you calculated in part 1 of this question.

- Calculate :math:`\mathbf{b} =\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}` and show that it agrees with the estimates that R calculated (even though R fits an intercept term, while your :math:`\mathbf{b}` does not).

  1. . What would be the predicted yield for an experiment run without baffles, at 4000 rpm impeller speed, run at a reactor temperature of 90 °C?

Question 3 [3]

==

In this question we will use the `LDPE data <http://openmv.net/info/ldpe>`_ which is data from a high-fidelity simulation of a low-density polyethylene reactor. LDPE reactors are very long, thin tubes. In this particular case the tube is divided in 2 zones, since the feed enters at the start of the tube, and some point further down the tube (start of the second zone). There is a temperature profile along the tube, with a certain maximum temperature somewhere along the length. The maximum temperature in zone 1, ``Tmax1`` is reached some fraction ``z1`` along the length; similarly in zone 2 with the ``Tmax2`` and ``z2`` variables.

We will build a linear model to predict the ``SCB`` variable, the short chain branching (per 1000 carbon atoms) which is an important quality variable for this product. Note that the last 4 rows of data are known to be from abnormal process operation, when the process started to experience a problem. However, we will pretend we didn't know that when building the model, so keep them in for now.


  1. . Use only the following subset of :math:`x`-variables: ``Tmax1``, ``Tmax2``, ``z1`` and ``z2`` and the :math:`y` variable = ``SCB``. Show the relationship between these 5 variables in a scatter plot matrix.

Use this code to get you started (make sure you understand what it is doing)::

LDPE <- read.csv('http://openmv.net/file/ldpe.csv') subdata <- data.frame(cbind(LDPE$Tmax1, LDPE$Tmax2, LDPE$z1, LDPE$z2, LDPE$SCB)) colnames(subdata) <- c("Tmax1", "Tmax2", "z1", "z2", "SCB")

Using bullet points, describe the nature of relationships between the 5 variables, and particularly the relationship to the :math:`y`-variable.

  1. . Let's start with a linear model between ``z2`` and ``SCB``. We will call this the ``z2`` model. Let's examine its residuals:

- Are the residuals normally distributed? - What is the standard error of this model? - Are there any time-based trends in the residuals (the rows in the data are already in time-order)? - Use any other relevant plots of the predicted values, the residuals, the :math:`x`-variable, as described in class, and diagnose the problem with this linear model. - What can be done to fix the problem? (You don't need to implement the fix yet).

.. Non-constant error variance

  1. . Show a plot of the hat-values (leverage) from the ``z2`` model.

- Add suitable horizontal cut-off lines to your hat-value plot. - Identify on your plot the observations that have large leverage on the model - Remove the high-leverage outliers and refit the model. Call this the ``z2.updated`` model - Show the updated hat-values and verify whether the problem has mostly gone away

*Note*: see the R tutorial on how to rebuild a model by removing points

  1. . Use the ``influenceIndexPlot(...)`` function in the ``car`` library on both the ``z2`` model and the ``z2.updated`` model. Interpret what each plot is showing for the two models. You may ignore the *Bonferroni p-values* subplot.

</rst>