Difference between revisions of "Assignment 6 - 2011"
Kevin Dunn (talk | contribs) (Created page with "To come") |
Kevin Dunn (talk | contribs) |
||
(One intermediate revision by the same user not shown) | |||
Line 1: | Line 1: | ||
To | {{OtherSidebar | ||
| due_dates = 10 March 2011 ('''''updated''''') | |||
| dates_alt_text = Due date(s) | |||
| questions_PDF = Assignment-6-2011.pdf | |||
| questions_text_alt = Assignment questions | |||
| questions_link = | |||
| solutions_PDF = | |||
| solutions_text_alt = Assignment solutions | |||
| solutions_link = | |||
| other_instructions = | |||
}} | |||
<rst> | |||
<rst-options: 'toc' = False/> | |||
<rst-options: 'reset-figures' = True/> | |||
Assignment objectives | |||
===================== | |||
.. rubric:: Assignment objectives | |||
* To become more comfortable using R to fit, interpret and manipulate least squares models. | |||
* The questions in this assignment are typical of the exploratory/learning type questions that will be in the take-home midterm. | |||
Question 1 [1.5] | |||
================= | |||
Use the mature `cheddar cheese data set <http://openmv.net/info/cheddar-cheese>`_ for this question. | |||
#. Choose any :math:`x`-variable, either ``Acetic`` acid concentration (already log-transformed), ``H2S`` concentration (already log-transformed), or ``Lactic`` acid concentration (in original units) and use this to predict the ``Taste`` variable in the data set. The ``Taste`` is a subjective measurement, presumably measured by a panel of tasters. | |||
Prove that you get the same linear model coefficients, :math:`R^2`, :math:`S_E` and confidence intervals whether or not you first mean center the :math:`x` and :math:`y` variables. | |||
#. What is the level of correlation between each of the :math:`x`-variables. Also show a scatterplot matrix to learn what this level of correlation looks like visually. | |||
* Report your correlations as a :math:`3 \times 3` matrix, where there should be 1.0's on the diagonal, and values between :math:`-1` and :math:`+1` on the off-diagonals. | |||
#. Build a linear regression that uses all three :math:`x`-variables to predict :math:`y`. | |||
- Report the slope coefficient and confidence interval for each :math:`x`-variable | |||
- Report the model's standard error. Has it decreased from the model in part 1? | |||
- Report the model's :math:`R^2` value. Has it decreased? | |||
Question 2 [2.5] | |||
================ | |||
In this question we will revisit the `bioreactor yield <http://openmv.net/info/bioreactor-yields>`_ data set and fit a linear model with all :math:`x`-variables to predict the yield. | |||
#. Provide the interpretation for each coefficient in the model, and also comment on its confidence interval when interpreting it. | |||
#. Compare the 3 slope coefficient values you just calculated, to those from the last assignment: | |||
- :math:`\hat{y} = 102.5 - 0.69T`, where :math:`T` is tank temperature | |||
- :math:`\hat{y} = -20.3 + 0.016S`, where :math:`S` is impeller speed | |||
- :math:`\hat{y} = 54.9 - 16.7B`, where :math:`B` is 1 if baffles are present and :math:`B=0` with no baffles | |||
Explain why your coefficients do not match. | |||
#. Are the residuals from the multiple linear regression model normally distributed? | |||
#. In this part we are investigating the variance-covariance matrices used to calculate the linear model. | |||
- First center the :math:`x`-variables and the :math:`y`-variable that you used in the model. | |||
*Note*: feel free to use MATLAB, or any other tool to answer this question. If you are using R, then you will benefit from `this page in the R tutorial <Software_tutorial/Vectors_and_matrices>`_. Also, read the help for the ``model.matrix(...)`` function to get the :math:`\mathbf{X}`-matrix. Then read the help for the ``sweep(...)`` function, or more simply use the ``scale(...)`` function to do the mean-centering. | |||
- Show your calculated :math:`\mathbf{X}^T\mathbf{X}` and :math:`\mathbf{X}^T\mathbf{y}` variance-covariance matrices from the centered data. | |||
- Explain why the interpretation of covariances in :math:`\mathbf{X}^T\mathbf{y}` match the results from the full MLR model you calculated in part 1 of this question. | |||
- Calculate :math:`\mathbf{b} =\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}` and show that it agrees with the estimates that R calculated (even though R fits an intercept term, while your :math:`\mathbf{b}` does not). | |||
#. What would be the predicted yield for an experiment run without baffles, at 4000 rpm impeller speed, run at a reactor temperature of 90 °C? | |||
Question 3 [3] | |||
============== | |||
In this question we will use the `LDPE data <http://openmv.net/info/ldpe>`_ which is data from a high-fidelity simulation of a low-density polyethylene reactor. LDPE reactors are very long, thin tubes. In this particular case the tube is divided in 2 zones, since the feed enters at the start of the tube, and some point further down the tube (start of the second zone). There is a temperature profile along the tube, with a certain maximum temperature somewhere along the length. The maximum temperature in zone 1, ``Tmax1`` is reached some fraction ``z1`` along the length; similarly in zone 2 with the ``Tmax2`` and ``z2`` variables. | |||
We will build a linear model to predict the ``SCB`` variable, the short chain branching (per 1000 carbon atoms) which is an important quality variable for this product. Note that the last 4 rows of data are known to be from abnormal process operation, when the process started to experience a problem. However, we will pretend we didn't know that when building the model, so keep them in for now. | |||
#. Use only the following subset of :math:`x`-variables: ``Tmax1``, ``Tmax2``, ``z1`` and ``z2`` and the :math:`y` variable = ``SCB``. Show the relationship between these 5 variables in a scatter plot matrix. | |||
Use this code to get you started (make sure you understand what it is doing):: | |||
LDPE <- read.csv('http://openmv.net/file/ldpe.csv') | |||
subdata <- data.frame(cbind(LDPE$Tmax1, LDPE$Tmax2, LDPE$z1, LDPE$z2, LDPE$SCB)) | |||
colnames(subdata) <- c("Tmax1", "Tmax2", "z1", "z2", "SCB") | |||
Using bullet points, describe the nature of relationships between the 5 variables, and particularly the relationship to the :math:`y`-variable. | |||
#. Let's start with a linear model between ``z2`` and ``SCB``. We will call this the ``z2`` model. Let's examine its residuals: | |||
- Are the residuals normally distributed? | |||
- What is the standard error of this model? | |||
- Are there any time-based trends in the residuals (the rows in the data are already in time-order)? | |||
- Use any other relevant plots of the predicted values, the residuals, the :math:`x`-variable, as described in class, and diagnose the problem with this linear model. | |||
- What can be done to fix the problem? (You don't need to implement the fix yet). | |||
.. Non-constant error variance | |||
#. Show a plot of the hat-values (leverage) from the ``z2`` model. | |||
- Add suitable horizontal cut-off lines to your hat-value plot. | |||
- Identify on your plot the observations that have large leverage on the model | |||
- Remove the high-leverage outliers and refit the model. Call this the ``z2.updated`` model | |||
- Show the updated hat-values and verify whether the problem has mostly gone away | |||
*Note*: see the R tutorial on how to rebuild a model by removing points | |||
#. Use the ``influenceIndexPlot(...)`` function in the ``car`` library on both the ``z2`` model and the ``z2.updated`` model. Interpret what each plot is showing for the two models. You may ignore the *Bonferroni p-values* subplot. | |||
</rst> |
Latest revision as of 06:15, 18 September 2018
Due date(s): | 10 March 2011 (updated) |
(PDF) | Assignment questions |
<rst> <rst-options: 'toc' = False/> <rst-options: 'reset-figures' = True/>
Assignment objectives
=========
.. rubric:: Assignment objectives
- To become more comfortable using R to fit, interpret and manipulate least squares models.
- The questions in this assignment are typical of the exploratory/learning type questions that will be in the take-home midterm.
Question 1 [1.5]
=====
Use the mature `cheddar cheese data set <http://openmv.net/info/cheddar-cheese>`_ for this question.
- . Choose any :math:`x`-variable, either ``Acetic`` acid concentration (already log-transformed), ``H2S`` concentration (already log-transformed), or ``Lactic`` acid concentration (in original units) and use this to predict the ``Taste`` variable in the data set. The ``Taste`` is a subjective measurement, presumably measured by a panel of tasters.
Prove that you get the same linear model coefficients, :math:`R^2`, :math:`S_E` and confidence intervals whether or not you first mean center the :math:`x` and :math:`y` variables.
- . What is the level of correlation between each of the :math:`x`-variables. Also show a scatterplot matrix to learn what this level of correlation looks like visually.
* Report your correlations as a :math:`3 \times 3` matrix, where there should be 1.0's on the diagonal, and values between :math:`-1` and :math:`+1` on the off-diagonals.
- . Build a linear regression that uses all three :math:`x`-variables to predict :math:`y`.
- Report the slope coefficient and confidence interval for each :math:`x`-variable - Report the model's standard error. Has it decreased from the model in part 1? - Report the model's :math:`R^2` value. Has it decreased?
Question 2 [2.5]
====
In this question we will revisit the `bioreactor yield <http://openmv.net/info/bioreactor-yields>`_ data set and fit a linear model with all :math:`x`-variables to predict the yield.
- . Provide the interpretation for each coefficient in the model, and also comment on its confidence interval when interpreting it.
- . Compare the 3 slope coefficient values you just calculated, to those from the last assignment:
- :math:`\hat{y} = 102.5 - 0.69T`, where :math:`T` is tank temperature - :math:`\hat{y} = -20.3 + 0.016S`, where :math:`S` is impeller speed - :math:`\hat{y} = 54.9 - 16.7B`, where :math:`B` is 1 if baffles are present and :math:`B=0` with no baffles
Explain why your coefficients do not match.
- . Are the residuals from the multiple linear regression model normally distributed?
- . In this part we are investigating the variance-covariance matrices used to calculate the linear model.
- First center the :math:`x`-variables and the :math:`y`-variable that you used in the model.
*Note*: feel free to use MATLAB, or any other tool to answer this question. If you are using R, then you will benefit from `this page in the R tutorial <Software_tutorial/Vectors_and_matrices>`_. Also, read the help for the ``model.matrix(...)`` function to get the :math:`\mathbf{X}`-matrix. Then read the help for the ``sweep(...)`` function, or more simply use the ``scale(...)`` function to do the mean-centering.
- Show your calculated :math:`\mathbf{X}^T\mathbf{X}` and :math:`\mathbf{X}^T\mathbf{y}` variance-covariance matrices from the centered data.
- Explain why the interpretation of covariances in :math:`\mathbf{X}^T\mathbf{y}` match the results from the full MLR model you calculated in part 1 of this question.
- Calculate :math:`\mathbf{b} =\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}` and show that it agrees with the estimates that R calculated (even though R fits an intercept term, while your :math:`\mathbf{b}` does not).
- . What would be the predicted yield for an experiment run without baffles, at 4000 rpm impeller speed, run at a reactor temperature of 90 °C?
Question 3 [3]
==
In this question we will use the `LDPE data <http://openmv.net/info/ldpe>`_ which is data from a high-fidelity simulation of a low-density polyethylene reactor. LDPE reactors are very long, thin tubes. In this particular case the tube is divided in 2 zones, since the feed enters at the start of the tube, and some point further down the tube (start of the second zone). There is a temperature profile along the tube, with a certain maximum temperature somewhere along the length. The maximum temperature in zone 1, ``Tmax1`` is reached some fraction ``z1`` along the length; similarly in zone 2 with the ``Tmax2`` and ``z2`` variables.
We will build a linear model to predict the ``SCB`` variable, the short chain branching (per 1000 carbon atoms) which is an important quality variable for this product. Note that the last 4 rows of data are known to be from abnormal process operation, when the process started to experience a problem. However, we will pretend we didn't know that when building the model, so keep them in for now.
- . Use only the following subset of :math:`x`-variables: ``Tmax1``, ``Tmax2``, ``z1`` and ``z2`` and the :math:`y` variable = ``SCB``. Show the relationship between these 5 variables in a scatter plot matrix.
Use this code to get you started (make sure you understand what it is doing)::
LDPE <- read.csv('http://openmv.net/file/ldpe.csv') subdata <- data.frame(cbind(LDPE$Tmax1, LDPE$Tmax2, LDPE$z1, LDPE$z2, LDPE$SCB)) colnames(subdata) <- c("Tmax1", "Tmax2", "z1", "z2", "SCB")
Using bullet points, describe the nature of relationships between the 5 variables, and particularly the relationship to the :math:`y`-variable.
- . Let's start with a linear model between ``z2`` and ``SCB``. We will call this the ``z2`` model. Let's examine its residuals:
- Are the residuals normally distributed? - What is the standard error of this model? - Are there any time-based trends in the residuals (the rows in the data are already in time-order)? - Use any other relevant plots of the predicted values, the residuals, the :math:`x`-variable, as described in class, and diagnose the problem with this linear model. - What can be done to fix the problem? (You don't need to implement the fix yet).
.. Non-constant error variance
- . Show a plot of the hat-values (leverage) from the ``z2`` model.
- Add suitable horizontal cut-off lines to your hat-value plot. - Identify on your plot the observations that have large leverage on the model - Remove the high-leverage outliers and refit the model. Call this the ``z2.updated`` model - Show the updated hat-values and verify whether the problem has mostly gone away
*Note*: see the R tutorial on how to rebuild a model by removing points
- . Use the ``influenceIndexPlot(...)`` function in the ``car`` library on both the ``z2`` model and the ``z2.updated`` model. Interpret what each plot is showing for the two models. You may ignore the *Bonferroni p-values* subplot.
</rst>