Difference between revisions of "Assignment 6 - 2011"

Revision as of 06:14, 18 September 2018

Due date(s):	10 March 2011 (updated)
(PDF)	Assignment questions

Assignment objectives

=========

.. rubric:: Assignment objectives

To become more comfortable using R to fit, interpret and manipulate least squares models.
The questions in this assignment are typical of the exploratory/learning type questions that will be in the take-home midterm.

Question 1 [1.5]

=====

Use the mature `cheddar cheese data set <http://openmv.net/info/cheddar-cheese>`_ for this question.

. Choose any :math:`x`-variable, either ``Acetic`` acid concentration (already log-transformed), ``H2S`` concentration (already log-transformed), or ``Lactic`` acid concentration (in original units) and use this to predict the ``Taste`` variable in the data set. The ``Taste`` is a subjective measurement, presumably measured by a panel of tasters.

Prove that you get the same linear model coefficients, :math:`R^2`, :math:`S_E` and confidence intervals whether or not you first mean center the :math:`x` and :math:`y` variables.

. What is the level of correlation between each of the :math:`x`-variables. Also show a scatterplot matrix to learn what this level of correlation looks like visually.

* Report your correlations as a :math:`3 \times 3` matrix, where there should be 1.0's on the diagonal, and values between :math:`-1` and :math:`+1` on the off-diagonals.

. Build a linear regression that uses all three :math:`x`-variables to predict :math:`y`.

- Report the slope coefficient and confidence interval for each :math:`x`-variable - Report the model's standard error. Has it decreased from the model in part 1? - Report the model's :math:`R^2` value. Has it decreased?

Question 2 [2.5]

====

In this question we will revisit the `bioreactor yield <http://openmv.net/info/bioreactor-yields>`_ data set and fit a linear model with all :math:`x`-variables to predict the yield.

. Provide the interpretation for each coefficient in the model, and also comment on its confidence interval when interpreting it.

. Compare the 3 slope coefficient values you just calculated, to those from the last assignment:

- :math:`\hat{y} = 102.5 - 0.69T`, where :math:`T` is tank temperature - :math:`\hat{y} = -20.3 + 0.016S`, where :math:`S` is impeller speed - :math:`\hat{y} = 54.9 - 16.7B`, where :math:`B` is 1 if baffles are present and :math:`B=0` with no baffles

Explain why your coefficients do not match.

. Are the residuals from the multiple linear regression model normally distributed?

. In this part we are investigating the variance-covariance matrices used to calculate the linear model.

- First center the :math:`x`-variables and the :math:`y`-variable that you used in the model.

*Note*: feel free to use MATLAB, or any other tool to answer this question. If you are using R, then you will benefit from `this page in the R tutorial <http://connectmv.com/tutorials/r-tutorial/vectors-and-matrices/#matrix-operations>`_. Also, read the help for the ``model.matrix(...)`` function to get the :math:`\mathbf{X}`-matrix. Then read the help for the ``sweep(...)`` function, or more simply use the ``scale(...)`` function to do the mean-centering.

- Show your calculated :math:`\mathbf{X}^T\mathbf{X}` and :math:`\mathbf{X}^T\mathbf{y}` variance-covariance matrices from the centered data.

- Explain why the interpretation of covariances in :math:`\mathbf{X}^T\mathbf{y}` match the results from the full MLR model you calculated in part 1 of this question.

- Calculate :math:`\mathbf{b} =\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}` and show that it agrees with the estimates that R calculated (even though R fits an intercept term, while your :math:`\mathbf{b}` does not).

. What would be the predicted yield for an experiment run without baffles, at 4000 rpm impeller speed, run at a reactor temperature of 90 °C?

Question 3 [3]

==

In this question we will use the `LDPE data <http://openmv.net/info/ldpe>`_ which is data from a high-fidelity simulation of a low-density polyethylene reactor. LDPE reactors are very long, thin tubes. In this particular case the tube is divided in 2 zones, since the feed enters at the start of the tube, and some point further down the tube (start of the second zone). There is a temperature profile along the tube, with a certain maximum temperature somewhere along the length. The maximum temperature in zone 1, ``Tmax1`` is reached some fraction ``z1`` along the length; similarly in zone 2 with the ``Tmax2`` and ``z2`` variables.

We will build a linear model to predict the ``SCB`` variable, the short chain branching (per 1000 carbon atoms) which is an important quality variable for this product. Note that the last 4 rows of data are known to be from abnormal process operation, when the process started to experience a problem. However, we will pretend we didn't know that when building the model, so keep them in for now.

. Use only the following subset of :math:`x`-variables: ``Tmax1``, ``Tmax2``, ``z1`` and ``z2`` and the :math:`y` variable = ``SCB``. Show the relationship between these 5 variables in a scatter plot matrix.

Use this code to get you started (make sure you understand what it is doing)::

LDPE <- read.csv('http://openmv.net/file/ldpe.csv') subdata <- data.frame(cbind(LDPE$Tmax1, LDPE$Tmax2, LDPE$z1, LDPE$z2, LDPE$SCB)) colnames(subdata) <- c("Tmax1", "Tmax2", "z1", "z2", "SCB")

Using bullet points, describe the nature of relationships between the 5 variables, and particularly the relationship to the :math:`y`-variable.

. Let's start with a linear model between ``z2`` and ``SCB``. We will call this the ``z2`` model. Let's examine its residuals:

- Are the residuals normally distributed? - What is the standard error of this model? - Are there any time-based trends in the residuals (the rows in the data are already in time-order)? - Use any other relevant plots of the predicted values, the residuals, the :math:`x`-variable, as described in class, and diagnose the problem with this linear model. - What can be done to fix the problem? (You don't need to implement the fix yet).

.. Non-constant error variance

. Show a plot of the hat-values (leverage) from the ``z2`` model.

- Add suitable horizontal cut-off lines to your hat-value plot. - Identify on your plot the observations that have large leverage on the model - Remove the high-leverage outliers and refit the model. Call this the ``z2.updated`` model - Show the updated hat-values and verify whether the problem has mostly gone away

*Note*: see the R tutorial on how to rebuild a model by removing points

. Use the ``influenceIndexPlot(...)`` function in the ``car`` library on both the ``z2`` model and the ``z2.updated`` model. Interpret what each plot is showing for the two models. You may ignore the *Bonferroni p-values* subplot.

</rst>

Difference between revisions of "Assignment 6 - 2011"

Revision as of 06:14, 18 September 2018

Contents

=========

=====

====

==

Navigation menu

@@ Line 1: / Line 1: @@
-To come
+{{OtherSidebar
+| due_dates = 10 March 2011 ('''''updated''''')
+| dates_alt_text = Due date(s)
+| questions_PDF = Assignment-6-2011.pdf
+| questions_text_alt = Assignment questions
+| questions_link =
+| solutions_PDF =
+| solutions_text_alt = Assignment solutions
+| solutions_link =
+| other_instructions =
+}}
+<rst>
+<rst-options: 'toc' = False/>
+<rst-options: 'reset-figures' = True/>
+Assignment objectives
+=====================
+.. rubric:: Assignment objectives
+*	To become more comfortable using R to fit, interpret and manipulate least squares models.
+*	The questions in this assignment are typical of the exploratory/learning type questions that will be in the take-home midterm.
+Question 1 [1.5]
+=================
+Use the mature `cheddar cheese data set <http://openmv.net/info/cheddar-cheese>`_ for this question.
+#.	Choose any :math:`x`-variable, either ``Acetic`` acid concentration (already log-transformed), ``H2S`` concentration  (already log-transformed), or ``Lactic`` acid concentration (in original units) and use this to predict the ``Taste`` variable in the data set.  The ``Taste`` is a subjective measurement, presumably measured by a panel of tasters.
+	Prove that you get the same linear model coefficients, :math:`R^2`, :math:`S_E` and confidence intervals whether or not you first mean center the :math:`x` and :math:`y` variables.
+#.	What is the level of correlation between each of the :math:`x`-variables.  Also show a scatterplot matrix to learn what this level of correlation looks like visually.
+	*	Report your correlations as a :math:`3 \times 3` matrix, where there should be 1.0's on the diagonal, and values between :math:`-1` and :math:`+1` on the off-diagonals.
+#.	Build a linear regression that uses all three :math:`x`-variables to predict :math:`y`.
+	-	Report the slope coefficient and confidence interval for each :math:`x`-variable
+	-	Report the model's standard error.  Has it decreased from the model in part 1?
+	-	Report the model's :math:`R^2` value.  Has it decreased?
+Question 2 [2.5]
+================
+In this question we will revisit the `bioreactor yield <http://openmv.net/info/bioreactor-yields>`_ data set and fit a linear model with all :math:`x`-variables to predict the yield.
+#.	Provide the interpretation for each coefficient in the model, and also comment on its confidence interval when interpreting it.
+#.	Compare the 3 slope coefficient values you just calculated, to those from the last assignment:
+	-	:math:`\hat{y} = 102.5 - 0.69T`, where :math:`T` is tank temperature
+	-	:math:`\hat{y} = -20.3 + 0.016S`, where :math:`S` is impeller speed
+	-	:math:`\hat{y} = 54.9 - 16.7B`, where :math:`B` is 1 if baffles are present and :math:`B=0` with no baffles
+	Explain why your coefficients do not match.
+#.	Are the residuals from the multiple linear regression model normally distributed?
+#.	In this part we are investigating the variance-covariance matrices used to calculate the linear model.
+	-	First center the :math:`x`-variables and the :math:`y`-variable that you used in the model.
+		*Note*: feel free to use MATLAB, or any other tool to answer this question.  If you are using R, then you will benefit from `this page in the R tutorial <http://connectmv.com/tutorials/r-tutorial/vectors-and-matrices/#matrix-operations>`_.  Also, read the help for the ``model.matrix(...)`` function to get the :math:`\mathbf{X}`-matrix.  Then read the help for the ``sweep(...)`` function, or more simply use the ``scale(...)`` function to do the mean-centering.
+	-	Show your calculated :math:`\mathbf{X}^T\mathbf{X}` and :math:`\mathbf{X}^T\mathbf{y}` variance-covariance matrices from the centered data.
+	-	Explain why the interpretation of covariances in :math:`\mathbf{X}^T\mathbf{y}` match the results from the full MLR model you calculated in part 1 of this question.
+	-	Calculate :math:`\mathbf{b} =\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}` and show that it agrees with the estimates that R calculated (even though R fits an intercept term, while your :math:`\mathbf{b}` does not).
+#.	What would be the predicted yield for an experiment run without baffles, at 4000 rpm impeller speed, run at a reactor temperature of 90 °C?
+Question 3 [3]
+==============
+In this question we will use the `LDPE data <http://openmv.net/info/ldpe>`_ which is data from a high-fidelity simulation of a low-density polyethylene reactor.  LDPE reactors are very long, thin tubes.  In this particular case the tube is divided in 2 zones, since the feed enters at the start of the tube, and some point further down the tube (start of the second zone). There is a temperature profile along the tube, with a certain maximum temperature somewhere along the length.  The maximum temperature in zone 1, ``Tmax1`` is reached some fraction ``z1`` along the length; similarly in zone 2 with the ``Tmax2`` and ``z2`` variables.
+We will build a linear model to predict the ``SCB`` variable, the short chain branching (per 1000 carbon atoms) which is an important quality variable for this product.  Note that the last 4 rows of data are known to be from abnormal process operation, when the process started to experience a problem. However, we will pretend we didn't know that when building the model, so keep them in for now.
+#.	Use only the following subset of :math:`x`-variables: ``Tmax1``, ``Tmax2``, ``z1`` and ``z2`` and the :math:`y` variable = ``SCB``.  Show the relationship between these 5 variables in a scatter plot matrix.
+	Use this code to get you started (make sure you understand what it is doing)::
+		LDPE <- read.csv('http://openmv.net/file/ldpe.csv')
+		subdata <- data.frame(cbind(LDPE$Tmax1, LDPE$Tmax2, LDPE$z1, LDPE$z2, LDPE$SCB))
+		colnames(subdata) <- c("Tmax1", "Tmax2", "z1", "z2", "SCB")
+	Using bullet points, describe the nature of relationships between the 5 variables, and particularly the relationship to the :math:`y`-variable.
+#.	Let's start with a linear model between ``z2`` and ``SCB``.  We will call this the ``z2`` model.  Let's examine its residuals:
+	-	Are the residuals normally distributed?
+	-	What is the standard error of this model?
+	-	Are there any time-based trends in the residuals (the rows in the data are already in time-order)?
+	-	Use any other relevant plots of the predicted values, the residuals, the :math:`x`-variable, as described in class, and diagnose the problem with this linear model.
+	-	What can be done to fix the problem? (You don't need to implement the fix yet).
+	..	Non-constant error variance
+#.	Show a plot of the hat-values (leverage) from the ``z2`` model.
+	-	Add suitable horizontal cut-off lines to your hat-value plot.
+	-	Identify on your plot the observations that have large leverage on the model
+	-	Remove the high-leverage outliers and refit the model. Call this the ``z2.updated`` model
+	-	Show the updated hat-values and verify whether the problem has mostly gone away
+	*Note*: see the R tutorial on how to rebuild a model by removing points
+#.	Use the ``influenceIndexPlot(...)`` function in the ``car`` library on both the ``z2`` model and the ``z2.updated`` model.  Interpret what each plot is showing for the two models.  You may ignore the *Bonferroni p-values*  subplot.
+</rst>

Difference between revisions of "Assignment 6 - 2011"

Revision as of 06:14, 18 September 2018

=========

=====

====

==

Navigation menu

Search