Assignment 5 - 2011 - Solution
Due date: 16 February 2011
Assignment questions (PDF)
Assignment solutions (PDF)
Assignment objectives
Assignment objective: explore and understand least squares models a bit more.
Brief solutions (i.e. using bullet points and short explanations) are provided for questions 1 and 2, but they are complete. You may answer questions in the midterm and final exam at this level.
Question 1 [2]
Use the bioreactor data, which shows the percentage yield from the reactor for various experiments in which the temperature, impeller speed, and presence/absence of baffles were adjusted.
Build a linear model that uses the reactor temperature to predict the yield. Interpret the slope and intercept term.
Build a linear model that uses the impeller speed to predict yield. Interpret the slope and intercept term.
Build a linear model that uses the presence (represent it as 1) or absence (represent it as 0) of baffles to predict yield. Interpret the slope and intercept term.
Note: if you use R it will automatically convert the baffles variable to 1's and 0's for you. If you want to make the conversion yourself, to verify what R does behind the scenes, try this:

# Read in the data frame
bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv')

# Force the baffles variable to 0's and 1's
bio$baffles <- as.numeric(bio$baffles) - 1
Which variable(s) would you change to boost the batch yield, at the lowest cost of implementation?
Use the plot(bio) function in R, where bio is the data frame you loaded using the read.csv(...) function. R notices that bio is not a single variable, but a group of variables, i.e. a data frame, so it plots what is called a scatterplot matrix instead. Describe how the scatterplot matrix agrees with your interpretation of the slopes in parts 1, 2 and 3 of this question.
Solution
The R code (below) was used to answer all questions.
- The model is: \(\hat{y} = 102.5 - 0.69T\), where \(T\) is tank temperature.
- Intercept = \(102.5\) % points is the yield when operating at 0 \(^\circ \text{C}\). Obviously not a useful interpretation, because data have not been collected in a range that spans, or is even close to 0 \(^\circ \text{C}\). It is likely that this bioreactor system won't yield any product under such cold conditions. Further, a yield greater than 100% is not realizable.
- Slope = -0.69 \(\frac{[\%]}{[^\circ \text{C}]}\), indicating the yield decreases, on average, by about 0.7 units for every degree increase in tank temperature.
- The model is: \(\hat{y} = -20.3 + 0.016S\), where \(S\) is impeller speed.
- Intercept = \(-20.3\) % points is the yield when operating with no agitation. Again, obviously not a useful interpretation, because the data have not been collected under these conditions, and yield can't be a negative quantity.
- Slope = 0.016 \(\frac{[\%]}{[\text{RPM}]}\), indicating the yield increases, on average, by about 1.6 percentage points per 100 RPM increase.
- The model is: \(\hat{y} = 54.9 - 16.7B\), where \(B\) is 1 if baffles are present and \(B=0\) with no baffles.
- Intercept = \(54.9\) % points yield is the yield when operating with no baffles (it is in fact the average yield of all the rows that have "No" as their baffle value).
- Slope = -16.7 %, indicating the presence of baffles decreases the yield, on average, by about 16.7 percentage points.
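The intercept and slope interpretations for the baffles model hold for any 0/1 indicator variable: the intercept is the mean of the \(B=0\) group and the slope is the difference between the two group means. A minimal sketch in Python with made-up yield values (not the actual bioreactor data):

```python
import numpy as np

# Hypothetical yields: four runs without baffles (B = 0), four with (B = 1)
B = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([55.0, 52.0, 58.0, 51.0, 40.0, 37.0, 36.0, 39.0])

# Least squares fit: y-hat = b0 + b1*B
b1, b0 = np.polyfit(B, y, 1)

# b0 equals the mean yield of the B = 0 group (54.0)
# b1 equals the difference in group means (38.0 - 54.0 = -16.0)
print(b0, b1)
```

This is the same arithmetic R performs behind the scenes after converting the baffles factor to 0's and 1's.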
This is of course open-ended and case-specific. Some factors you would include are:
- Remove the baffles, but take into account the cost of doing so. Perhaps it takes a long time (expense) to remove them, especially if the reactor is used to produce other products that do require the baffles.
- Operate at lower temperatures. The energy costs of cooling the reactor would factor into this.
- Operate at higher speeds and take that cost into account. Notice however there is one observation at 4900 RPM that seems unusual: was that due to the presence of baffles, or due to temperature in that run? We'll look into this issue with multiple linear regression later on.
Note
Please note that our calculations above are not the true effect of each of the variables (temperature, speed and baffles) on yield. Our calculations assume that there is no interaction between temperature, speed and baffles, and that each effect operates independently of the others. That's not necessarily true. After the midterm break we will learn how to "control for the effects" of other variables and build a single model with all variables.
The scatterplot matrix, shown below, agrees with our interpretation. This is an information rich visualization that gives us a feel for the multivariate relationships and really summarizes all the variables well (especially the last row of plots).
- The yield-temperature relationship is negative, as expected.
- The yield-speed relationship is positive, as expected.
- The yield-baffles relationship is negative, as expected.
- We can't tell anything about the yield-duration relationship, as it doesn't vary in the data we have (there could/should be a relationship, but we can't tell).
bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv')
summary(bio)
# Temperature-Yield model
model.temp <- lm(bio$yield ~ bio$temperature)
summary(model.temp)
# Impeller speed-Yield model
model.speed <- lm(bio$yield ~ bio$speed)
summary(model.speed)
# Baffles-Yield model
model.baffles <- lm(bio$yield ~ bio$baffles)
summary(model.baffles)
# Scatterplot matrix
bitmap('../images/bioreactor-scatterplot-matrix-assignment.png', type="png256",
width=10, height=10, res=300)
plot(bio)
dev.off()
Question 2 [2]
Use the gas furnace data from the website to answer these questions. The data represent the gas flow rate (centered) from a process and the corresponding CO_{2} measurement.
- Make a scatter plot of the data to visualize the relationship between the variables. How would you characterize the relationship?
- Calculate the variance for both variables, the covariance between the two variables, and the correlation between them, \(r(x,y)\). Interpret the correlation value; i.e. do you consider this a strong correlation?
- Now calculate a least squares model relating the gas flow rate as the \(x\) variable to the CO_{2} measurement as the \(y\)-variable. Report the intercept and slope from this model.
- Report the \(R^2\) from the regression model. Compare the squared value of \(r(x,y)\) to \(R^2\). What do you notice? Now reinterpret what the correlation value means (i.e. compare this interpretation to your answer in part 2).
- 600-level: Switch \(x\) and \(y\) around and rebuild your least squares model. Compare the new \(R^2\) to the previous model's \(R^2\). Is this result surprising? How do you interpret this?
Solution
Relationship: the data are negatively correlated.
I've chosen to use the sp (or scatterplot) function from the car library. It shows the scatterplot smoother (a.k.a. loess line) as solid red, the spread around the smoother (dashed red), the least squares regression line (black) and boxplots for each axis. This is a great example of an information-rich visualization: packing the maximum amount of information into a small space. This plot answers so many questions we might have about the data.
The cov(...) command supplies the variance and covariance, and the cor(...) command gives the correlation.

- Variance of input gas flow rate = 1.15 [gas flow units] \(^2\)
- Variance of CO_{2} = 10.3 [CO_{2} units] \(^2\)
- Covariance between input gas flow and CO_{2} = -1.66 [gas flow units][CO_{2} units]
- Correlation = -0.48, i.e. around -0.5.
From my experience with data, I personally would interpret this as a reasonably strong correlation. There is reasonably strong linear behaviour in the data cloud shown above, enough of a relationship to confidently say that "the CO_{2} output does decrease at higher gas flow rates".
From the R model output:
- intercept is 53.4 units of CO_{2}
- slope is \(-1.44\) \(\frac{[\text{units of CO}_2]}{[\text{units of gas flow}]}\), i.e. the covariance divided by the gas flow variance, \(-1.66/1.15 = -1.44\), consistent with the negative correlation
- From the R model output: \(R^2 = 0.2347\)
- From earlier, the squared correlation is \((-0.484)^2 = 0.2347\), the same value.
- Correlation can be interpreted as the square root of the \(R^2\) value when regressing \(y\) on \(x\) (i.e. fitting a linear model to \(y\) using \(x\) as the input); the sign of the correlation matches the sign of the slope.
- Most novices would be misled and consider an \(R^2\) value of 0.23 quite low. But notice that there is a repeatable and consistent negative linear relationship between \(x\) and \(y\) in this data.
This shows the interesting result that when regressing \(x\) on \(y\) (instead of the usual regression of \(y\) on \(x\)), that we get the same \(R^2\) value. Note however that the intercept and slope are different between the two regressions.
This also calls into question the interpretation of the \(R^2\) value in regression. \(R^2\) is just the square of the correlation coefficient. Recall from class the slide on the Wikipedia examples of correlation: there were examples where \(r(x,y) = \sqrt{R^2}\) was zero, yet a strong (nonlinear) relationship still existed in the data. So we should interpret \(R^2\) as a measure only of the linear relationship between two variables. And bear its quadratic nature in mind: interpreting the correlation is actually easier, and more "linear", in that a 0.2 improvement in correlation means the same thing when going from \(r=0.2\) to 0.4 as it does when going from \(r=0.7\) to 0.9 (not so for \(R^2\)).
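Both facts, that \(R^2\) equals the squared correlation and that swapping \(x\) and \(y\) leaves \(R^2\) unchanged, can be verified on any small dataset. A sketch in Python with arbitrary made-up numbers (not the gas furnace data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 1.8, 1.4, 1.5, 0.9, 0.7])

def r_squared(x, y):
    """R^2 from the least squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    return 1.0 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

r = np.corrcoef(x, y)[0, 1]
print(r_squared(x, y))   # equals r**2
print(r_squared(y, x))   # same value: R^2 is symmetric in x and y
```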
gas <- read.csv('http://openmv.net/file/gas-furnace.csv')
summary(gas)
bitmap('../images/CO2-gas-rate-raw-data-assignment5.png', type="png256",
width=6, height=6, res=300, pointsize=14)
library(car)
# Use the "sp" (scatterplot) function from the "car" library
sp(gas$InputGasRate, gas$CO2, xlab="Gas flow rate", ylab="CO2",
main="Scatterplot with smoother, spread, and L/S line")
dev.off()
# (Co)variance and correlation
cov(gas)
cor(gas)
# Linear model:
model <- lm(gas$CO2 ~ gas$InputGasRate)
summary(model)
# ANOVA values
y.mean <- mean(gas$CO2)
RegSS <- sum((predict(model) - y.mean)^2)
RSS <- sum(residuals(model)^2)
TSS <- sum((gas$CO2 - y.mean)^2)
mean.square.residual <- RSS / model$df.residual
# Test normality of residuals
bitmap('../images/residuals-gas-furnace-question-assignment.png', type="png256",
width=6, height=6, res=300, pointsize=14)
library(car)
qqPlot(model) # the qqPlot "knows" what to do with a model object
dev.off()
Question 3 [1.5]
A new type of thermocouple is being investigated by your group. These devices produce an almost linear voltage (millivolt) response at different temperatures. In practice though it is used the other way around: use the millivolt reading to predict the temperature. The process of fitting this linear model is called calibration.
Use the following data to calibrate a linear model:
Temperature [K]: 273  293  313  333  353  373  393  413  433  453
Reading [mV]:   0.01 0.12 0.24 0.38 0.51 0.67 0.84 1.01 1.15 1.31

Show the linear model and provide the predicted temperature when reading 1.00 mV.
Are you satisfied with this model, based on the coefficient of determination (\(R^2\)) value?
What is the model's standard error? Now, are you satisfied with the model's prediction ability, given that temperatures can usually be recorded to an accuracy of \(\pm 0.5\) K with most inexpensive thermocouples?
What is your (revised) conclusion now about the usefulness of the \(R^2\) value?
Note: This example explains why I don't use the terminology of independent and dependent variables in this course. Here the temperature truly is the independent variable, because it causes the voltage difference that we measure. But the voltage reading is the independent variable in the least squares model. The word independent is being used in two different senses (its English meaning vs its mathematical meaning), and this can be misleading.
Solution
The linear model is used to predict temperature given the reading in millivolts. The reason is that in modelling, in general, we specify as \(x\) the variable(s) we always have available, while \(y\) is the variable we would like to predict from the \(x\).
The model has the form: \(T = b_0 + b_1V\), where \(T\) is temperature and \(V\) is the voltage reading. Coefficients in the linear model are:
\[T = 278.6 + 135.3 V\]

This implies that recording an increase of 0.1 mV means, on average, the temperature has increased by about 13.5 K in the system.
The temperature prediction at 1.00 mV would be 413.9 K.
The following code was used to fit the model and draw the plot.
MATLAB code:

x = [0.01, 0.12, 0.24, 0.38, 0.51, 0.67, 0.84, 1.01, 1.15, 1.31];
y = [273, 293, 313, 333, 353, 373, 393, 413, 433, 453];
y = y(:);
n = numel(x);                     % the number of observations
X = [ones(n,1) x(:)];
a = inv(X'*X)*X'*y;               % Contains a_0=a(1) and a_1=a(2)
%a = regress(y, X);               % only if you have the Statistics Toolbox

% Additional calculations
resids = y - X*a;                 % resids = e = y - Xa
RSS = resids' * resids;           % residual sum of squares
TSS = sum((y - mean(y)).^2);      % total sum of squares
R2 = 1 - RSS/TSS;
std_error = sqrt(RSS/(n-numel(a)));

% Plot the data along with the fitted line:
plot(x, y, 'o')
hold('on')
plot(x, X*a, 'r')
grid('on')
xlabel('Voltage [mV]')
ylabel('Temperature [K]')
legend({'Original data', 'Fitted line'}, 'Location', 'Best')
text(0.8, 325, sprintf('Standard error = %0.1f K', std_error))
plot(x, X*a + 2*std_error, 'r--')
plot(x, X*a - 2*std_error, 'r--')
print('-dpng', '../images/voltage-linear-model-MATLAB.png')
Python code:

import numpy as np
from matplotlib.pyplot import *

x = np.array([0.01, 0.12, 0.24, 0.38, 0.51, 0.67, 0.84, 1.01, 1.15, 1.31])
y = np.array([273, 293, 313, 333, 353, 373, 393, 413, 433, 453])
n = np.max(x.shape)               # the number of observations
X = np.vstack([np.ones(n), x]).T

# Simpler, and more accurate way:
a = np.linalg.lstsq(X, y)[0]

# Additional calculations
resids = y - np.dot(X, a)         # e = y - Xa
RSS = sum(resids**2)              # residual sum of squares
TSS = sum((y - np.mean(y))**2)    # total sum of squares
R2 = 1 - RSS/TSS
std_error = np.sqrt(RSS/(n - len(a)))

# Plot the data along with the fitted line:
fig = figure()
plot(x, y, 'o', label='Original data', markersize=10)
plot(x, np.dot(X, a), 'r', label='Fitted line')
grid('on')
xlabel('Voltage [mV]')
ylabel('Temperature [K]')
legend(loc=0)
text(0.8, 325, 'Standard error = %0.1f K' % std_error)
plot(x, np.dot(X, a) + 2*std_error, 'r--')
plot(x, np.dot(X, a) - 2*std_error, 'r--')
fig.savefig('../images/voltage-linear-model.png')
If you used R to fit the model, you would have written something like this:

> V <- c(0.01, 0.12, 0.24, 0.38, 0.51, 0.67, 0.84, 1.01, 1.15, 1.31)
> T <- c(273, 293, 313, 333, 353, 373, 393, 413, 433, 453)
> model <- lm(T ~ V)
> summary(model)

Call:
lm(formula = T ~ V)

Residuals:
    Min      1Q  Median      3Q     Max
-6.9272 -2.1212 -0.1954  2.7480  5.4239

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  278.574      2.204  126.39 1.72e-14 ***
V            135.298      2.922   46.30 5.23e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.916 on 8 degrees of freedom
Multiple R-squared: 0.9963,  Adjusted R-squared: 0.9958
F-statistic:  2144 on 1 and 8 DF,  p-value: 5.229e-11
The \(R^2\) value from this linear fit is \(R^2 = 0.996\), which being so close to 1.0, implies the linear relationship in the data is strong (the linear model fits the data very well) - that's all.
One cannot be satisfied with only an \(R^2\) value: it has nothing to do with whether the model's prediction accuracy is any good. So we can't tell anything from this number.
The model's standard error is 3.9 K. If we assume the prediction error is normally distributed around the linear fit, this corresponds to one standard deviation. So 95% of our prediction error lies roughly within a range of \(\pm 2\times 3.92\) or \(\pm 7.8\) K. These are the dashed red lines drawn on the figure. (Please note: the true error intervals are not parallel to the regression line, they are curved; however the \(\pm 2S_E\) limits are a good-enough approximation for most engineering applications.)
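The \(\pm 2 S_E\) rule used above comes from the normal distribution: about 95% of normally distributed errors fall within two standard deviations of zero. A quick simulation sketch (the seed and sample size are arbitrary choices, not from the original solution):

```python
import numpy as np

SE = 3.9                                  # standard error of the model, in K
rng = np.random.default_rng(0)
errors = rng.normal(0.0, SE, size=100_000)

# Fraction of simulated prediction errors inside the +/- 2*SE band
coverage = np.mean(np.abs(errors) <= 2 * SE)
print(coverage)                           # close to 0.95
```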
This prediction ability of \(\pm 8\) K is probably not satisfying for most engineering applications, since we can predict temperatures far more accurately, over the range from 273K to 453K, using off-the-shelf commercial thermocouples.
The purpose of this question is mainly to point out the misleading nature of \(R^2\): this value looks really good, 99.6%, yet the actual purpose of the model, the ability to predict temperature from the millivolt reading, has no relationship at all to this \(R^2\) value.
Question 4 [1]
Use the linear model you derived in Question 2, where you used the gas flow rate to predict the CO_{2} measurement, and construct the analysis of variance table (ANOVA) for the dataset. Use your ANOVA table to reproduce the residual standard error, \(S_E\) value, that you get from the R software output.
Go through the R tutorial to learn how to efficiently obtain the residuals and predicted values from a linear model object.
Also for linear model in Question 2, verify whether the residuals are normally distributed.
Use the linear model you derived in Question 3, where you used the voltage measurement to predict the temperature, and construct the analysis of variance table (ANOVA) for that dataset. Use your ANOVA table to reproduce the residual standard error, \(S_E\) value, that you get from the R software output.
Solution
The ANOVA table values were calculated in the code solutions for question 2:
Type of variance | Distance | Degrees of freedom | SSQ | Mean square
Regression | \(\hat{y}_i - \overline{\mathrm{y}}\) | \(k\) | 709.9 | 354.9
Error | \(y_i - \hat{y}_i\) | \(n-k\) | 2314.9 | 7.87
Total | \(y_i - \overline{\mathrm{y}}\) | \(n\) | 3024.8 | 10.2

The residual standard error, or just standard error, \(S_E = \sqrt{\frac{2314.9}{296-2}} = 2.8\) %CO_{2}, which agrees with the value from R.
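A quick arithmetic check of the table (with \(n = 296\) observations and \(k = 2\) model parameters): the regression and error sums of squares add to the total, their ratio reproduces the \(R^2\) from Question 2, and the error mean square gives \(S_E\). In Python:

```python
import math

RegSS, RSS, TSS = 709.9, 2314.9, 3024.8   # sums of squares from the table
n, k = 296, 2                             # observations, model parameters

print(RegSS + RSS)                # equals TSS = 3024.8
print(RegSS / TSS)                # R^2 = 0.2347, as in Question 2
print(math.sqrt(RSS / (n - k)))   # S_E = 2.8 %CO2
```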
These residuals are normally distributed, as verified in the following qq plot:
As mentioned in the help(qqPlot) output, the dashed red line is the confidence envelope at the 95% level. The single point just outside the confidence envelope is not going to have any practical effect on our assumption of normality. We expect 1 point in 20 to lie outside the limits. We will discuss the meaning of "studentized residuals", on the \(y\)-axis, after the midterm.
For the question 3 data set:
Type of variance | Distance | Degrees of freedom | SSQ | Mean square
Regression | \(\hat{y}_i - \overline{\mathrm{y}}\) | \(k\) | 32877 | 16438
Error | \(y_i - \hat{y}_i\) | \(n-k\) | 122.7 | 15.3
Total | \(y_i - \overline{\mathrm{y}}\) | \(n\) | 33000 | 3300

The residual standard error, or just standard error, \(S_E = \sqrt{\frac{122.7}{10-2}} = 3.9\) K, which agrees with the value from R.
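Since the thermocouple data are listed in Question 3, this table can be reproduced from the raw values. A sketch in Python:

```python
import numpy as np

V = np.array([0.01, 0.12, 0.24, 0.38, 0.51, 0.67, 0.84, 1.01, 1.15, 1.31])
T = np.array([273, 293, 313, 333, 353, 373, 393, 413, 433, 453], dtype=float)
n, k = len(T), 2

slope, intercept = np.polyfit(V, T, 1)    # T-hat = 278.6 + 135.3*V
T_hat = intercept + slope * V

RegSS = np.sum((T_hat - T.mean())**2)     # approx. 32877
RSS = np.sum((T - T_hat)**2)              # approx. 122.7
TSS = np.sum((T - T.mean())**2)           # 33000
SE = np.sqrt(RSS / (n - k))               # approx. 3.9 K
print(RegSS, RSS, TSS, SE)
```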