Learning outcomes
- Understand the difference between correlation and covariance.
- What the objective function of least squares does
- Understand and use an analysis of variance table
- Calculate and interpret the confidence intervals from a least squares model
- Know about the assumptions required to interpret least squares model coefficients
- Use the prediction error range from the model
- Identify outlier points and classify them
- Use the linear model when there are multiple predictor variables (this is what we are building up towards; we will use this extensively in the next topic)
Extended readings/practice
- Run the code below to see how to build and use a linear model in R, but see step 16 and onwards in the R tutorial as well.
- Try some practice problems.
- Describe a linear regression model you have made for a lab report.
- What was the \(R^2\) value?
- How did you calculate the regression model values?
- Use the same data from your report and instead calculate the standard error, \(S_E\). How do you interpret that \(S_E\) value now?
- Do the YouTube challenge: find a video on YouTube that explains the central limit theorem, or the confidence interval, or least squares in a way that is different to explained in class (hopefully you find better explanations than mine). Share the link with a friend in your class.
- Does eating chocolate lead to winning a Nobel prize?
Resources
Class videos from prior years
Videos from 2015
- Overview of this module [02:59]
- Covariance and correlation [09:46]
- Why least squares, and some other alternatives [covered in class]
- Some of the math behind the LS model [09:44]
- Understanding the analysis of variance (ANOVA) [11:58]
- Interpretation of standard error [covered in class]
- Assumptions to derive LS confidence intervals [05:45]
- Confidence intervals interpreted and used in 3 LS examples [11:54]
- Prediction intervals from the least squares model [04:24]
- Checking for violations of the least squares assumptions (1 of 2) [07:27]
- Checking for violations of the least squares assumptions (2 of 2) [11:46]
- Introducing multiple linear regression - why we need to use it [2:59]
- MLR - the matrix equation form and an example [11:25]
- Interpreting the MLR model coefficients and confidence intervals [04:58]
- Integer variables in the multiple linear regression model [09:51]
|
Covered in class | No video | Script
|
|
Covered in class | No video | Script
|
Videos from 2014
See the webpage from 2014
Videos from 2013
See the webpage from 2013
Software codes for this section
Code to show how to build and plot a least squares model in R
Try this code in a web-browser
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
plot(x,y)
model.ls <- lm(y ~ x)
summary(model.ls)
coef(model.ls)
confint(model.ls)
names(model.ls)
model.ls$resisduals
resid(model.ls)
plot(x, y)
abline(model.ls)
Thermocouple example
- Compare the \(R^2\) and \(S_E\) values. What does each of those values tell you about the model?
- How do you use a least squares model in R to make a prediction on new data?
- Check that the linear regression assumptions are satisfied.
- Try this code in a web-browser
V <- c(0.01, 0.12, 0.24, 0.38, 0.51, 0.67, 0.84, 1.01, 1.15, 1.31)
T <- c(273, 293, 313, 333, 353, 373, 393, 413, 433, 453)
plot(V, T)
model <- lm(T ~ V)
summary(model)
coef(model)
confint(model) # get the coefficient confidence intervals
resid(model) # model residuals
library(car)
qqPlot(resid(model)) # q-q plot of the residuals to check normality
plot(V, T)
v.new <- seq(0, 1.5, 0.1)
t.pred <- coef(model)[1] + coef(model)[2] * v.new
lines(v.new, t.pred, type="l", col="blue")
# Plot x against the residuals to check for non-linearity
plot(V, resid(model))
abline(h=0)
# Plot the raw data and the regression line in red
plot(V, T)
abline(model, col="red")
Build the linear regression in R for video 4
Taste <- c(12.3, 20.9, 25.9, 37.3, 5.5, 16.8, 38.9, 54.9, 57.2, 6.4)
H2S <- c(3.14, 5.04, 7.60, 8.72, 4.79, 3.66, 9.06, 6.75, 7.91, 4.70)
# Always plot your data first
plot(H2S, Taste)
# Build a linear model
model <- lm(Taste~H2S)
# Create a short name for "y", the output variable, Taste
y <- Taste
n <- length(Taste)
y.hat <- predict(model) # model predictions
errors <- y - y.hat # error = observed - predicted
# Now calculate various sums of squares
TSS <- sum((y - mean(y))**2)
RegSS <- sum((y.hat - mean(y))**2)
RSS <- sum(errors**2)
# The standard error will be our reference value to judge a model's performance
sqrt(RSS/(n-2))
# Did you know you can do this in R?
plot(model)
Complete least squares regression example in R, with testing and checking afterwards
bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv')
summary(bio)
# Plot data
plot(bio)
# For convenience:
y <- bio$yield
x <- bio$temperature
# Build linear model: interpret coeff, CI and SE
model <- lm(y ~ x)
summary(model)
confint(model)
# Visualize predictions
plot(x, y)
abline(model)
# Residuals normal?
library(car)
qqPlot(model)
# Structure in residuals?
plot(resid(model))
abline(h=0, col="blue")
# Residuals in time order?
# You might have to rearrange the row order,
# the plot using the same code as previous plot
# Look at the autocorrelation to check for lack of
# independence (can be inaccurate for small datasets)
acf(resid(model))
# Predictions vs residuals
plot(predict(model), resid(model))
abline(h=0, col="blue")
# x-data vs residuals
plot(x, resid(model))
abline(h=0, col="blue")
identify(x, resid(model))
# Predictions-vs-y
plot(y, predict(model))
abline(a=0, b=1, col="blue")
identify(y, predict(model))
Test for influence, discrepancy and leverage in R with a least squares model
# Create some data to demonstrate with
# (ordinarily you'd use your own "x" and "y" variables)
N = 25
set.seed(41)
x <- rnorm(N, sd=4, mean=10)
y <- x*4 - 6 + log(x) + rnorm(N, sd=3)
# Discrepant point (model A)
x[N+1] = 11.4
y[N+1] = 72.6
# Influential point (model B)
x[N+1] = 25
y[N+1] = 42
# High leverage point (model C)
x[N+1] = 25
y[N+1] = 92.6
# Build the model 3 times: once for A, B and C
# by commmenting out the relevant lines.
#
# For example, comment out lines for B and C when
# trying to understand what a discrepant point in
# model A is.
#
# Point number 26 will always be the problematic
# point in these examples.
model <- lm(y~x)
summary(model)
plot(x, y)
abline(model, col="blue")
identify(x, y) # you'd pick out point #26 here
# Leverage: hatvalues
plot(hatvalues(model), ylim=c(0,1))
N <- length(x)
avg.hat <- 2/N
abline(h=2*avg.hat, col="darkgreen")
abline(h=3*avg.hat, col="red")
identify(hatvalues(model)) # you'd pick out point #26 here
# Discrepancy: rstudent
plot(rstudent(model))
abline(h=c(-2, 2), col="red")
# Influence: Cook's D
plot(cooks.distance(model))
K <- length(model$coefficients)
cutoff <- 4/(N-K)
abline(h=cutoff, col="red")
identify(cooks.distance(model)) # you'd pick out point #26 here
build <- seq(1,N)
remove <- -c(26) # remove point 26 and rebuild the model
model.update <- lm(model, subset=build[remove])
# Did the model improve?
library(car)
influenceIndexPlot(model.update)
Build a multiple linear regression in R (manually and automatically)
- Compare the manual and automatic results to ensure the model coefficients are the same.
- Also see the example after this one
- Try this code in a web-browser
# Calculate the model manually
x1.raw <- c(1,3,4,7,9,9)
x2.raw <- c(9,9,6,3,1,2)
y.raw <- c(3,5,6,8,7,10)
n <- length(x1.raw)
x1 <- x1.raw - mean(x1.raw)
x2 <- x2.raw - mean(x2.raw)
y <- y.raw - mean(y.raw)
X <- cbind(x1, x2)
# Calculate: b = inv(X'X) X'y
XTX <- t(X) %*% X # compare this to cov(X)*(n-1)
XTY <- t(X) %*% y
XTX.inv <- solve(XTX)
b <- XTX.inv %*% XTY
b
# Let R calculate the model:
model <- lm(y.raw ~ x1.raw + x2.raw)
summary(model)
Build the multiple linear regression in R for video 13
x1 <- c(-4.5, -2.5, -1.5, 1.5, 3.5, 3.5)
x2 <- c( 4, 4, 1, -2, -4, -3)
y <- matrix(c(-3.5, -1.5, -0.5, 1.5, 0.5, 3.5), nrow=6, ncol=1)
X <- cbind(x1, x2)
XtX <- t(X) %*% X
Xty <- t(X) %*% y
XtX.inv <- solve(XtX)
b <- XtX.inv %*% Xty
summary(lm(y ~ x1 + x2))