Least squares modelling

From Statistics for Engineering

Learning outcomes

  • Understand the difference between correlation and covariance.
  • Understand what the objective function of least squares does
  • Understand and use an analysis of variance table
  • Calculate and interpret the confidence intervals from a least squares model
  • Know about the assumptions required to interpret least squares model coefficients
  • Use the prediction error range from the model
  • Identify outlier points and classify them
  • Use the linear model when there are multiple predictor variables (this is what we are building up towards; we will use this extensively in the next topic)

Extended readings/practice

  • Run the code below to see how to build and use a linear model in R, but see step 16 and onwards in the R tutorial as well.
  • Try some practice problems.
  • Describe a linear regression model you have made for a lab report.
    • What was the \(R^2\) value?
    • How did you calculate the regression model values?
  • Use the same data from your report and instead calculate the standard error, \(S_E\). How do you interpret that \(S_E\) value now? (A short sketch of how to pull \(S_E\) out of an R model appears after the first code block below.)
  • Do the YouTube challenge: find a video on YouTube that explains the central limit theorem, the confidence interval, or least squares in a way that is different from how it was explained in class (hopefully you find better explanations than mine). Share the link with a friend in your class.
  • Does eating chocolate lead to winning a Nobel prize?

Resources

  • Tasks to do first (Quiz, Solution)
  • Watch videos 1, 2, and 3, and complete step 16 of the Software_tutorial (Quiz, Solution)
  • Watch videos 4, 5 and 6, and also see the Thermocouple example code below (Quiz, Solution)
  • Watch videos 7 and 8, and complete steps 17 and 18 of the Software_tutorial (Quiz, Solution)
  • Watch videos 9, 10 and 11, and complete step 19 of the Software_tutorial (Quiz, Solution)
  • Watch videos 12, 13, 14 and 15 (Quiz, Solution)

Class videos from prior years

Videos from 2015

  1. Overview of this module [02:59]
  2. Covariance and correlation [09:46]
  3. Why least squares, and some other alternatives [covered in class]
  4. Some of the math behind the LS model [09:44]
  5. Understanding the analysis of variance (ANOVA) [11:58]
  6. Interpretation of standard error [covered in class]
  7. Assumptions to derive LS confidence intervals [05:45]
  8. Confidence intervals interpreted and used in 3 LS examples [11:54]
  9. Prediction intervals from the least squares model [04:24]
  10. Checking for violations of the least squares assumptions (1 of 2) [07:27]
  11. Checking for violations of the least squares assumptions (2 of 2) [11:46]
  12. Introducing multiple linear regression - why we need to use it [02:59]
  13. MLR - the matrix equation form and an example [11:25]
  14. Interpreting the MLR model coefficients and confidence intervals [04:58]
  15. Integer variables in the multiple linear regression model [09:51]


Videos from 2014

See the webpage from 2014

Videos from 2013

See the webpage from 2013

Software codes for this section

Code to show how to build and plot a least squares model in R

Try this code in a web-browser

x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
plot(x,y)
model.ls <- lm(y ~ x)
summary(model.ls)

coef(model.ls)       # model coefficients (intercept and slope)
confint(model.ls)    # confidence intervals for the coefficients

names(model.ls)      # see what is stored inside the model object
model.ls$residuals   # the residuals, accessed directly
resid(model.ls)      # the same residuals, via the extractor function

plot(x, y)
abline(model.ls)
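
The summary output above already contains the \(R^2\) value and the standard error \(S_E\) that the Extended readings ask about. A minimal sketch, assuming the model.ls object built in the code above, of how to pull both numbers out directly:

summary(model.ls)$r.squared   # the R^2 value of the fit
summary(model.ls)$sigma       # the standard error, S_E, in the units of y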

Thermocouple example

  • Compare the \(R^2\) and \(S_E\) values. What does each of those values tell you about the model?
  • How do you use a least squares model in R to make a prediction on new data? (A short sketch follows the code below.)
  • Check that the linear regression assumptions are satisfied.
  • Try this code in a web-browser
V <- c(0.01, 0.12, 0.24, 0.38, 0.51, 0.67, 0.84, 1.01, 1.15, 1.31)
T <- c(273, 293, 313, 333, 353, 373, 393, 413, 433, 453)
plot(V, T)
model <- lm(T ~ V)
summary(model)
coef(model)
confint(model)   # get the coefficient confidence intervals
resid(model)     # model residuals
library(car)
qqPlot(resid(model)) # q-q plot of the residuals to check normality

plot(V, T)
v.new <- seq(0, 1.5, 0.1)
t.pred <- coef(model)[1] + coef(model)[2] * v.new
lines(v.new, t.pred, type="l", col="blue")

# Plot x against the residuals to check for non-linearity
plot(V, resid(model))
abline(h=0)

# Plot the raw data and the regression line in red
plot(V, T)
abline(model, col="red")
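
Rather than computing predictions by hand from the coefficients, R's predict() function accepts a data frame of new inputs and can also return a prediction interval. A minimal sketch, assuming the thermocouple model above has been fitted; the new voltage values are made up for illustration:

# Predict the temperature at new voltage readings, with a 95% prediction interval
V.new <- data.frame(V=c(0.30, 0.75, 1.20))
predict(model, newdata=V.new, interval="prediction", level=0.95)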

Build the linear regression in R for video 4

Taste <- c(12.3, 20.9, 25.9, 37.3, 5.5, 16.8, 38.9, 54.9, 57.2, 6.4)
H2S   <- c(3.14, 5.04, 7.60, 8.72, 4.79, 3.66, 9.06, 6.75, 7.91, 4.70)
 
# Always plot your data first
plot(H2S, Taste)
 
# Build a linear model
model <- lm(Taste~H2S)
 
# Create a short name for "y", the output variable, Taste
y <- Taste
n <- length(Taste)

y.hat <- predict(model)   # model predictions
errors <- y - y.hat       # error = observed - predicted
 
# Now calculate various sums of squares
TSS <-  sum((y - mean(y))**2)
RegSS <- sum((y.hat - mean(y))**2)
RSS <- sum(errors**2)
 
# The standard error will be our reference value to judge a model's performance
sqrt(RSS/(n-2))
 
#  Did you know you can do this in R?
plot(model)
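
The sums of squares above also give the \(R^2\) value directly, since \(R^2 = \text{RegSS}/\text{TSS}\). A short sketch, assuming the code above has already been run, to check the hand calculation against what R reports:

RegSS/TSS                  # R^2 from the sums of squares
summary(model)$r.squared   # should agree with the line above
summary(model)$sigma       # the standard error S_E; compare with sqrt(RSS/(n-2))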


Complete least squares regression example in R, with testing and checking afterwards

bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv')
summary(bio)
 
# Plot data
plot(bio)
 
# For convenience:
y <- bio$yield
x <- bio$temperature
 
# Build linear model: interpret coeff, CI and SE
model <- lm(y ~ x)
summary(model)
confint(model)
 
# Visualize predictions
plot(x, y)
abline(model)
 
# Residuals normal?
library(car)
qqPlot(model)
 
# Structure in residuals?
plot(resid(model))
abline(h=0, col="blue")
 
# Residuals in time order?
# You might have to rearrange the rows into time order first,
# then plot using the same code as the previous plot
 
# Look at the autocorrelation to check for lack of
# independence (can be inaccurate for small datasets)
acf(resid(model))
 
# Predictions vs residuals
plot(predict(model), resid(model))
abline(h=0, col="blue")
 
# x-data vs residuals
plot(x, resid(model))
abline(h=0, col="blue")
identify(x, resid(model))
 
# Predictions-vs-y
plot(y, predict(model))
abline(a=0, b=1, col="blue")
identify(y, predict(model))
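
As a numerical complement to the acf plot when checking residual independence, the Durbin-Watson test in the car package (already loaded above) can be used; this is an optional extra, not part of the original example:

# Durbin-Watson test: a statistic near 2 suggests little autocorrelation in the residuals
durbinWatsonTest(model)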

Test for influence, discrepancy and leverage in R with a least squares model

# Create some data to demonstrate with
# (ordinarily you'd use your own "x" and "y" variables)
N <- 25
set.seed(41)
x <- rnorm(N, sd=4, mean=10)
y <- x*4 - 6 + log(x) + rnorm(N, sd=3)
 
# Discrepant point (model A)
x[N+1] = 11.4
y[N+1] = 72.6
 
# Influential point (model B)
x[N+1] = 25
y[N+1] = 42
 
# High leverage point (model C)
x[N+1] = 25
y[N+1] = 92.6
 
# Build the model 3 times: once for A, B and C
# by commenting out the relevant lines.
# 
# For example, comment out lines for B and C when 
# trying to understand what a discrepant point in 
# model A is.
# 
# Point number 26 will always be the problematic
# point in these examples.
 
model <- lm(y~x)
summary(model)
 
plot(x, y)
abline(model, col="blue")
identify(x, y)  # you'd pick out point #26 here
 
# Leverage: hatvalues
plot(hatvalues(model), ylim=c(0,1))
N <- length(x)
avg.hat <- 2/N                         # average hat value is K/N; here K = 2 (intercept and slope)
abline(h=2*avg.hat, col="darkgreen")   # cutoff at 2 times the average hat value
abline(h=3*avg.hat, col="red")         # cutoff at 3 times the average hat value
identify(hatvalues(model)) # you'd pick out point #26 here
 
# Discrepancy: rstudent
plot(rstudent(model))
abline(h=c(-2, 2), col="red")
 
# Influence: Cook's D
plot(cooks.distance(model))
K <- length(model$coefficients)
cutoff <- 4/(N-K)
abline(h=cutoff, col="red")
identify(cooks.distance(model)) # you'd pick out point #26 here
 
build <- seq(1,N)
remove <- -c(26)  # remove point 26 and rebuild the model
model.update <- lm(model, subset=build[remove])
 
# Did the model improve?
library(car)
influenceIndexPlot(model.update)
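
To judge whether removing point 26 actually changed the model, compare the two fits side by side; a brief sketch continuing from the objects created above:

coef(model)                   # coefficients with the problematic point included
coef(model.update)            # coefficients after removing point 26
confint(model)                # confidence intervals before ...
confint(model.update)         # ... and after removal
summary(model)$sigma          # standard error before removal
summary(model.update)$sigma   # standard error after removal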

Build a multiple linear regression in R (manually and automatically)

  • Compare the manual and automatic results to ensure the model coefficients are the same.
  • Also see the example after this one
  • Try this code in a web-browser
# Calculate the model manually
x1.raw <- c(1,3,4,7,9,9)
x2.raw <- c(9,9,6,3,1,2)
y.raw  <- c(3,5,6,8,7,10)
n <- length(x1.raw)

x1 <- x1.raw - mean(x1.raw)
x2 <- x2.raw - mean(x2.raw)
y <- y.raw - mean(y.raw)

X <- cbind(x1, x2)

# Calculate:  b = inv(X'X) X'y
XTX <- t(X) %*% X    # compare this to cov(X)*(n-1)
XTY <- t(X) %*% y
XTX.inv <- solve(XTX)
b <- XTX.inv %*% XTY
b

# Let R calculate the model:
model <- lm(y.raw ~ x1.raw + x2.raw)
summary(model)
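
The same matrix algebra gives the standard errors of the slope coefficients, since for the mean-centred model \(\text{Var}(b) = S_E^2 (X^T X)^{-1}\). A minimal sketch, continuing from the objects above, to compare against R's output:

e <- y - X %*% b                 # residuals of the centred model
SE <- sqrt(sum(e^2) / (n - 3))   # n observations, less 3 parameters (intercept and 2 slopes)
SE * sqrt(diag(XTX.inv))         # standard errors of the two slopes
summary(model)$coefficients      # compare with the "Std. Error" column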

Build the multiple linear regression in R for video 13

x1 <- c(-4.5,  -2.5,  -1.5,  1.5,  3.5,  3.5)
x2 <- c(   4,     4,     1,   -2,   -4,   -3)
y <- matrix(c(-3.5,  -1.5,  -0.5,  1.5,  0.5,  3.5), nrow=6, ncol=1)
X <- cbind(x1, x2)

XtX <- t(X) %*% X
Xty <- t(X) %*% y

XtX.inv <- solve(XtX)

b <- XtX.inv %*% Xty

summary(lm(y ~ x1 + x2))
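
Since x1, x2 and y in this example are already mean-centred, the intercept that lm() estimates should be essentially zero and its slopes should match b. A quick sketch, assuming the code above has been run, to verify and to look at the fit:

b                        # slopes from the matrix algebra
coef(lm(y ~ x1 + x2))    # intercept near zero; slopes should match b
y.hat <- X %*% b         # predicted values from the matrix solution
y - y.hat                # residuals
sum((y - y.hat)^2)       # residual sum of squares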