Least squares modelling

Learning outcomes

• Understand the difference between correlation and covariance (a short sketch follows this list).
• Understand what the objective function of least squares does
• Understand and use an analysis of variance table
• Calculate and interpret the confidence intervals from a least squares model
• Know about the assumptions required to interpret least squares model coefficients
• Use the prediction error range from the model
• Identify outlier points and classify them
• Use the linear model when there are multiple predictor variables (this is what we are building up towards; we will use multiple linear regression extensively in the next topic)
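
As a minimal sketch of the first outcome, using the same data as the first code block further below: correlation is simply covariance rescaled by the standard deviations of the two variables.

x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
cov(x, y)                    # has the units of x multiplied by the units of y
cor(x, y)                    # unitless, always between -1 and 1
cov(x, y) / (sd(x) * sd(y))  # the same value as cor(x, y)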

Resources

Tasks to do first | Quiz | Solution
Watch videos 1, 2, and 3, and complete step 16 of the Software tutorial | Quiz | Solution
Watch videos 4, 5 and 6, and also see the Thermocouple example code (below) | Quiz | Solution
Watch videos 7 and 8, and complete steps 17 and 18 of the Software tutorial | Quiz | Solution
Watch videos 9, 10 and 11, and complete step 19 of the Software tutorial | Quiz | Solution
Watch videos 12, 13, 14 and 15, and complete steps 21 and 22 of the Software tutorial | Quiz | Solution
• Advanced students should read in the course textbook about leverage, discrepancy, and influential points. Once you've done that, please complete step 20 of the Software tutorial and use the code below to detect points with high leverage, discrepancy, and influence.

• Run the code below to see how to build and use a linear model in R, but see step 16 and onwards in the R tutorial as well.
• Try some practice problems.
• Describe a linear regression model you have made for a lab report.
• What was the $$R^2$$ value?
• How did you calculate the regression model values?
• Use the same data from your report and instead calculate the standard error, $$S_E$$. How do you interpret that $$S_E$$ value now? (A short sketch of this calculation follows this list.)
• Do the YouTube challenge: find a video on YouTube that explains the central limit theorem, or the confidence interval, or least squares in a way that is different from how it was explained in class (hopefully you find better explanations than mine). Share the link with a friend in your class.
• Does eating chocolate lead to winning a Nobel prize?
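
A minimal sketch for the $$S_E$$ task above, assuming x and y hold the data from your own lab report (the variable names are placeholders):

model <- lm(y ~ x)
n <- length(y)
k <- length(coef(model))                   # 2 coefficients for a straight-line model
SE <- sqrt(sum(resid(model)^2) / (n - k))  # standard error of the model
SE
summary(model)$sigma  # R reports the same value as the "Residual standard error"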

Class videos from prior years

Videos from 2015

Watch all these videos in this YouTube playlist

1. Overview of this module [02:59]
2. Covariance and correlation [09:46]
3. Why least squares, and some other alternatives [covered in class]
4. Some of the math behind the LS model [09:44]
5. Understanding the analysis of variance (ANOVA) [11:58]
6. Interpretation of standard error [covered in class]
7. Assumptions to derive LS confidence intervals [05:45]
8. Confidence intervals interpreted and used in 3 LS examples [11:54]
9. Prediction intervals from the least squares model [04:24]
10. Checking for violations of the least squares assumptions (1 of 2) [07:27]
11. Checking for violations of the least squares assumptions (2 of 2) [11:46]
12. Introducing multiple linear regression - why we need to use it [02:59]
13. MLR - the matrix equation form and an example [11:25]
14. Interpreting the MLR model coefficients and confidence intervals [04:58]
15. Integer variables in the multiple linear regression model [09:51]
(Videos 3 and 6 were covered in class; no recording is available, but a script for each is.)

Software codes for this section

Code to show how to build and plot a least squares model in R

# Example data (Anscombe's first data set)
x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
plot(x, y)

# Build the least squares model
model.ls <- lm(y ~ x)
summary(model.ls)

coef(model.ls)     # the two model coefficients
confint(model.ls)  # confidence intervals for the coefficients

names(model.ls)    # see what else is stored in the model object
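
A hedged sketch of using the model above for predictions: R's built-in predict() function returns both confidence intervals (for the mean response) and prediction intervals (for a single new observation). The new x values below are made up for illustration.

x.new <- data.frame(x = c(5, 10, 15))  # hypothetical new x values
predict(model.ls, newdata = x.new, interval = "confidence")  # interval for the mean response
predict(model.ls, newdata = x.new, interval = "prediction")  # wider interval for a new y value
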
x <- bio$temperature  # assumes a data frame 'bio' exists; define y from the same data

# Build linear model: interpret coeff, CI and SE
model <- lm(y ~ x)
summary(model)
confint(model)

# Visualize predictions
plot(x, y)
abline(model)

# Residuals normal?
library(car)
qqPlot(model)

# Structure in residuals?
plot(resid(model))
abline(h=0, col="blue")

# Residuals in time order?
# You might have to rearrange the row order, then
# plot using the same code as the previous plot.

# Look at the autocorrelation to check for lack of
# independence (can be inaccurate for small datasets)
acf(resid(model))

# Predictions vs residuals
plot(predict(model), resid(model))
abline(h=0, col="blue")

# x-data vs residuals
plot(x, resid(model))
abline(h=0, col="blue")
identify(x, resid(model))

# Predictions-vs-y
plot(y, predict(model))
abline(a=0, b=1, col="blue")
identify(y, predict(model))


Test for influence, discrepancy and leverage in R with a least squares model

# Create some data to demonstrate with
# (ordinarily you'd use your own "x" and "y" variables)
N <- 25
set.seed(41)
x <- rnorm(N, sd=4, mean=10)
y <- x*4 - 6 + log(x) + rnorm(N, sd=3)

# Discrepant point (model A)
x[N+1] <- 11.4
y[N+1] <- 72.6

# Influential point (model B)
x[N+1] <- 25
y[N+1] <- 42

# High leverage point (model C)
x[N+1] <- 25
y[N+1] <- 92.6

# Build the model 3 times: once each for A, B and C,
# by commenting out the lines you don't need.
#
# For example, comment out the lines for B and C when
# trying to understand what a discrepant point in
# model A is.
#
# Point number 26 will always be the problematic
# point in these examples.
model <- lm(y ~ x)
summary(model)
plot(x, y)
abline(model, col="blue")
identify(x, y)  # you'd pick out point #26 here

# Leverage: hatvalues
plot(hatvalues(model), ylim=c(0, 1))
N <- length(x)
avg.hat <- 2/N  # the average hat value is K/N; K = 2 coefficients here
abline(h=2*avg.hat, col="darkgreen")
abline(h=3*avg.hat, col="red")
identify(hatvalues(model))  # you'd pick out point #26 here

# Discrepancy: rstudent
plot(rstudent(model))
abline(h=c(-2, 2), col="red")

# Influence: Cook's D
plot(cooks.distance(model))
K <- length(model$coefficients)
cutoff <- 4/(N-K)
abline(h=cutoff, col="red")
identify(cooks.distance(model)) # you'd pick out point #26 here

build <- seq(1, N)
remove <- -c(26)  # remove point 26 and rebuild the model
model.update <- update(model, subset = build[remove])

# Did the model improve?
library(car)
influenceIndexPlot(model.update)
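
As an optional extra (a suggestion, not part of the original example): base R's influence.measures() function collects the leverage, discrepancy and influence diagnostics in a single table.

# DFBETAS, DFFITS, Cook's D and hat values in one table;
# summary() lists the observations flagged as potentially influential
summary(influence.measures(model.update))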


Build a multiple linear regression in R (manually and automatically)

• Compare the manual and automatic results to ensure the model coefficients are the same.
• Also see the example after this one
• Run this code in a web-browser
# Calculate the model manually
x1.raw <- c(1,3,4,7,9,9)
x2.raw <- c(9,9,6,3,1,2)
y.raw  <- c(3,5,6,8,7,10)
n <- length(x1.raw)

x1 <- x1.raw - mean(x1.raw)
x2 <- x2.raw - mean(x2.raw)
y <- y.raw - mean(y.raw)

X <- cbind(x1, x2)

# Calculate:  b = inv(X'X) X'y
XTX <- t(X) %*% X    # compare this to cov(X)*(n-1)
XTY <- t(X) %*% y
XTX.inv <- solve(XTX)
b <- XTX.inv %*% XTY
b

# Let R calculate the model:
model <- lm(y.raw ~ x1.raw + x2.raw)
summary(model)
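
The manual calculation above works on mean-centered data, so no intercept column is needed. As a cross-check (an addition, not part of the original example), solving the same normal equations on the raw data with a column of ones should reproduce lm()'s intercept and slopes.

X.raw <- cbind(intercept = 1, x1.raw, x2.raw)
b.raw <- solve(t(X.raw) %*% X.raw, t(X.raw) %*% y.raw)
b.raw  # compare with coef(model) from the lm() fit above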


Build the multiple linear regression in R for video 13

x1 <- c(-4.5,  -2.5,  -1.5,  1.5,  3.5,  3.5)
x2 <- c(   4,     4,     1,   -2,   -4,   -3)
y <- matrix(c(-3.5,  -1.5,  -0.5,  1.5,  0.5,  3.5), nrow=6, ncol=1)
X <- cbind(x1, x2)

XtX <- t(X) %*% X
Xty <- t(X) %*% y

XtX.inv <- solve(XtX)

b <- XtX.inv %*% Xty

summary(lm(y ~ x1 + x2))
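
A short closing check (an addition to the video's code): because x1, x2 and y in this example are already mean-centered, the lm() intercept should be essentially zero, and the two slopes should match the manually calculated vector b.

b
coef(lm(y ~ x1 + x2))  # intercept approximately 0; slopes equal to b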