Difference between revisions of "Least squares modelling"

Latest revision as of 20:20, 14 January 2019

Learning outcomes

Understand the difference between correlation and covariance.
What the objective function of least squares does
Understand and use an analysis of variance table
Calculate and interpret the confidence intervals from a least squares model
Know about the assumptions required to interpret least squares model coefficients
Use the prediction error range from the model
Identify outlier points and classify them
Use the linear model when there are multiple predictor variables (this is what we are building up towards; we will use multiple linear regression extensively in the next topic)

Resources

Class notes 2015
Class notes 2014
Textbook, chapter 4
Quizzes (with solutions): attempt these after you have watched the videos

Tasks to do first	Quiz	Solution
Watch videos 1, 2, and 3, and complete step 16 of the Software tutorial	Quiz	Solution
Watch videos 4, 5 and 6, and also see the Thermocouple example code (below)	Quiz	Solution
Watch videos 7 and 8, and complete steps 17 and 18 of the Software tutorial	Quiz	Solution
Watch videos 9, 10 and 11, and complete step 19 of the Software tutorial	Quiz	Solution
Watch videos 12, 13, 14 and 15, and complete steps 21 and 22 of the Software tutorial	Quiz	Solution

Advanced students should read in the course textbook about leverage, discrepancy, and influential points. Once you've done that, please complete step 20 of the Software tutorial and use the code below to detects points with high leverage, discrepancy, and influence.

Extended readings/practice

Run the code below to see how to build and use a linear model in R, but see step 16 and onwards in the R tutorial as well.
Try some practice problems.
Describe a linear regression model you have made for a lab report.
- What was the \(R^2\) value?
- How did you calculate the regression model values?
- Use the same data from your report and instead calculate the standard error, \(S_E\). How do you interpret that \(S_E\) value now?
Do the YouTube challenge: find a video on YouTube that explains the central limit theorem, or the confidence interval, or least squares in a way that is different to explained in class (hopefully you find better explanations than mine). Share the link with a friend in your class.
Does eating chocolate lead to winning a Nobel prize?

Class videos from prior years

Videos from 2015

Watch all these videos in this YouTube playlist

Overview of this module [02:59]
Covariance and correlation [09:46]
Why least squares, and some other alternatives [covered in class]
Some of the math behind the LS model [09:44]
Understanding the analysis of variance (ANOVA) [11:58]
Interpretation of standard error [covered in class]
Assumptions to derive LS confidence intervals [05:45]
Confidence intervals interpreted and used in 3 LS examples [11:54]
Prediction intervals from the least squares model [04:24]
Checking for violations of the least squares assumptions (1 of 2) [07:27]
Checking for violations of the least squares assumptions (2 of 2) [11:46]
Introducing multiple linear regression - why we need to use it [2:59]
MLR - the matrix equation form and an example [11:25]
Interpreting the MLR model coefficients and confidence intervals [04:58]
Integer variables in the multiple linear regression model [09:51]

02:59 | Download video | Download captions | Script

09:46 | Download video | Download captions | Script

Covered in class | No video | Script

09:44 | Download video | Download captions | Script

11:58 | Download video | Download captions | Script

Covered in class | No video | Script

05:45 | Download video | Download captions | Script

11:54 | Download video | Download captions | Script

04:24 | Download video | Download captions | Script

07:27 | Download video | Download captions | Script

11:46 | Download video | Download captions | Script

02:59 | Download video | Download captions | Script

11:25 | Download video | Download captions | Script

04:58 | Download video | Download captions | Script

09:51 | Download video | Download captions | Script

Videos from 2014

See the webpage from 2014

Videos from 2013

See the webpage from 2013

Software codes for this section

Code to show how to build and plot a least squares model in R

Run this code in a web-browser


x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y <- c(8.04, 6.95, 7.58, 8.81, 8.33, 
       9.96, 7.24, 4.26, 10.84, 4.82, 5.68)
plot(x,y)
model.ls <- lm(y ~ x)
summary(model.ls)

coef(model.ls)
confint(model.ls)

names(model.ls)
model.ls$resisduals
resid(model.ls)

plot(x, y)
abline(model.ls)

Thermocouple example

Compare the \(R^2\) and \(S_E\) values. What does each of those values tell you about the model?
How do you use a least squares model in R to make a prediction on new data?
Check that the linear regression assumptions are satisfied.
Run this code in a web-browser


V <- c(0.01, 0.12, 0.24, 0.38, 0.51, 
       0.67, 0.84, 1.01, 1.15, 1.31)
T <- c(273, 293, 313, 333, 353, 
       373, 393, 413, 433, 453)
plot(V, T)
model <- lm(T ~ V)
summary(model)
coef(model)

# Get the coefficient confidence intervals
confint(model)

# Model's residuals
resid(model)
library(car)

# q-q plot of the residuals to check normality
qqPlot(resid(model)) 

plot(V, T)
v.new <- seq(0, 1.5, 0.1)
t.pred <- coef(model)[1] + coef(model)[2] * v.new
lines(v.new, t.pred, type="l", col="blue")

# Plot x against the residuals to check for non-linearity
plot(V, resid(model))
abline(h=0)

# Plot the raw data and the regression line in red
plot(V, T)
abline(model, col="red")

Build the linear regression in R for video 4

Run this code in a web-browser



taste <- c(12.3, 20.9, 25.9, 37.3, 5.5, 
           16.8, 38.9, 54.9,  57.2, 6.4)
H2S  <- c(3.14, 5.04, 7.60, 8.72, 4.79, 
          3.66, 9.06, 6.75, 7.91, 4.70)
 
# Always plot your data first:
plot(H2S, taste)
 
# Build a linear model
model <- lm(taste ~ H2S)
 
n <- length(taste)

# Model's predictions
y.hat <- predict(model)   

# error = observed - predicted
errors <- taste - y.hat
 
# Now calculate various sums of squares
TSS <-  sum((taste - mean(taste))**2)
RegSS <- sum((y.hat - mean(taste))**2)
RSS <- sum(errors**2)
 
# The standard error will be our reference value to judge a model's performance
paste0('The SE = ', round( sqrt(RSS/(n-2)),2) )
 
# Did you know you can do this in R?
plot(model)

Complete least squares regression example in R, with testing and checking afterwards

Run this code in a web-browser


bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv')
summary(bio)
 
# Plot data
plot(bio, main="Raw data")
 
# For convenience:
y <- bio$yield
x <- bio$temperature
 
# Build linear model: interpret coeff, CI and SE
model <- lm(y ~ x)
summary(model)
confint(model)
 
# Visualize predictions
plot(x, y, main="Model with prediction line")
abline(model)
 
# Residuals normal?
library(car)
qqPlot(model)
 
# Structure in residuals?
plot(resid(model), main="Model residuals")
abline(h=0, col="blue")
 
# Residuals in time order?
# You might have to rearrange the row order,
# the plot using the same code as previous plot
 
# Look at the autocorrelation to check for lack of
# independence (can be inaccurate for small datasets)
acf(resid(model), main="ACF of residuals")
 
# Predictions vs residuals
plot(predict(model), resid(model), 
     main="Predictions vs residuals")
abline(h=0, col="blue")
 
# x-data vs residuals
plot(x, resid(model), main="x vs residuals")
abline(h=0, col="blue")
identify(x, resid(model))
 
# Predictions-vs-y
plot(y, predict(model), main="y vs residuals")
abline(a=0, b=1, col="blue")
identify(y, predict(model))

Test for influence, discrepancy and leverage in R with a least squares model

You have to run the code below 3 times: each for model A, B and C.
Run this code in a web-browser


# Create some data to demonstrate with
# (ordinarily you'd use your own "x" and "y" variables)
N = 25
set.seed(41)
x <- rnorm(N, sd=4, mean=10)
y <- x*4 - 6 + log(x) + rnorm(N, sd=3)
 
# Discrepant point (model A)
x[N+1] = 11.4
y[N+1] = 72.6
 
# Influential point (model B)
x[N+1] = 25
y[N+1] = 42
 
# High leverage point (model C)
x[N+1] = 25
y[N+1] = 92.6
 
# Build the model 3 times: once for A, B and C
# by commenting out the relevant lines.
# 
# For example, comment out lines for B and C when 
# trying to understand what a discrepant point in 
# model A is.
# 
# Point number 26 will always be the problematic
# point in these examples.
 
model <- lm(y~x)
summary(model)
 
plot(x, y, main="x vs y raw data")
abline(model, col="blue")
identify(x, y)  # you'd pick out point #26 here
 
# Leverage: hatvalues
plot(hatvalues(model), ylim=c(0,1), 
     main="Hat values")
N <- length(x)
avg.hat <- 2/N
abline(h=2*avg.hat, col="darkgreen")
abline(h=3*avg.hat, col="red")
identify(hatvalues(model)) # you'd pick out point #26 here
 
# Discrepancy: rstudent
plot(rstudent(model), main="Studentized residuals")
abline(h=c(-2, 2), col="red")
 
# Influence: Cook's D
plot(cooks.distance(model), main="Cook's D")
K <- length(model$coefficients)
cutoff <- 4/(N-K)
abline(h=cutoff, col="red")
identify(cooks.distance(model)) # you'd pick out point #26 here
 
build <- seq(1,N)
remove <- -c(26)  # remove point 26 and rebuild the model
model.update <- lm(model, subset=build[remove])
 
# Did the model improve?
library(car)
influenceIndexPlot(model.update)

Build a multiple linear regression in R (manually and automatically)

Compare the manual and automatic results to ensure the model coefficients are the same.
Also see the example after this one
Run this code in a web-browser


# Calculate the model manually
x1.raw <- c(1,3,4,7,9,9)
x2.raw <- c(9,9,6,3,1,2)
y.raw  <- c(3,5,6,8,7,10)
n <- length(x1.raw)

x1 <- x1.raw - mean(x1.raw)
x2 <- x2.raw - mean(x2.raw)
y <- y.raw - mean(y.raw)

X <- cbind(x1, x2)

# Calculate:  b = inv(X'X) X'y
XTX <- t(X) %*% X    # compare this to cov(X)*(n-1)
XTY <- t(X) %*% y
XTX.inv <- solve(XTX)
b <- XTX.inv %*% XTY
b

# Let R calculate the model:
model <- lm(y.raw ~ x1.raw + x2.raw)
summary(model)

Build the multiple linear regression in R for video 13

This is the same as the prior example, except we use the centered values directly
Run this code in a web-browser


x1 <- c(-4.5, -2.5, -1.5, 1.5, 3.5, 3.5)
x2 <- c(   4,    4,    1,  -2,  -4,  -3)
y <- matrix(c(-3.5, -1.5, -0.5, 1.5, 0.5, 3.5), 
            nrow=6, ncol=1)
X <- cbind(x1, x2)

XtX <- t(X) %*% X
Xty <- t(X) %*% y

XtX.inv <- solve(XtX)

b <- XtX.inv %*% Xty

summary(lm(y ~ x1 + x2))

Difference between revisions of "Least squares modelling"

Latest revision as of 20:20, 14 January 2019

Contents

Learning outcomes

Resources

Extended readings/practice

Class videos from prior years

Videos from 2015

Videos from 2014

Videos from 2013

Software codes for this section

Code to show how to build and plot a least squares model in R

Thermocouple example

Build the linear regression in R for video 4

Complete least squares regression example in R, with testing and checking afterwards

Test for influence, discrepancy and leverage in R with a least squares model

Build a multiple linear regression in R (manually and automatically)

Build the multiple linear regression in R for video 13

Navigation menu

@@ Line 8: / Line 8: @@
 * Use the prediction error range from the model
 * Identify outlier points and classify them
-* Use the linear model when there are multiple predictor variables (this is what we are building up towards; we will use this extensively in the next topic)
+* Use the linear model when there are multiple predictor variables (this is what we are building up towards; we will use multiple linear regression extensively in the [[Design_and_analysis_of_experiments| next topic]])
-== Extended readings/practice ==
+== Resources ==
-* Run the code below to see how to build and use a linear model in R, but see [[Software_tutorial | step 16 and onwards]] in the R tutorial as well.
-* Try some [[Practice_questions|practice problems]].
-* Describe a linear regression model you have made for a lab report.
-** What was the  \(R^2\) value?
-** How did you calculate the regression model values?
-** Use the same data from your report and instead calculate the standard error, \(S_E\). How do you interpret that \(S_E\) value now?
-* Do the YouTube challenge: find a video on YouTube that explains the central limit theorem, or the confidence interval, or least squares in a way that is different to explained in class (hopefully you find better explanations than mine). Share the link with a friend in your class.
-* [http://www.nejm.org/doi/full/10.1056/NEJMon1211064 Does eating chocolate lead to winning a Nobel prize]?
-== Resources ==
-<!--
 * [[Image:Nuvola_mimetypes_pdf.png|20px|link=Media:2015-4C3-6C3-Least-squares-modelling.pdf]]  [[Media:2015-4C3-6C3-Least-squares-modelling.pdf| Class notes 2015]]
 * [[Image:Nuvola_mimetypes_pdf.png|20px|link=Media:2014-4C3-6C3-Least-squares-modelling.pdf]] [[Media:2014-4C3-6C3-Least-squares-modelling.pdf | Class notes 2014]]
@@ Line 32: / Line 22: @@
 ! Solution
 |-
-| Complete steps 10, 11, 12 and 13 of the [http://learnche.mcmaster.ca/4C3/Software_tutorial software tutorial]
+| Watch videos 1, 2, and 3, and complete step 16 of the [[Software tutorial]]
-(also steps 1 through 9)
+| [https://docs.google.com/document/d/1qI0J62y4dSIZAWhVfORLkQ9VBuF0YtcEGLsG_gVKpmM  Quiz]
-| [ Quiz]
+| [https://docs.google.com/document/d/1ycOCRHCegJoFH9BE7v3pu2qVQjgSOfWF8FtZXx1Xv44 Solution]
-| [ Solution]
 |-
-| Watch videos 1, 2, 3,  4, and 5
+| Watch videos 4, 5 and 6, and also see the Thermocouple example code (below)
-| [https://docs.google.com/document/d/19NnM1v7yut1q0HFaHMQmNg6zMH2UPN1rrHZl7l38qBI  Quiz]
+| [https://docs.google.com/document/d/1vAF1ManMFZwdfzz2p-7czwE2ytp8zVDVtZY8MIdMpCw Quiz]
-| [https://docs.google.com/document/d/1YcF4PxxfQqnz4UqgGZMWv3zfFmmjB7uSkNfQ9XMe97A Solution]
+| [https://docs.google.com/document/d/1SjauwrKIhtanlDmog8uZoSueFG0xqWIYCPdhtUyXUn8 Solution]
 |-
-| Watch videos 6, 7, and 8
+| Watch videos 7 and 8, and complete steps 17 and 18 of the [[Software tutorial]]
-| [https://docs.google.com/document/d/1BySja0o1O8MtZOitV42BU2bnjWmzH_vARcXCbh5OviI Quiz]
+| [https://docs.google.com/document/d/1O_P1oVSOr1ZQE9_GZcC5K4JwzL1rzdh_YicCvk58L1g  Quiz]
-| [https://docs.google.com/document/d/17jKKHwgVLkBT04R3OgUr9TqSzCGkxVYBP-fZtuiszig Solution]
+| [https://docs.google.com/document/d/1oiKzCLniZbWhI0zlOTFwnVPD3koP9J-pcnAbYdULoHE Solution]
 |-
-| Watch videos 9 and 10
+| Watch videos 9, 10 and 11, and complete step 19 of the [[Software tutorial]]
-| [https://docs.google.com/document/d/1qMBqum4kdgLBK3dlzXncZJm42NKdp8pnsp9jxG_dKXA Quiz]
+| [https://docs.google.com/document/d/1taLSvyikn90a8dzZ3Nnd3mhwbcaxdq_6sa0csVpf2Og  Quiz]
-| [https://docs.google.com/document/d/19mCY3qCpVKLzvOBpa-c1t7JDzLTl0xagnivRpslnTD0 Solution]
+| [https://docs.google.com/document/d/1r6T598o56FymYsD3rnkhbZMaA9n9Qua9KY8osEZtObw Solution]
 |-
-| Watch videos 11, 12, and 13
+| Watch videos 12, 13, 14 and 15, and complete steps 21 and 22 of the [[Software tutorial]]
-| [https://docs.google.com/document/d/1zxOUn30wTeg8xiJfIgfZ3WKgiUk3hqrnCV-sDTrlFiA Quiz]
+| [https://docs.google.com/document/d/1PRI02nD-Ds_Et8RxZaWBh5vH1dPVoYMgsna-BxzNPVE Quiz]
-| [https://docs.google.com/document/d/19RozNWVpmQo5oJOfw2SDhWZrJ2Sbpq2LuWFq8NsLN6g Solution]
+| [https://docs.google.com/document/d/1wn7F1FevnUKciyoHclBdoODU2nTM36Cbel0Ur_Se6yk Solution]
-|-
-| Watch videos 14, 15, and 16
-| [https://docs.google.com/document/d/1ZJ1aZ-hvS11-k5btkBqb1X16RAihbFTDjX-fNX8iQm8 Quiz]
-| [https://docs.google.com/document/d/10h2qavaw08AYC5qGAWyemXrgro0F-ARBaDnoE52S77Y Solution]
 |}
+* Advanced students should read in the course textbook about ''leverage'', ''discrepancy'', and ''influential points''. Once you've done that, please complete step 20 of the [[Software tutorial]]  and use the code below to detects points with high leverage, discrepancy, and influence.
+== Extended readings/practice ==
+* Run the code below to see how to build and use a linear model in R, but see [[Software_tutorial | step 16 and onwards]] in the R tutorial as well.
+* Try some [[Practice_questions|practice problems]].
+* Describe a linear regression model you have made for a lab report.
+** What was the  \(R^2\) value?
+** How did you calculate the regression model values?
+** Use the same data from your report and instead calculate the standard error, \(S_E\). How do you interpret that \(S_E\) value now?
+* Do the YouTube challenge: find a video on YouTube that explains the central limit theorem, or the confidence interval, or least squares in a way that is different to explained in class (hopefully you find better explanations than mine). Share the link with a friend in your class.
+* [http://www.nejm.org/doi/full/10.1056/NEJMon1211064 Does eating chocolate lead to winning a Nobel prize]?
+== Class videos from prior years ==
+===Videos from 2015===
+Watch all these videos in [https://www.youtube.com/watch?v=RW_8yKbMzUA&index=18&list=PLHUnYbefLmeOPRuT1sukKmRyOVd4WSxJE this YouTube playlist]
+# Overview of this module [02:59]
+# Covariance and correlation [09:46]
+# Why least squares, and some other alternatives [covered in class]
+# Some of the math behind the LS model [09:44]
+# Understanding the analysis of variance (ANOVA) [11:58]
+# Interpretation of standard error [covered in class]
+# Assumptions to derive LS confidence intervals [05:45]
+# Confidence intervals interpreted and used in 3 LS examples [11:54]
+# Prediction intervals from the least squares model [04:24]
+# Checking for violations of the least squares assumptions (1 of 2) [07:27]
+# Checking for violations of the least squares assumptions (2 of 2) [11:46]
+# Introducing multiple linear regression - why we need to use it [2:59]
+# MLR - the matrix equation form and an example [11:25]
+# Interpreting the MLR model coefficients and confidence intervals [04:58]
+# Integer variables in the multiple linear regression model [09:51]
+{{VideoBox
+|ytid = RW_8yKbMzUA
+|timing=02:59
+|shortcut=reg-01
+}}
+{{VideoBox
+|ytid = tXOCOMtSWrc
+|timing=09:46
+|shortcut=reg-02
+}}
+{{VideoBox
+|ytid =
+|timing=Covered in class
+|shortcut=reg-03
+}}
+{{VideoBox
+|ytid = 8d_pbx4vnsI
+|timing=09:44
+|shortcut=reg-04
+}}
+{{VideoBox
+|ytid = xIjAD_6nXto
+|timing=11:58
+|shortcut=reg-05
+}}
+{{VideoBox
+|ytid =
+|timing=Covered in class
+|shortcut=reg-06
+}}
+{{VideoBox
+|ytid = Qls1R2HOzy0
+|timing=05:45
+|shortcut=reg-07
+}}
+{{VideoBox
+|ytid = sY8CVMGUD54
+|timing=11:54
+|shortcut=reg-08
+}}
+{{VideoBox
+|ytid = N8NF1_CBTw4
+|timing=04:24
+|shortcut=reg-09
+}}
+{{VideoBox
+|ytid = phufT1KE9Sk
+|timing=07:27
+|shortcut=reg-10
+}}
+{{VideoBox
+|ytid = 7fd8Qu1i3Dk
+|timing=11:46
+|shortcut=reg-11
+}}
+{{VideoBox
+|ytid = qiv1nBCfwBg
+|timing=02:59
+|shortcut=reg-12
+}}
+{{VideoBox
+|ytid = dkfY0OKH12g
+|timing=11:25
+|shortcut=reg-13
+}}
+{{VideoBox
+|ytid = FkZZiv3J6xQ
+|timing=04:58
+|shortcut=reg-14
+}}
+{{VideoBox
+|ytid = TrhG-XhnBK4
+|timing=09:51
+|shortcut=reg-15
+}}
+===Videos from 2014===
+[[Least squares modelling (2014)|See the webpage from 2014]]
+<!--
+{{#widget:YouTube|id=Ja1tJeDPUds}}
+{{#widget:YouTube|id=UZz4vA49pPg}}
+{{#widget:YouTube|id=pkgwO8q0Oig}}
+{{#widget:YouTube|id=qeuQNS-g05w}}
+{{#widget:YouTube|id=QuO4R_ynd3M}}
+{{#widget:YouTube|id=4PmKI3lOQXk}}
+{{#widget:YouTube|id=AkJcGmmT2S8}}
+{{#widget:YouTube|id=pVzMPZYFzGE}}
 -->
+===Videos from 2013===
+[[Least squares modelling (2013)|See the webpage from 2013]]
+<!--
+{{#widget:Vimeo|id=59609654}}
+{{#widget:Vimeo|id=59763486}}
+{{#widget:Vimeo|id=60592349}}
+{{#widget:Vimeo|id=60688909}}
+{{#widget:Vimeo|id=60862565}}
+{{#widget:Vimeo|id=61138805}}
+{{#widget:Vimeo|id=61211690}}
+-->
+== Software codes for this section ==
+=== Code to show how to build and plot a least squares model in R ===
+[http://www.r-fiddle.org/#/fiddle?id=tOdY4LEG Run this code in a web-browser]
+<html><div data-datacamp-exercise data-lang="r" data-height="500px">
+	<code data-type="sample-code">
+x <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
+y <- c(8.04, 6.95, 7.58, 8.81, 8.33,
+.96, 7.24, 4.26, 10.84, 4.82, 5.68)
+plot(x,y)
+model.ls <- lm(y ~ x)
+summary(model.ls)
+coef(model.ls)
+confint(model.ls)
+names(model.ls)
+model.ls$resisduals
+resid(model.ls)
+plot(x, y)
+abline(model.ls)
+	</code>
+</div></html>
+=== Thermocouple example  ===
+* Compare the \(R^2\) and \(S_E\) values. What does each of those values tell you about the model?
+* How do you use a least squares model in R to make a prediction on new data?
+* Check that the linear regression assumptions are satisfied.
+* [http://www.r-fiddle.org/#/fiddle?id=iomphczu Run this code in a web-browser]
+<html><div data-datacamp-exercise data-lang="r">
+	<code data-type="sample-code">
+V <- c(0.01, 0.12, 0.24, 0.38, 0.51,
+.67, 0.84, 1.01, 1.15, 1.31)
+T <- c(273, 293, 313, 333, 353,
+, 393, 413, 433, 453)
+plot(V, T)
+model <- lm(T ~ V)
+summary(model)
+coef(model)
+# Get the coefficient confidence intervals
+confint(model)
+# Model's residuals
+resid(model)
+library(car)
+# q-q plot of the residuals to check normality
+qqPlot(resid(model))
+plot(V, T)
+v.new <- seq(0, 1.5, 0.1)
+t.pred <- coef(model)[1] + coef(model)[2] * v.new
+lines(v.new, t.pred, type="l", col="blue")
+# Plot x against the residuals to check for non-linearity
+plot(V, resid(model))
+abline(h=0)
+# Plot the raw data and the regression line in red
+plot(V, T)
+abline(model, col="red")
+	</code>
+</div></html>
+=== Build the linear regression in R for video 4 ===
+* [http://www.r-fiddle.org/#/fiddle?id=3Ex27Wkb Run this code in a web-browser]
+<html><div data-datacamp-exercise data-lang="r">
+    <code data-type="sample-code">
+taste <- c(12.3, 20.9, 25.9, 37.3, 5.5,
+.8, 38.9, 54.9,  57.2, 6.4)
+H2S  <- c(3.14, 5.04, 7.60, 8.72, 4.79,
+.66, 9.06, 6.75, 7.91, 4.70)
+# Always plot your data first:
+plot(H2S, taste)
+# Build a linear model
+model <- lm(taste ~ H2S)
+n <- length(taste)
+# Model's predictions
+y.hat <- predict(model)
+# error = observed - predicted
+errors <- taste - y.hat
+# Now calculate various sums of squares
+TSS <-  sum((taste - mean(taste))**2)
+RegSS <- sum((y.hat - mean(taste))**2)
+RSS <- sum(errors**2)
+# The standard error will be our reference value to judge a model's performance
+paste0('The SE = ', round( sqrt(RSS/(n-2)),2) )
+# Did you know you can do this in R?
+plot(model)
+    </code>
+</div></html>
+=== Complete least squares regression example in R, with testing and checking afterwards ===
+* [http://www.r-fiddle.org/#/fiddle?id=ZUtwiezn Run this code in a web-browser]
+<html><div data-datacamp-exercise data-lang="r" data-height="600px">
+    <code data-type="sample-code">
+bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv')
+summary(bio)
+# Plot data
+plot(bio, main="Raw data")
+# For convenience:
+y <- bio$yield
+x <- bio$temperature
+# Build linear model: interpret coeff, CI and SE
+model <- lm(y ~ x)
+summary(model)
+confint(model)
+# Visualize predictions
+plot(x, y, main="Model with prediction line")
+abline(model)
+# Residuals normal?
+library(car)
+qqPlot(model)
+# Structure in residuals?
+plot(resid(model), main="Model residuals")
+abline(h=0, col="blue")
+# Residuals in time order?
+# You might have to rearrange the row order,
+# the plot using the same code as previous plot
+# Look at the autocorrelation to check for lack of
+# independence (can be inaccurate for small datasets)
+acf(resid(model), main="ACF of residuals")
+# Predictions vs residuals
+plot(predict(model), resid(model),
+     main="Predictions vs residuals")
+abline(h=0, col="blue")
+# x-data vs residuals
+plot(x, resid(model), main="x vs residuals")
+abline(h=0, col="blue")
+identify(x, resid(model))
+# Predictions-vs-y
+plot(y, predict(model), main="y vs residuals")
+abline(a=0, b=1, col="blue")
+identify(y, predict(model))
+    </code>
+</div></html>
+=== Test for influence, discrepancy and leverage in R with a least squares model===
+* You have to run the code below 3 times: each for model A, B and C.
+* [http://www.r-fiddle.org/#/fiddle?id=LOhZV12t Run this code in a web-browser]
+<html><div data-datacamp-exercise data-lang="r"  data-height="600px">
+    <code data-type="sample-code">
+# Create some data to demonstrate with
+# (ordinarily you'd use your own "x" and "y" variables)
+N = 25
+set.seed(41)
+x <- rnorm(N, sd=4, mean=10)
+y <- x*4 - 6 + log(x) + rnorm(N, sd=3)
+# Discrepant point (model A)
+x[N+1] = 11.4
+y[N+1] = 72.6
+# Influential point (model B)
+x[N+1] = 25
+y[N+1] = 42
+# High leverage point (model C)
+x[N+1] = 25
+y[N+1] = 92.6
+# Build the model 3 times: once for A, B and C
+# by commenting out the relevant lines.
+#
+# For example, comment out lines for B and C when
+# trying to understand what a discrepant point in
+# model A is.
+#
+# Point number 26 will always be the problematic
+# point in these examples.
+model <- lm(y~x)
+summary(model)
+plot(x, y, main="x vs y raw data")
+abline(model, col="blue")
+identify(x, y)  # you'd pick out point #26 here
+# Leverage: hatvalues
+plot(hatvalues(model), ylim=c(0,1),
+     main="Hat values")
+N <- length(x)
+avg.hat <- 2/N
+abline(h=2*avg.hat, col="darkgreen")
+abline(h=3*avg.hat, col="red")
+identify(hatvalues(model)) # you'd pick out point #26 here
+# Discrepancy: rstudent
+plot(rstudent(model), main="Studentized residuals")
+abline(h=c(-2, 2), col="red")
+# Influence: Cook's D
+plot(cooks.distance(model), main="Cook's D")
+K <- length(model$coefficients)
+cutoff <- 4/(N-K)
+abline(h=cutoff, col="red")
+identify(cooks.distance(model)) # you'd pick out point #26 here
+build <- seq(1,N)
+remove <- -c(26)  # remove point 26 and rebuild the model
+model.update <- lm(model, subset=build[remove])
+# Did the model improve?
+library(car)
+influenceIndexPlot(model.update)
+    </code>
+</div></html>
+=== Build a multiple linear regression in R (manually and automatically) ===
+* Compare the manual and automatic results to ensure the model coefficients are the same.
+* Also see the example after this one
+* [http://www.r-fiddle.org/#/fiddle?id=Wrn7Z8BC Run this code in a web-browser]
+<html><div data-datacamp-exercise data-lang="r" data-height="auto">
+    <code data-type="sample-code">
+# Calculate the model manually
+x1.raw <- c(1,3,4,7,9,9)
+x2.raw <- c(9,9,6,3,1,2)
+y.raw  <- c(3,5,6,8,7,10)
+n <- length(x1.raw)
+x1 <- x1.raw - mean(x1.raw)
+x2 <- x2.raw - mean(x2.raw)
+y <- y.raw - mean(y.raw)
+X <- cbind(x1, x2)
+# Calculate:  b = inv(X'X) X'y
+XTX <- t(X) %*% X    # compare this to cov(X)*(n-1)
+XTY <- t(X) %*% y
+XTX.inv <- solve(XTX)
+b <- XTX.inv %*% XTY
+b
+# Let R calculate the model:
+model <- lm(y.raw ~ x1.raw + x2.raw)
+summary(model)
+    </code>
+</div></html>
+=== Build the multiple linear regression in R for video 13 ===
+* This is the same as the prior example, except we use the centered values directly
+* [http://www.r-fiddle.org/#/fiddle?id=Wrn7Z8BC Run this code in a web-browser]
+<html><div data-datacamp-exercise data-lang="r" data-height="auto">
+    <code data-type="sample-code">
+x1 <- c(-4.5, -2.5, -1.5, 1.5, 3.5, 3.5)
+x2 <- c(   4,    4,    1,  -2,  -4,  -3)
+y <- matrix(c(-3.5, -1.5, -0.5, 1.5, 0.5, 3.5),
+            nrow=6, ncol=1)
+X <- cbind(x1, x2)
+XtX <- t(X) %*% X
+Xty <- t(X) %*% y
+XtX.inv <- solve(XtX)
+b <- XtX.inv %*% Xty
+summary(lm(y ~ x1 + x2))
+    </code>
+</div></html>

Difference between revisions of "Least squares modelling"

Latest revision as of 20:20, 14 January 2019

Learning outcomes

Resources

Extended readings/practice

Class videos from prior years

Videos from 2015

Videos from 2014

Videos from 2013

Software codes for this section

Code to show how to build and plot a least squares model in R

Thermocouple example

Build the linear regression in R for video 4

Complete least squares regression example in R, with testing and checking afterwards

Test for influence, discrepancy and leverage in R with a least squares model

Build a multiple linear regression in R (manually and automatically)

Build the multiple linear regression in R for video 13

Navigation menu

Search