4.9. Summary of steps to build and investigate a linear model

  1. Plot the data to assess model structure and degree of correlation between the \(\mathrm{x}\) and \(\mathrm{y}\) variable.

    plot(x, y)           # plot the raw data
    lines(lowess(x,y))   # superimpose non-parametric smoother to see correlation
  2. Fit the model and examine the printed output.

    model <- lm(y ~ x)   # fit the model: "y as described by variable x"
    • Investigate the model’s standard error, how does it compare to the range of the \(\mathrm{y}\) variable?

    • Calculate confidence intervals for the model parameters and interpret them.

  3. Visualize the model’s predictions in the context of the model building data.

    plot(x, y)
    lines(lowess(x,y))        # show the smoother
    abline(model, col="red")  # and show the least squares model
  4. Plot a normal probability plot, or a q-q plot, of the residuals. Are they normally distributed? If not, investigate if a transformation of the \(\mathrm{y}\) variable might improve them. But also see the additional plots on checking for non-linearity and consider adding extra explanatory variables.

  5. Plot the residuals against the \(\mathrm{x}\)-values. We expect to see no particular structure. If you see trends in the data, it indicates that a transformation of the \(\mathrm{x}\) variable might be appropriate, or that there are unmodelled phenomena in the \(\mathrm{y}\) variable - we might need an additional \(\mathrm{x}\) variable.

    plot(x, resid(model))
    abline(h=0, col="red")
  6. Plot the residuals in time (sequence) order. We expect to see no particular trends in the data. If there are patterns in the plot, assess whether autocorrelation is present in the \(\mathrm{y}\) variable (use the acf(y) function in R). If so, you might have to sub-sample the data, or resort to proper time-series analysis tools to fit your model.

    abline(h=0, col="red")
    lines(lowess(resid(model), f=0.2))   # use a shorter smoothing span
  7. Plot the residuals against the fitted-values. By definition of the least-squares model, the covariance between the residuals and the fitted values is zero. You can verify that \(e^T\hat{y} = \sum_i^n{e_i\hat{y}_i} = 0\). A fan-shape to the residuals indicates the residual variance is not constant over the range of data: you will have to use weighted least squares to counteract that. It is better to use studentized residuals, rather than the actual residuals, since the actual residuals can show non-constant variance even though the errors have constant variance.

    plot(predict(model), rstudent(model))
    lines(lowess(predict(model), rstudent(model)))
    abline(h=0, col="red")
  8. Plot the predictions of \(\mathrm{y}\) against the actual values of \(\mathrm{y}\). We expect the data to fall around a 45 degree line.

    plot(y, predict(model))
    lines(lowess(y, predict(model), f=0.5))     # a smoother
    abline(a=0, b=1, col="red")                 # a 45 degree line