Software tutorial/Investigating outliers, discrepancies and other influential points

From Statistics for Engineering
Jump to navigation Jump to search
← Transformation of data in a linear model (previous step) Tutorial index Next step: Linear models with multiple X-variables (MLR) →


<rst> <rst-options: 'toc' = False/> <rst-options: 'reset-figures' = False/> Recall that **influence** is a measure of how each observation affects the model. We use plots of these numbers to decide on how we should treat an outlier.

For this section we will use a data set that is built into R, the stack-loss data set. It is 21 days of plant data from oxidizing ammonia to nitric acid.

.. code-block:: s

attach(stackloss) summary(stackloss)

# Air.Flow Water.Temp Acid.Conc. stack.loss # Min. :50.00 Min. :17.00 Min. :72.00 Min.  : 7.00 # 1st Qu.:56.00 1st Qu.:18.00 1st Qu.:82.00 1st Qu.:11.00 # Median :58.00 Median :20.00 Median :87.00 Median :15.00 # Mean :60.43 Mean :21.10 Mean :86.29 Mean :17.52 # 3rd Qu.:62.00 3rd Qu.:24.00 3rd Qu.:89.00 3rd Qu.:19.00 # Max. :80.00 Max. :27.00 Max. :93.00 Max. :42.00

We will consider only the effect of air flow and stack loss (stack loss here in an inverse measure of plant efficiency). Type ``help(stackloss)`` for more details about the data. Build the model and investigate the normality of the residuals.

.. code-block:: s

model <- lm(stack.loss ~ Air.Flow) library(car)

qqPlot(resid(model), id.method = "identify")

# Now use the mouse to click on the outliers and ID them # Right-click to stop adding points

From clicking on the points we see that observations 1, 3, 4 and 21 are quite unusual. </rst>

Residuals-stackloss-data.jpg

<rst> <rst-options: 'toc' = False/> <rst-options: 'reset-figures' = False/> These observations have residuals larger than what would be expected from a normal distribution. We don't exclude them yet. Let's examine if they appear in some of the other plots.

Leverage


This is a plot of the hat-values:

.. code-block:: s

plot(hatvalues(model)) K <- length(coef(model)) N <- nrow(stackloss) hat.avg <- K/N abline(h=c(2,3)*hat.avg, lty=2, col=c("orange", "red")) identify(hatvalues(model))

The average hat value is at :math:`\overline{h} = k/n`. Observations 1 and 2 lie beyond only the 2 times the average hat value. We are more concerned with points that have hat values beyond 3 times the average. </rst>

Hats-plot-stackloss.jpg

<rst> <rst-options: 'toc' = False/> <rst-options: 'reset-figures' = False/> Discrepancy


The discrepancy of each data point is found by plotting the studentized residuals:

.. code-block:: s

plot(rstudent(model)) abline(h=0, lty=2) abline(h=c(-2,2), lty=2, col="red") identify(rstudent(model))

Recall the cut-offs are at :math:`\pm 2` and contain 95% of the data (1 in 20 observations will naturally lie outside these limits). Observations 4 and 21 lie outside the limits and should be investigated. Specifically observation 4 is under-predicted (error = observed - predicted), while observation 21 is over-predicted.

Was there anything extra going on with those observations? Can another variable be included into the model that will reduce the size of the residuals for those points? </rst>

Rstudent-stackloss.jpg

<rst> <rst-options: 'toc' = False/> <rst-options: 'reset-figures' = False/> Influence


A plot of the Cook's D values shows influence. We loosely can describe influence as:

.. math::

\text{Influence} = \text{Leverage} \times \text{Discrepancy}

though Cook's D calculates leverage and discrepancy slightly differently from shown previously.

.. code-block:: s

plot(cooks.distance(model)) cutoff <- 4/model$df.residual abline(h=cutoff, lty=2, col=c("orange", "red")) identify(cooks.distance(model))

The cutoff for Cook's D is :math:`4/(n-k)`. Observations 1 and 21 lie beyond only the cutoff in this data set. Why did observation 1 show up, but not observation 2 (*answer*: both originally had high leverage, but observation 1 has higher residual error than observation 2). </rst>

CooksD-stackloss.jpg

<rst> <rst-options: 'toc' = False/> <rst-options: 'reset-figures' = False/> Combine leverage, discrepancy and influence to understand outliers


.. code-block:: s

library(car) # Let the function auto-identify the outliers; tell it which labels to use influencePlot(model, id.method="noteworthy", labels=row.names(stackloss))

# Manually identify outliers with your mouse influencePlot(model, id.method="identify", labels=row.names(stackloss))

The auto-identify function marks only observations with large Cook's distance values. You should still investigate the other points. </rst>

InfluencePlot-stackloss.jpg

<rst> <rst-options: 'toc' = False/> <rst-options: 'reset-figures' = False/> Also see ``influence(model)`` and ``influence.measures(model)`` for other metrics used to assess influence on a model from each observation.

Also, the ``influenceIndexPlot(model)`` function from the ``car`` library shows 4 different metrics of influence in 4 subplots and is useful for smallish data sets.

Removing outliers and rebuilding the model


After investigation of the points, we may decide to remove point 1 and 21 and rebuild the model:

.. code-block:: s

remove = -c(1, 21) model.rebuild <- lm(model, subset=remove)

Note how easy it is rebuild the model: you give it the existing ``model`` and the observations to remove (note the "``-``" in front of the ``c()``).

Then re-investigate the influence plot:

.. code-block:: s

influencePlot(model.rebuild, labels=row.names(stackloss)[remove]) </rst>

InfluencePlot-rebuild-stackloss.jpg
← Transformation of data in a linear model (previous step) Tutorial index Next step: Linear models with multiple X-variables (MLR) →