Software tutorial/Linear models with integer variables

From Statistics for Engineering
Jump to: navigation, search
← Linear models with multiple X-variables (MLR) (previous step) Tutorial index


The only additional step required to include an integer variable is to tell R that the variable a factor or categorical type variable. R will then take care of expanding it into the extra columns required to fit the linear model.

The rest of the tools for linear models are then used as normal, e.g. confint(model), predict(model, ...) and so on.

Let's start by creating a factor variable ourself. Create a vector of "Pass" and "Fail" entries and convert it to a factor variable:

pass.fail <- c("Pass", "Fail", "Fail", "Fail", "Fail", "Pass", "Pass")

# What type of variable is this currently (i.e. what type of class of variable)?
class(pass.fail)
# [1] "character"    <--- so just of a bunch of character strings

# Force it to be a factor type variable
pass.fail <- as.factor(pass.fail)
pass.fail
# [1] Pass Fail Fail Fail Fail Pass Pass
# Levels: Fail Pass

class(pass.fail)
# [1] "factor"

Another example of creating a factor variable:

operator.id <- c(10, 12, 11, 10, 11, 10, 12, 11, 10)
op.names <- factor(operator.id,
                   levels=c("10", "11", "12"),
                   labels=c("Pat", "Sarah", "Stef"))
# [1] Pat   Stef  Sarah Pat   Sarah Pat   Stef  Sarah Pat
# Levels: Pat Sarah Stef
is.factor(op.names)
[1] TRUE

You may not even need to create a factor variable in many cases. When you import a data set R will detect and create factors automatically - usually it gets it right - like the "Yes"/"No" baffles variable:

bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv')
bio

   temperature duration speed baffles   yield
1           82      260  4300      No      51
2           90      260  3700     Yes      30
3           88      260  4200     Yes      40
4           86      260  3300     Yes      28
5           80      260  4300      No      49
6           78      260  4300     Yes      49
7           82      260  3900     Yes      44
8           83      260  4300      No      59
9           64      260  4300      No      60
10          73      260  4400      No      59
11          60      260  4400      No      57
12          60      260  4400      No      62
13         101      260  4400      No      42
14          92      260  4900     Yes      38

is.factor(bio$baffles)
[1] TRUE

Fitting a linear model with this integer variable:

model <- lm(bio$yield ~ bio$temperature + bio$speed + bio$baffles)
coef(model)
#  (Intercept)  temperature        speed   bafflesYes
# 52.483652163 -0.470996834  0.008710973 -9.090699955

So the -9.09 is the model coefficient for when the baffles variable is at the "Yes" level. You can view the underlying \(\mathbf{X}\) matrix for this linear model quite easily:

model.matrix(model)

   (Intercept) temperature speed bafflesYes
1            1          82  4300          0
2            1          90  3700          1
3            1          88  4200          1
4            1          86  3300          1
5            1          80  4300          0
6            1          78  4300          1
7            1          82  3900          1
8            1          83  4300          0
9            1          64  4300          0
10           1          73  4400          0
11           1          60  4400          0
12           1          60  4400          0
13           1         101  4400          0
14           1          92  4900          1
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$baffles
[1] "contr.treatment"

These would be the column in the \(\mathbf{X}\) matrix, confirming that the coefficient for baffles is the effect of going from No to Yes.

We point this out, because R will by default create the factors in alphabetical order (0 = "No", "1"="Yes"). But in other cases this default leads to the opposite of what you might want, for example 0="Accept", 1="Reject". You can always reorder an existing factor:

result <- c("Accept", "Accept", "Reject", "Accept", "Accept", "Reject")
result <- factor(result, levels=c("Reject", "Accept"))  # switch the default order around
result
[1] Accept Accept Reject Accept Accept Reject
Levels: Reject Accept

Predictions when integer variables are in the model

As before, we use the predict() function, once we have a data frame containing the new data. Create two observations where the only difference is the baffle indicator:

x.new = data.frame(temperature=82, speed=4200, baffles=c("Yes", "No"))
x.new
#   temperature speed baffles
# 1          82  4200     Yes
# 2          82  4200      No
y.new = predict(model, newdata=x.new, interval="p")
y.new
      fit      lwr      upr
1 41.3573 30.03813 52.67647
2 50.4480 39.23712 61.65888

The above output shows the effect of a baffle, together with the prediction intervals. Does this output match the interpretation of the model coefficient for the baffle variable?