# Software tutorial/Linear models with integer variables

 ← Linear models with multiple X-variables (MLR) (previous step) Tutorial index

The only additional step required to include an integer variable is to tell R that the variable a factor or categorical type variable. R will then take care of expanding it into the extra columns required to fit the linear model.

The rest of the tools for linear models are then used as normal, e.g. confint(model), predict(model, ...) and so on.

Let's start by creating a factor variable ourself. Create a vector of "Pass" and "Fail" entries and convert it to a factor variable:

pass.fail <- c("Pass", "Fail", "Fail", "Fail", "Fail", "Pass", "Pass")

# What type of variable is this currently (i.e. what type of class of variable)?
class(pass.fail)
# [1] "character"    <--- so just of a bunch of character strings

# Force it to be a factor type variable
pass.fail <- as.factor(pass.fail)
pass.fail
# [1] Pass Fail Fail Fail Fail Pass Pass
# Levels: Fail Pass

class(pass.fail)
# [1] "factor"


Another example of creating a factor variable:

operator.id <- c(10, 12, 11, 10, 11, 10, 12, 11, 10)
op.names <- factor(operator.id,
levels=c("10", "11", "12"),
labels=c("Pat", "Sarah", "Stef"))
# [1] Pat   Stef  Sarah Pat   Sarah Pat   Stef  Sarah Pat
# Levels: Pat Sarah Stef
is.factor(op.names)
[1] TRUE


You may not even need to create a factor variable in many cases. When you import a data set R will detect and create factors automatically - usually it gets it right - like the "Yes"/"No" baffles variable:

bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv')
bio

temperature duration speed baffles   yield
1           82      260  4300      No      51
2           90      260  3700     Yes      30
3           88      260  4200     Yes      40
4           86      260  3300     Yes      28
5           80      260  4300      No      49
6           78      260  4300     Yes      49
7           82      260  3900     Yes      44
8           83      260  4300      No      59
9           64      260  4300      No      60
10          73      260  4400      No      59
11          60      260  4400      No      57
12          60      260  4400      No      62
13         101      260  4400      No      42
14          92      260  4900     Yes      38

is.factor(bio$baffles) [1] TRUE  Fitting a linear model with this integer variable: model <- lm(bio$yield ~ bio$temperature + bio$speed + bio$baffles) coef(model) # (Intercept) temperature speed bafflesYes # 52.483652163 -0.470996834 0.008710973 -9.090699955  So the -9.09 is the model coefficient for when the baffles variable is at the "Yes" level. You can view the underlying $$\mathbf{X}$$ matrix for this linear model quite easily: model.matrix(model) (Intercept) temperature speed bafflesYes 1 1 82 4300 0 2 1 90 3700 1 3 1 88 4200 1 4 1 86 3300 1 5 1 80 4300 0 6 1 78 4300 1 7 1 82 3900 1 8 1 83 4300 0 9 1 64 4300 0 10 1 73 4400 0 11 1 60 4400 0 12 1 60 4400 0 13 1 101 4400 0 14 1 92 4900 1 attr(,"assign") [1] 0 1 2 3 attr(,"contrasts") attr(,"contrasts")$baffles
[1] "contr.treatment"


These would be the column in the $$\mathbf{X}$$ matrix, confirming that the coefficient for baffles is the effect of going from No to Yes.

We point this out, because R will by default create the factors in alphabetical order (0 = "No", "1"="Yes"). But in other cases this default leads to the opposite of what you might want, for example 0="Accept", 1="Reject". You can always reorder an existing factor:

result <- c("Accept", "Accept", "Reject", "Accept", "Accept", "Reject")
result <- factor(result, levels=c("Reject", "Accept"))  # switch the default order around
result
[1] Accept Accept Reject Accept Accept Reject
Levels: Reject Accept


# Predictions when integer variables are in the model

As before, we use the predict() function, once we have a data frame containing the new data. Create two observations where the only difference is the baffle indicator:

x.new = data.frame(temperature=82, speed=4200, baffles=c("Yes", "No"))
x.new
#   temperature speed baffles
# 1          82  4200     Yes
# 2          82  4200      No
y.new = predict(model, newdata=x.new, interval="p")
y.new
fit      lwr      upr
1 41.3573 30.03813 52.67647
2 50.4480 39.23712 61.65888


The above output shows the effect of a baffle, together with the prediction intervals. Does this output match the interpretation of the model coefficient for the baffle variable?