Software tutorial/Linear models with integer variables

From Statistics for Engineering
Jump to navigation Jump to search
← Linear models with multiple X-variables (MLR) (previous step) Tutorial index


<rst> <rst-options: 'toc' = False/> <rst-options: 'reset-figures' = False/> The only additional step required to include an integer variable is to tell R that the variable a ``factor`` or categorical type variable. R will then take care of expanding it into the extra columns required to fit the linear model.

The rest of the tools for linear models are then used as normal, e.g. ``confint(model)``, ``predict(model, ...)`` and so on.

Let's start by creating a factor variable ourself. Create a vector of "Pass" and "Fail" entries and convert it to a factor variable:

.. code-block:: s

pass.fail <- c("Pass", "Fail", "Fail", "Fail", "Fail", "Pass", "Pass")

# What type of variable is this currently (i.e. what type of class of variable)? class(pass.fail) # [1] "character" <--- so just of a bunch of character strings

# Force it to be a factor type variable pass.fail <- as.factor(pass.fail) pass.fail # [1] Pass Fail Fail Fail Fail Pass Pass # Levels: Fail Pass

class(pass.fail) # [1] "factor"

Another example of creating a factor variable:

.. code-block:: s

operator.id <- c(10, 12, 11, 10, 11, 10, 12, 11, 10) op.names <- factor(operator.id, levels=c("10", "11", "12"), labels=c("Pat", "Sarah", "Stef")) # [1] Pat Stef Sarah Pat Sarah Pat Stef Sarah Pat # Levels: Pat Sarah Stef is.factor(op.names) [1] TRUE

You may not even need to create a factor variable in many cases. When you import a data set R will detect and create factors automatically - usually it gets it right - like the "Yes"/"No" baffles variable:

.. code-block:: s

bio <- read.csv('http://openmv.net/file/bioreactor-yields.csv') bio

temperature duration speed baffles yield 1 82 260 4300 No 51 2 90 260 3700 Yes 30 3 88 260 4200 Yes 40 4 86 260 3300 Yes 28 5 80 260 4300 No 49 6 78 260 4300 Yes 49 7 82 260 3900 Yes 44 8 83 260 4300 No 59 9 64 260 4300 No 60 10 73 260 4400 No 59 11 60 260 4400 No 57 12 60 260 4400 No 62 13 101 260 4400 No 42 14 92 260 4900 Yes 38

is.factor(bio$baffles) [1] TRUE Fitting a linear model with this integer variable: .. code-block:: s model <- lm(bio$yield ~ bio$temperature + bio$speed + bio$baffles) coef(model) # (Intercept) temperature speed bafflesYes # 52.483652163 -0.470996834 0.008710973 -9.090699955 So the -9.09 is the model coefficient for when the ``baffles`` variable is at the ``"Yes"`` level. You can view the underlying :math:`\mathbf{X}` matrix for this linear model quite easily: .. code-block:: s model.matrix(model) (Intercept) temperature speed bafflesYes 1 1 82 4300 0 2 1 90 3700 1 3 1 88 4200 1 4 1 86 3300 1 5 1 80 4300 0 6 1 78 4300 1 7 1 82 3900 1 8 1 83 4300 0 9 1 64 4300 0 10 1 73 4400 0 11 1 60 4400 0 12 1 60 4400 0 13 1 101 4400 0 14 1 92 4900 1 attr(,"assign") [1] 0 1 2 3 attr(,"contrasts") attr(,"contrasts")$baffles [1] "contr.treatment"

These would be the column in the :math:`\mathbf{X}` matrix, confirming that the coefficient for ``baffles`` is the effect of going from ``No`` to ``Yes``.

We point this out, because R will by default create the factors in alphabetical order (0 = "No", "1"="Yes"). But in other cases this default leads to the opposite of what you might want, for example 0="Accept", 1="Reject". You can always reorder an existing factor:

.. code-block:: s

result <- c("Accept", "Accept", "Reject", "Accept", "Accept", "Reject") result <- factor(result, levels=c("Reject", "Accept")) # switch the default order around result [1] Accept Accept Reject Accept Accept Reject Levels: Reject Accept


Predictions when integer variables are in the model


As before, we use the ``predict()`` function, once we have a data frame containing the new data. Create two observations where the only difference is the baffle indicator:

.. code-block:: s

x.new = data.frame(temperature=82, speed=4200, baffles=c("Yes", "No")) x.new # temperature speed baffles # 1 82 4200 Yes # 2 82 4200 No y.new = predict(model, newdata=x.new, interval="p") y.new fit lwr upr 1 41.3573 30.03813 52.67647 2 50.4480 39.23712 61.65888

The above output shows the effect of a baffle, together with the prediction intervals. Does this output match the interpretation of the model coefficient for the ``baffle`` variable? </rst>