6.7.10. Analysis of designed experiments using PLS models

Data from a designed experiment, particularly a factorial experiment, will have independent columns in \(\mathbf{X}\). These data tables are adequately analyzed using multiple linear regression (MLR) least squares models.

These factorial and fractional factorial data are also well suited to analysis with PLS. Since factorial models support interaction terms, these additional interactions should be added to the \(\mathbf{X}\) matrix. For example, a full factorial design with variables A, B and C will also support the AB, AC, BC and ABC interactions. These four columns should be added to the \(\mathbf{X}\) matrix so that the loadings for these variables are also estimated. If a central composite design, or some other design that supports quadratic terms, has been run, then the quadratic columns should also be added to \(\mathbf{X}\), e.g.: \(\mathbf{\text{A}}^2\), \(\mathbf{\text{B}}^2\) and \(\mathbf{\text{C}}^2\).
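As an illustration of building such an \(\mathbf{X}\) matrix, the following is a minimal sketch for a \(2^3\) full factorial. It assumes the factors are coded as \(-1\) and \(+1\); the variable names (`design`, `X`, `column_names`) are chosen here for illustration only and do not come from any particular software package.

```python
import itertools
import numpy as np

# Coded levels (-1, +1) for a 2^3 full factorial in factors A, B and C.
# The coding and the names below are assumptions made for this sketch.
levels = [-1, 1]
design = np.array(list(itertools.product(levels, repeat=3)), dtype=float)
A, B, C = design.T

# Append the three two-factor interactions and the three-factor interaction.
X = np.column_stack([A, B, C, A * B, A * C, B * C, A * B * C])
column_names = ["A", "B", "C", "AB", "AC", "BC", "ABC"]

# In a full factorial every pair of these columns is orthogonal,
# so X'X is a diagonal matrix (here 8 times the identity).
gram = X.T @ X
print(column_names)
print(gram)
```

The 8-by-7 matrix `X` is what would be passed to the PLS model. For a design that supports quadratic terms, columns such as `A * A` would be appended in the same way.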

The PLS loadings plots from analyzing these DOE data are interpreted in the usual manner, and the coefficient plot is informative if the model has more than two components, \(A>2\).

There are some other advantages of using and interpreting a PLS model built from DOE data, rather than using the MLR approach:

  • If additional data (not the main factors) are captured during the experiments, particularly measurable disturbances, then these additional columns can, and should, be included in \(\mathbf{X}\). These extra data are called covariates in other software packages. The additional columns will remove some of the orthogonality in \(\mathbf{X}\), but this is exactly the situation in which a PLS model is more suitable than MLR, because PLS handles correlated columns.

  • If multiple \(\mathbf{Y}\) measurements were recorded as the response, and particularly if these \(\mathbf{Y}\) variables are correlated, then a PLS model would be better suited than building \(K\) separate MLR models. A good example is where the response variable from the experiment is a complete spectrum of measurements, such as from a NIR probe.
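The loss of orthogonality mentioned in the first point above can be seen in a small sketch. The covariate values here are simulated purely for illustration (they are not measurements from any real experiment), and the names `design`, `covariate` and `corr` are assumptions of this example.

```python
import itertools
import numpy as np

# Coded 2^3 full factorial in A, B and C (8 runs).
design = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)

# Hypothetical covariate: simulated values standing in for a disturbance
# that was measured, but not controlled, during each run.
rng = np.random.default_rng(0)
covariate = rng.normal(loc=20.0, scale=2.0, size=design.shape[0])

# Place the covariate alongside the designed factors in X.
X = np.column_stack([design, covariate])

# Correlation matrix of the columns: A, B and C remain mutually
# uncorrelated, while the covariate column is, in general, correlated
# with each of them to some degree.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

The upper-left 3-by-3 block of `corr` is the identity matrix, while the last row and column contain non-zero entries. It is this partial correlation structure that PLS is designed to accommodate.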

One other point to note when analyzing DOE data with PLS is that the \(Q^2\) values from cross-validation are often very small. This makes intuitive sense: if the factorial levels are suitably spaced, then each experiment is at a point in the process that provides new information. It is unlikely that cross-validation, when leaving out one or more experiments, is able to accurately predict each corner in the factorial.

Lastly, models built from DOE data allow a much stronger interpretation of the loading vectors, \(\mathbf{R:C}\). This time we can infer cause-and-effect behaviour; normally in PLS models the best we can say is that the variables in \(\mathbf{X}\) and \(\mathbf{Y}\) are correlated. Experimental studies that are run in a factorial manner will break happenstance correlation structures, so if any correlation is present, then it truly is causal in nature.