4.3. Covariance

You probably have an intuitive sense for what it means when two things are correlated. We will get to correlation next, but we start by looking at covariance. Let’s take a look at an example to formalize this, and to see how we can learn from data.

Consider the measurements from a gas cylinder: temperature (K) and pressure (kPa). We know the ideal gas law applies under moderate conditions: \(pV = nRT\).

  • Fixed volume, \(V = 20 \times 10^{-3} \text{m}^3\) = 20 L

  • Moles of gas, \(n = 14.1\) mol of chlorine gas (molar mass = 70.9 g/mol), so this is about 1 kg of gas

  • Gas constant, \(R = 8.314\) J/(mol.K)

Given these numbers, we can simplify the ideal gas law to: \(p=\beta_1 T\), where \(\beta_1 = \dfrac{nR}{V} > 0\). These data are collected from sampling the system:

[Figure: plot of the sampled data]
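As a quick aside, the value of \(\beta_1\) implied by these numbers can be confirmed with a short calculation; this is just a sketch in R, and the variable names are our own:

n     <- 14.1             # mol of chlorine gas
Rgas  <- 8.314            # gas constant, J/(mol.K)
V     <- 20e-3            # volume, m^3
beta1 <- n * Rgas / V     # J/m^3 = Pa, so beta1 is in Pa/K
beta1 / 1000              # about 5.86 kPa/K

This slope is consistent with the sampled data: for example, a pressure of 1600 kPa at a temperature of 273 K gives \(1600/273 \approx 5.9\) kPa/K.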

The formal definition of the covariance between any two variables is given below (the terminology used here was defined in a previous section):

(1)\[ \text{Cov}\left\{x, y\right\} = \mathcal{E}\left\{ (x - \overline{x}) (y - \overline{y})\right\} \qquad \text{where} \qquad \mathcal{E}\left\{ z \right\} = \overline{z}\]

Use this to calculate the covariance between temperature and pressure by breaking the problem into steps:

  • First calculate the deviation variables. They are called this because they are the deviations from the mean: \(T - \overline{T}\) and \(p - \overline{p}\). Subtracting the mean from each vector simply centers its frame of reference at zero.

  • Next multiply the two vectors, element-by-element, to calculate a new vector \((T - \overline{T}) (p - \overline{p})\).

    temp <- c(273, 285, 297, 309, 321, 333, 345, 357, 369, 381)
    pres <- c(1600, 1670, 1730, 1830, 1880, 1920, 2000, 2100, 2170, 2200)
    humidity <- c(42, 48, 45, 49, 41, 46, 48, 48, 45, 49)

    temp.centered <- temp - mean(temp)
    pres.centered <- pres - mean(pres)
    product <- temp.centered * pres.centered
    # R does element-by-element multiplication in the above line

    print(product)
    # [1] 16740 10080  5400  1440   180    60  1620  5700 10920 15660

    # Average of 'product':
    mean(product)
    # 6780

    # Calculated covariance is 7533.33
    paste0('Covariance of temperature and ',
           'pressure is = ', round(cov(temp, pres), 2))

    # The covariance of a variable with
    # itself is just the variance:
    paste0('Covariance with itself is = ', round(cov(temp, temp), 2))
    paste0('while the variance = ', round(var(temp), 2))
  • The expected value of this product can be estimated by using the average, or any other suitable measure of location. In this case mean(product) in R gives 6780. This is the covariance value.

  • More specifically, we should provide the units as well: the covariance between temperature and pressure is 6780 [K.kPa] in this example. Similarly, the covariance between temperature and humidity is about 35.4 [K.%] (see the quick check below).
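As a quick check of that second value, reusing temp, humidity and temp.centered from the code above (humidity.centered is our own name):

humidity.centered <- humidity - mean(humidity)
mean(temp.centered * humidity.centered)   # 35.4, dividing by N
cov(temp, humidity)                       # 39.33: R divides by N-1 (see below)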

In your own time calculate a rough numeric value and give the units of covariance for these cases:

  • \(x\) = age of married partner 1; \(y\) = age of married partner 2

  • \(x\) = gas pressure; \(y\) = gas volume at a fixed temperature

  • \(x\) = mid term mark for this course; \(y\) = final exam mark

  • \(x\) = hours worked per week; \(y\) = weekly take home pay

  • \(x\) = cigarettes smoked per month; \(y\) = age at death

  • \(x\) = temperature on top tray of distillation column; \(y\) = top product purity

Also describe what an outlier observation would mean in these cases.

One last point is that the covariance of a variable with itself is the variance: \(\text{Cov}\left\{x, x\right\} = \mathcal{V}(x) = \mathcal{E}\left\{ (x - \overline{x}) (x - \overline{x})\right\}\), a definition we saw earlier.

Using the cov(temp, pres) function in R gives 7533.333, while we calculated 6780. The difference comes from \(6780 \times \dfrac{N}{N-1}= 7533.33\), indicating that R divides by \(N-1\) rather than \(N\). This is because R computes the variance of a vector x internally as cov(x, x), and since R returns the unbiased variance, it divides through by \(N-1\). The discrepancy does not really matter for large values of \(N\), but it emphasizes that one should always read the documentation for the software being used.
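A short numerical check of this, rebuilding the product vector from the earlier steps:

temp <- c(273, 285, 297, 309, 321, 333, 345, 357, 369, 381)
pres <- c(1600, 1670, 1730, 1830, 1880, 1920, 2000, 2100, 2170, 2200)
product <- (temp - mean(temp)) * (pres - mean(pres))

N <- length(temp)         # 10 observations
mean(product)             # 6780: divides by N
sum(product) / (N - 1)    # 7533.333: matches cov(temp, pres)
cov(temp, pres)           # 7533.333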

Note that deviation variables are not affected by a shift in the raw data of \(x\) or \(y\). For example, measuring temperature in Celsius or Kelvin has no effect on the covariance number, because converting between those scales is only a shift. Measuring it in Celsius vs Fahrenheit, however, does change the covariance value, because the Fahrenheit conversion also rescales the data by a factor of 9/5.
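A small demonstration of this point, using the same data (temp.C and temp.F are our own names; the Fahrenheit conversion is \(T_F = \tfrac{9}{5}T_C + 32\)):

temp <- c(273, 285, 297, 309, 321, 333, 345, 357, 369, 381)
pres <- c(1600, 1670, 1730, 1830, 1880, 1920, 2000, 2100, 2170, 2200)

temp.C <- temp - 273.15       # Kelvin to Celsius: a pure shift
temp.F <- temp.C * 9/5 + 32   # Celsius to Fahrenheit: shift and rescale

cov(temp,   pres)             # 7533.333
cov(temp.C, pres)             # 7533.333: unchanged by the shift
cov(temp.F, pres)             # 13560: larger by a factor of 9/5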

4.4. Correlation

The variance and covariance values are unit-dependent. For example, you get a very different covariance when calculating it using grams rather than kilograms. The correlation, on the other hand, removes the effect of scaling and arbitrary unit changes. It is defined as:

(2)\[ \text{Correlation}\,\,=\,\,r(x, y) = \dfrac{\mathcal{E}\left\{ (x - \overline{x}) (y - \overline{y})\right\}}{\sqrt{\mathcal{V}\left\{x\right\}\mathcal{V}\left\{y\right\}}} = \dfrac{\text{Cov}\left\{x, y\right\}}{\sqrt{\mathcal{V}\left\{x\right\}\mathcal{V}\left\{y\right\}}}\]

It takes the covariance value and divides it by the product of the standard deviations of \(x\) and \(y\), which carry the units of \(x\) and \(y\), to obtain a dimensionless result. The values of \(r(x,y)\) range from \(-1\) to \(+1\). Also note that \(r(x,y) = r(y,x)\).

Returning to our example of the gas cylinder, the correlation between temperature and pressure, and between temperature and humidity, can now be calculated as:

temp <- c(273, 285, 297, 309, 321, 333, 345, 357, 369, 381)
pres <- c(1600, 1670, 1730, 1830, 1880, 1920, 2000, 2100, 2170, 2200)
humidity <- c(42, 48, 45, 49, 41, 46, 48, 48, 45, 49)

# Correlation between temperature
# and pressure is high: 0.9968355
cor(temp, pres)

# Correlation between temperature
# and humidity is low: 0.3803919
cor(temp, humidity)

# What is correlation of humidity
# and pressure?
cor(___, ___)
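To connect this back to equation (2), the same value is obtained by dividing the covariance by the product of the two standard deviations; a short check, reusing the vectors defined in the chunk above:

# Correlation from its definition: Cov{x,y} / sqrt(V{x} V{y})
cov(temp, pres) / (sd(temp) * sd(pres))   # 0.9968355, same as cor(temp, pres)

# Shifting the data does not change the correlation:
cor(temp - 273.15, pres)                  # temperature in Celsius: 0.9968355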

Note that correlation is the same whether we measure temperature in Celsius or Kelvin. Study the plots here to get a feeling for the correlation value and its interpretation:

[Figure: plots illustrating data sets with various correlation values]

4.5. Some definitions

Be sure that you can derive (and interpret!) these relationships, which follow from the definitions of the expected value, variance, and covariance (a numerical check of several of them appears after the list):

  • \(\mathcal{E}\{x\} = \overline{x}\)

  • \(\mathcal{E}\{x+y\} = \mathcal{E}\{x\} + \mathcal{E}\{y\} = \overline{x} + \overline{y}\)

  • \(\mathcal{V}\{x\} = \mathcal{E}\{(x-\overline{x})^2\}\)

  • \(\mathcal{V}\{cx\} = c^2\mathcal{V}\{x\}\)

  • \(\text{Cov}\{x,y\} = \mathcal{E}\{(x-\overline{x})(y-\overline{y})\}\) which we take as the definition for covariance

  • \(\mathcal{V}\{x+x\} = 2\mathcal{V}\{x\} + 2\text{Cov}\{x,x\} = 4\mathcal{V}\{x\}\)

  • \(\text{Cov}\{x,y\} = \mathcal{E}\{xy\} - \mathcal{E}\{x\}\mathcal{E}\{y\}\)

  • \(\text{Cov}\{x,c\} = 0\)

  • \(\text{Cov}\{x+a, y+b\} = \text{Cov}\{x,y\}\)

  • \(\text{Cov}\{ax, by\} = ab \cdot \text{Cov}\{x,y\}\)

  • In general, \(\mathcal{V}\{x+y\} \neq \mathcal{V}\{x\} + \mathcal{V}\{y\}\), which is counter to what might be expected.

  • Rather:

    (3)\[\begin{split}\mathcal{V}\{x+y\} &= \mathcal{E}\{ \left( x+y-\overline{x}-\overline{y} \right)^2 \} \\
    &= \mathcal{E}\{ \left( (x-\overline{x}) + (y-\overline{y}) \right)^2 \} \\
    &= \mathcal{E}\{ (x-\overline{x})^2 + 2(x-\overline{x})(y-\overline{y}) + (y-\overline{y})^2 \}\\
    &= \mathcal{E}\{ (x-\overline{x})^2 \} + 2\mathcal{E}\{(x-\overline{x})(y-\overline{y})\} + \mathcal{E}\{(y-\overline{y})^2 \} \\
    &= \mathcal{V}\{ x \} + 2\text{Cov}\{x,y\} + \mathcal{V}\{ y \}\\
    \mathcal{V}\{x+y\} &= \mathcal{V}\{x\} + \mathcal{V}\{y\}, \qquad\text{only when $\text{Cov}\{x,y\} = 0$, for example when $x$ and $y$ are independent}\end{split}\]
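As mentioned before the list, several of these identities can be verified numerically. A sketch using the gas cylinder data (these particular identities also hold exactly for the sample estimators that R uses):

temp <- c(273, 285, 297, 309, 321, 333, 345, 357, 369, 381)
pres <- c(1600, 1670, 1730, 1830, 1880, 1920, 2000, 2100, 2170, 2200)

var(3 * temp)                     # equals 9 * var(temp)
9 * var(temp)

cov(temp + 10, pres - 5)          # a shift leaves the covariance unchanged
cov(temp, pres)

cov(2 * temp, 3 * pres)           # equals 2 * 3 * cov(temp, pres)
6 * cov(temp, pres)

cov(temp, rep(4, length(temp)))   # covariance with a constant is zero

var(temp + pres)                  # equals var(temp) + var(pres) + 2*cov(temp, pres)
var(temp) + var(pres) + 2 * cov(temp, pres)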