2.4. Histograms and probability distributions

The previous section has hopefully convinced you that variation in a process is inevitable. This section aims to show how we can visualize and quantify any variability in a recorded vector of data.

A histogram is a summary of the variation in a measured variable. It shows the number of samples that occur in each category: this is called a frequency distribution. For example, the number of children born can be categorized by their sex at birth: male or female.


The raw data in the above example was a vector that consisted of 2739 text entries, with 1420 of them as Male and 1319 of them as Female. In this case Female and Male represent the two categories.
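As a minimal sketch of how such a vector becomes a frequency distribution, the snippet below rebuilds a list with the same counts quoted above and tallies it. The variable names (`births`, `counts`) are illustrative assumptions, and the ordering of entries in the real data is not known.

```python
from collections import Counter

# Rebuild a vector like the one described: 1420 "Male" and 1319 "Female"
# text entries (the ordering here is an assumption; only the totals are known).
births = ["Male"] * 1420 + ["Female"] * 1319

# The frequency distribution is the number of entries in each category.
counts = Counter(births)
total = sum(counts.values())

for category, count in counts.items():
    # Also show each category as a fraction of all entries.
    print(f"{category}: {count} ({count / total:.1%})")
```

The bar heights in a histogram of this variable are exactly these two counts.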

Histograms make sense for categorical variables, but a histogram can also be derived from a continuous variable. Here is an example showing the mass of cartons of 1 kg of flour. The continuous variable, mass, is divided into equal-size bins that cover the range of the available data. Notice how the packaging system has to overfill each carton so that the vast majority of packages weigh over 1 kg (what is the average package mass?). If the variability in the packaging system could be reduced (that is, if the spread of the data were made narrower), then the histogram could be shifted to the left, thereby reducing overfill.

```r
# Create 500 normally distributed points
# with a mean of 1100 and standard deviation
# of 50 units.
data = rnorm(500, mean=1100, sd=50)
hist(data,
     xlab="Mass [g] of each package",
     ylab="Number of packages (N=500)")
```

```python
# Create 500 normally distributed points
# with a mean of 1100 and standard deviation
# of 50 units.
import numpy as np
import matplotlib.pyplot as plt

N = 500
values = np.random.normal(loc=1100, scale=50, size=N)

# White bars need an explicit edge colour to be visible
# on a white background.
plt.hist(values, color="white", edgecolor="black", bins=8)
plt.xlabel("Mass [g] of each package")
plt.ylabel("Number of packages (N={})".format(N))
plt.show()
```
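The binning and the overfill question can also be checked numerically. The sketch below assumes the same simulated masses as the example (mean 1100 g, standard deviation 50 g); the seed and the names `masses` and `fraction_over_1kg` are arbitrary choices for illustration, not part of the original data.

```python
import numpy as np

# Seeded generator so the sketch is repeatable (the seed is arbitrary).
rng = np.random.default_rng(seed=0)
masses = rng.normal(loc=1100, scale=50, size=500)  # simulated masses, in grams

# Divide the observed range into 8 equal-width bins and count
# the number of packages that fall in each bin.
counts, bin_edges = np.histogram(masses, bins=8)

average_mass = masses.mean()
fraction_over_1kg = np.mean(masses > 1000)  # share of packages above 1 kg

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:7.1f} to {right:7.1f} g: {count}")
print(f"Average mass: {average_mass:.1f} g")
print(f"Fraction above 1000 g: {fraction_over_1kg:.1%}")
```

Re-running with a smaller `scale` and a lower `loc` shows how a narrower spread allows the target mass to move closer to 1000 g while keeping most packages above it.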

Try creating a fictitious histogram for each of the following situations:

  • The grades for a class for a really easy test.

  • The numbers thrown from a 6-sided die.

  • The annual income for people in your country.

  • Analytical measurements taken in a laboratory, by the same person or computerized process.

In preparing the above histograms, what have you implicitly inferred about time-scales? These histograms show the long-term distribution (probabilities) of the system being considered. This is why concepts of chance and random phenomena can be used to describe systems and processes. Probabilities can be used to describe our long-term expectations. Let us contrast some long-term and short-term expectations next:

  • A long-term sex ratio at birth of 1.06:1 (boy:girl) is expected in Canada; but a newly pregnant mother would not know the sex of her child.

  • The long-term data from a process shows an 85% output yield from our batch reactor; but tomorrow it could be 59% and the day after that 86%.

  • We know that a fair die has a 16.67% chance of showing a 4 when thrown, but we cannot predict the value of the next throw.
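The contrast between short-term and long-term behaviour can be illustrated with the die example. This is a small simulation sketch, assuming a fair die modelled with NumPy's random generator; the seeds and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for repeatability

# Short term: a handful of throws cannot be predicted in advance.
few_throws = rng.integers(low=1, high=7, size=10)  # 'high' is exclusive
print("First 10 throws:", few_throws)

# Long term: the relative frequency of each face settles near 1/6.
N = 60_000
throws = rng.integers(low=1, high=7, size=N)
counts = np.bincount(throws, minlength=7)[1:]  # drop the unused slot for 0
relative_freq = counts / N

for face, freq in enumerate(relative_freq, start=1):
    print(f"Face {face}: {freq:.4f}")
```

Each relative frequency should land close to 0.1667, even though no single throw in `few_throws` could have been predicted.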

Even if we have complete mechanistic knowledge of our process, the concepts from probability and statistics are useful to summarize and communicate information about past behaviour, and the expected future behaviour.

Steps to creating a frequency distribution, illustrated with 4 examples, labelled A, B, C, and D.

  1. Decide what you are measuring:

    A. acceptable or unacceptable metal appearance: yes/no

    B. number of defects on a metal sheet: none, low, medium, high

    C. yield from the batch reactor: somewhat continuous, but quantized due to rounding to the closest integer

    D. daily ambient temperature, in Kelvin: continuous values

  2. Decide on a resolution for the measurement axis:

    A. acceptable/unacceptable (1/0) code for the metal’s appearance

    B. use a scale from 1 to 4 that grades the defect level (none, low, medium, high)

    C. batch yield is measured in 1% increments, reported as 78, 79, 80, 81%, etc.

    D. temperature is measured to a 0.05 K precision, but we can report the values in bins of 5 K

  3. Report the number of observations in the sample or population that fall within each bin (resolution step):

    A. number of metal pieces with appearance level “acceptable” and “unacceptable” are added up

    B. number of pieces with defect level 1, 2, 3, 4 are counted

    C. number of batches with yield inside each bin level are calculated

    D. number of temperature values inside each bin level are computed

  4. Plot the number of observations in each category as a bar plot. If you plot the number of observations divided by the total number of observations, \(N\), then you are plotting the relative frequency.
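The four steps above can be sketched in code for the temperature example. The data here are hypothetical (simulated around 285 K with an assumed spread of 8 K), since no real temperature record is given; only the 0.05 K precision and the 5 K bin width come from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed, for repeatability

# Step 1: the measured variable (hypothetical daily temperatures, in K).
temperatures = rng.normal(loc=285.0, scale=8.0, size=365)

# Step 2: resolution. Values are recorded to 0.05 K, reported in 5 K bins.
temperatures = np.round(temperatures / 0.05) * 0.05
bin_width = 5.0
low = np.floor(temperatures.min() / bin_width) * bin_width
high = np.ceil(temperatures.max() / bin_width) * bin_width
n_bins = int(round((high - low) / bin_width))
bin_edges = low + bin_width * np.arange(n_bins + 1)

# Step 3: count the observations that fall inside each bin.
counts, _ = np.histogram(temperatures, bins=bin_edges)

# Step 4: report counts, and counts divided by N (the relative frequency).
relative_freq = counts / counts.sum()
for left, count, freq in zip(bin_edges[:-1], counts, relative_freq):
    print(f"{left:5.0f} to {left + bin_width:5.0f} K: {count:3d}  {freq:.3f}")
```

Passing `counts` or `relative_freq` to a bar-plot function would complete step 4 graphically; the text printout is used here only to keep the sketch short.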

A relative frequency, or the closely related density (the relative frequency divided by the bin width), is sometimes preferred:

  • we do not need to report the total number of observations, \(N\)

  • it can be compared to other distributions

  • if \(N\) is large enough, then the relative frequency histogram starts to resemble the population’s distribution

  • when plotted as a density, the area under the histogram is equal to 1, which relates it to probability

```r
# 1000 normally distributed values
N = 1000
values = rnorm(N)

hist(values, freq=TRUE, xlab="Random values",
     cex.lab=1.5, cex.main=1.8, lwd=2,
     cex.sub=1.8, cex.axis=1.8,
     ylab=paste0("Frequency (N=",N,")"))

hist(values, freq=FALSE, xlab="Random values",
     cex.lab=1.5, cex.main=1.8, lwd=2,
     cex.sub=1.8, cex.axis=1.8,
     ylab="Relative density")

# Compare the two plots: only the y-axis
# changes but the general shape remains.
```

```python
# Create 1000 normally distributed points
# with mean of 0 and standard deviation of 1.
import numpy as np
import matplotlib.pyplot as plt

N = 1000
values = np.random.normal(loc=0, scale=1, size=N)

# Left panel: raw counts per bin.
# White bars need an explicit edge colour to be visible.
plt.subplot(1, 2, 1)
plt.hist(values, color="white", edgecolor="black")
plt.ylabel("Frequency (N={})".format(N))

# Right panel: the same data as a density.
# Older matplotlib versions used 'normed=True';
# that argument has since been removed in favour of 'density'.
plt.subplot(1, 2, 2)
plt.hist(values, color="white", edgecolor="black", density=True)
plt.ylabel("Relative density")

plt.tight_layout()
plt.show()
```
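The difference between the two y-axes can be checked numerically. In this sketch (an illustration, with an arbitrary seed and bin count), the relative frequencies are the counts divided by \(N\) and sum to 1, while the density additionally divides by the bin width so that the total bar area equals 1.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed, for repeatability
values = rng.normal(loc=0, scale=1, size=1000)

counts, bin_edges = np.histogram(values, bins=10)
bin_widths = np.diff(bin_edges)

# Relative frequency: the fraction of observations in each bin.
relative_freq = counts / counts.sum()

# Density: relative frequency per unit of the measured variable.
density = relative_freq / bin_widths

# np.histogram can also return the density directly.
density_np, _ = np.histogram(values, bins=10, density=True)

print("Sum of relative frequencies:", relative_freq.sum())
print("Area under the density bars:", (density * bin_widths).sum())
print("Matches NumPy's density:", np.allclose(density, density_np))
```

With equal-width bins the two versions differ only by a constant factor, which is why the shape of the histogram is unchanged between the two plots.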