2.16. Exercises

Question 1

Recall that $μ = E (x) = \frac{1}{N} \sum x$ and $V \{x\} = E \{(x - μ)^{2}\} = σ^{2} = \frac{1}{N} \sum (x - μ)^{2}$ .

What is the expected value thrown of a fair 6-sided die? (Note: plural of die is dice)
What is the expected variance of a fair 6-sided die?

Question 2

Characterizing a distribution: Compute the mean, median, standard deviation and MAD for salt content for the various soy sauces given in this report (page 41) as described in the article from the Globe and Mail on 24 September 2009. Plot a box plot of the data and report the interquartile range (IQR). Comment on the 3 measures of spread you have calculated: standard deviation, MAD, and interquartile range.

The raw data are given below in units of milligrams of salt per 15 mL serving:

[460, 520, 580, 700, 760, 770, 890, 910, 920, 940, 960, 1060, 1100]
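As a starting point, the summary statistics requested above can be sketched in R. This is a minimal sketch, not a model answer: the variable names are arbitrary, `mad()` applies R's default scaling constant of 1.4826, and `IQR()` uses R's default quantile definition, so hand calculations with other conventions may differ slightly.

```r
# Salt content in mg per 15 mL serving (values copied from the question)
salt <- c(460, 520, 580, 700, 760, 770, 890, 910, 920, 940, 960, 1060, 1100)

salt.mean   <- mean(salt)
salt.median <- median(salt)
salt.sd     <- sd(salt)     # classical measure of spread
salt.mad    <- mad(salt)    # robust measure of spread, scaled by 1.4826 by default
salt.iqr    <- IQR(salt)    # 75th percentile minus 25th percentile

# boxplot(salt, ylab="Salt content [mg per 15 mL]")   # uncomment to draw the box plot

c(mean=salt.mean, median=salt.median, sd=salt.sd, MAD=salt.mad, IQR=salt.iqr)
```

Comparing the three spread values side by side is the basis for the comment the question asks for.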


Question 3

Give a reason why Statistics Canada reports the median income when reporting income by geographic area. Where would you expect the mean to lie, relative to the median? Use this table to look up the income for Hamilton. How does it compare to Toronto? And all of Canada?


Question 4

Use the data set on raw materials.

How many variables in the data set?

How many observations?

The data are properties of a powder. Plot each variable, one at a time, and locate any outliers. R-users will benefit from the R tutorial (see the use of the identify function).


Question 5

Write a few notes on the purpose of feedback control, and its effect on variability of process quality.

Question 6

Use the section on Historical data from Environment Canada’s website and use the Customized Search option to obtain data for the HAMILTON A station from 2000 to 2009. Use the settings as Year=2000, and Data interval=Monthly and request the data for 2000, then click Next year to go to 2001 and so on.

For each year from 2000 to 2009, get the total snowfall and the average of the Mean temp over the whole year (the sums and averages are reported at the bottom of the table).

Plot these 2 variables against time

Now retrieve the long-term averages for these data from a different section of their website (use the same location, HAMILTON A, and check that the data range is 1971 to 2000). Superimpose the long-term average as a horizontal line on your previous plot.

Note: the purpose of this exercise is more for you to become comfortable with web-based data retrieval, which is common in most companies.

Note: please use any other city for this question if you prefer.

Question 7

Does the number of visits in the website traffic data set follow a normal distribution? If so, what are the parameters for the distribution? What is the likelihood that you will have between 10 and 30 visits to the website?


Question 8

The ammonia concentration in your wastewater treatment plant is measured every 6 hours. The data for one year are available from the dataset website.

Use a visualization plot to hypothesize from which distribution the data might come. Which distribution do you think is most likely? Once you’ve decided on a distribution, use a qq-plot to test your decision.
Estimate location and spread statistics assuming the data are from a normal distribution. You can investigate using the fitdistr function in R, in the MASS package.
What if you were told the measured values are not independent? How does that affect your answer?
What is the probability of having an ammonia concentration greater than 40 mg/L when:
- you may use only the data (do not use any estimated statistics)
- you use the estimated statistics for the distribution?
Note: Answer this entire question using computer software to calculate values from the normal distribution. But also make sure you can answer the last part of the question by hand (when given the mean and variance), using a table of normal distributions.

Question 9

We take a large bale of polymer composite from our production line and using good sampling techniques, we take 9 samples from the bale and measure the viscosity in the lab for each sample. These samples are independent estimates of the population (bale) viscosity. We will believe these samples follow a normal distribution (we could confirm this in practice by running tests and verifying that samples from any bale are normally distributed). Here are 9 sampled values: 23, 19, 17, 18, 24, 26, 21, 14, 18.

Calculate the sample average.

Calculate an estimate of the standard deviation.

What is the distribution of the sample average, $\overset{―}{x}$ ? What are the parameters of that distribution?

Additional information: I use a group of samples and calculate the mean, $\overset{―}{x}$ , then I take another group of samples and calculate another $\overset{―}{x}$ , and so on. Those values of $\overset{―}{x}$ are not going to be the same, but they should be similar. In other words, the $\overset{―}{x}$ also has a distribution. So this question asks what that distribution is, and what its parameters are.

Construct an interval, symbolically, that will contain, with 95% certainty (probability), the population mean of the viscosity.

Additional information: To answer this part, you should move everything to $z$-coordinates first. Then you need to find the points $- c$ and $+ c$ in the following diagram that mark the boundary for 95% of the total area under the distribution. This region is an interval that will contain, with 95% certainty, the population mean of the viscosity, $μ$ . Write your answer in the form: $LB < μ < UB$ .

Now assume that for some hypothetical reason we know the standard deviation of the bale’s viscosity is $σ = 3.5$ units. Calculate the population mean’s interval numerically.

Additional information: In this part you are just finding the values of $LB$ and $UB$
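The last part can be sketched in R, assuming the symbolic interval is obtained by rearranging $- c \leq \frac{\overset{―}{x} - μ}{σ / \sqrt{n}} \leq + c$ for $μ$. The variable names are illustrative only, and the critical value comes from `qnorm` because $σ$ is taken as known.

```r
# The 9 sampled viscosity values from the question
visc  <- c(23, 19, 17, 18, 24, 26, 21, 14, 18)
n     <- length(visc)
x.bar <- mean(visc)

sigma <- 3.5            # population standard deviation, given in the question
c.n   <- qnorm(0.975)   # leaves 2.5% of the area in each tail

# Rearranged z-interval: x.bar -/+ c * sigma / sqrt(n)
LB <- x.bar - c.n * sigma / sqrt(n)
UB <- x.bar + c.n * sigma / sqrt(n)
c(LB=LB, UB=UB)
```

If $σ$ were not known, the same structure would apply with $s$ in place of $σ$ and `qt(0.975, df=n-1)` in place of `qnorm(0.975)`.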


Question 10

You are responsible for the quality of maple syrup produced at your plant. Historical data show that the standard deviation of the syrup viscosity is 40 cP. How many lab samples of syrup must you measure so that an estimate of the syrup’s long-term average viscosity is inside a range of 60 cP, 95% of the time? This question is like the previous one: except this time you are given the range of the interval $UB - LB$ , and you need to find $n$ .
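One way to set this up, assuming the 60 cP refers to the full width $UB - LB$ of a normal-based interval with known $σ$ and critical value $c_{n} \approx 1.96$, is to solve the interval width for $n$:

```latex
\begin{aligned}
UB - LB &= 2 \, c_{n} \frac{\sigma}{\sqrt{n}} \\
n &= \left( \frac{2 \, c_{n} \, \sigma}{UB - LB} \right)^{2}
   = \left( \frac{(2)(1.96)(40)}{60} \right)^{2} \approx 6.8
\end{aligned}
```

Since $n$ must be a whole number of samples, the result is rounded up.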


Question 11

Your manager is asking for the average viscosity of a product that you produce in a batch process. Recorded below are the 12 most recent values, taken from consecutive batches. State any assumptions, and clearly show the calculations which are required to estimate a 95% confidence interval for the mean. Interpret that confidence interval for your manager, who is not sure what a confidence interval is.

Raw data: [13.7, 14.9, 15.7, 16.1, 14.7, 15.2, 13.9, 13.9, 15.0, 13.0, 16.7, 13.2]

Mean: 14.67

Standard deviation: 1.16

Ensure you can also complete the question by hand, using statistical tables.
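A software check of the hand calculation might look like the sketch below. It assumes the 12 values are independent and that the $t$-distribution with $n - 1$ degrees of freedom applies because the standard deviation is estimated; the variable names are illustrative.

```r
# The 12 most recent viscosity values from the question
visc  <- c(13.7, 14.9, 15.7, 16.1, 14.7, 15.2, 13.9, 13.9, 15.0, 13.0, 16.7, 13.2)
n     <- length(visc)
x.bar <- mean(visc)
s     <- sd(visc)

c.t <- qt(0.975, df=n-1)   # 2.5% in each tail, 11 degrees of freedom
LB  <- x.bar - c.t * s / sqrt(n)
UB  <- x.bar + c.t * s / sqrt(n)
c(LB=LB, UB=UB)

# Cross-check against R's built-in one-sample t-test
t.test(visc, conf.level=0.95)$conf.int
```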

Question 12

A new wastewater treatment plant is being commissioned and part of the commissioning report requires a statement of the confidence interval of the biochemical oxygen demand (BOD). How many samples must you send to the lab to be sure the true BOD is within a range of 2 mg/L, centered about the sample average? If there isn’t enough information given here, specify your own numbers and assumptions and work with them to answer the question.

Question 13

One of the questions we posed at the start of this chapter was: Here are the yields from a batch bioreactor system for the last 3 years (300 data points; we run a new batch about every 3 to 4 days).

What sort of distribution do the yield data have?
A recorded yield value was less than 60%. What are the chances of that occurring? Express your answer as: there’s a 1 in n chance of it occurring.
Which assumptions do you have to make for the second part of this question?

Question 14

One aspect of your job responsibility is to reduce energy consumption on the plant floor. You ask the electrical supplier for the energy requirements (W.h) for running a particular light fixture for 24 hours. They won’t give you the raw data, only their histogram when they tested randomly selected bulbs (see the data and code below).

bin.centers <- c(4025, 4075, 4125, 4175, 4225, 4275, 4325, 4375)
bin.counts <- c(4, 19, 14,  5,  4,  1,  2,  1)
barplot(bin.counts, names.arg=bin.centers, ylab="Number of bulbs (N=50)",
        xlab="Energy required over 24 hours (W.h)", col="White", ylim=c(0,20))

Calculate an estimate of the mean and standard deviation, even though you don’t have the original data.

What is a 95% confidence interval for the mean? State and test any assumptions you need to make.
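One possible approach to the first part, sketched below, treats every bulb in a bin as if it sat exactly at the bin center. That is an assumption made only because the raw data are unavailable, so the results are estimates; the names `x.mean` and `x.sd` are illustrative.

```r
bin.centers <- c(4025, 4075, 4125, 4175, 4225, 4275, 4325, 4375)
bin.counts  <- c(4, 19, 14, 5, 4, 1, 2, 1)
N <- sum(bin.counts)

# Weighted mean: each bin center counts once per bulb in that bin
x.mean <- sum(bin.counts * bin.centers) / N

# Weighted standard deviation, with N - 1 in the denominator
x.sd <- sqrt(sum(bin.counts * (bin.centers - x.mean)^2) / (N - 1))

c(mean=x.mean, sd=x.sd)
```

The same numbers result from rebuilding a pseudo data set with `rep(bin.centers, bin.counts)` and calling `mean()` and `sd()` on it.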


Question 15

The confidence interval for the population mean takes one of two forms below, depending on whether we know the variance or not. At the 90% confidence level, for a sample size of 13, compare and comment on the upper and lower bounds for the two cases. Assume that $s = σ = 3.72$ .

$\begin{array}{r} \begin{array}{rcccl} - c_{n} & \leq & \frac{\overset{―}{x} - μ}{σ / \sqrt{n}} & \leq & c_{n} \\ - c_{t} & \leq & \frac{\overset{―}{x} - μ}{s / \sqrt{n}} & \leq & c_{t} \end{array} \end{array}$

Question 16

A major aim of many engineers is/will be to reduce the carbon footprint of their company’s high-profile products. Next week your boss wants you to evaluate a new raw material that requires 2.6 $\frac{\text{kg CO}_{2}}{\text{kg product}}$ less than the current material, but the final product’s brittleness must be the same as achieved with the current raw material. This is a large reduction in $\text{CO}_{2}$ , given your current production capacity of 51,700 kg of product per year. Manpower and physical constraints prevent you from running a randomized test; you don’t have a suitable database of historical data either.

One idea you come up with is to use to your advantage the fact that your production line has three parallel reactors, TK104, TK105, and TK107. They were installed at the same time, they have the same geometry, the same instrumentation, etc; you have pretty much thought about every factor that might vary between them, and are confident the 3 reactors are identical. Typical production schedules split the raw material between the 3 reactors. Data on the website contain the brittleness values from the three reactors for the past few runs on the current raw material.

Which two reactors would you pick to run your comparative trial on next week?

Repeat your calculations assuming pairing.


Question 17

Use the website traffic data from the dataset website:

Write down, symbolically, the z-value for the difference in average visits on a Friday and Saturday.
Estimate a suitable value for the variance and justify your choice.
What is the probability of obtaining a z-value of this magnitude or smaller? Would you say the difference is significant?
Pick any other 2 days that you would find interesting to compare and repeat your analysis.

Solution

Let our variable of interest be the difference between the average of the 2 groups: ${\overset{―}{x}}_{Fri} - {\overset{―}{x}}_{Sat}$ . This variable will be distributed normally (why? - see the notes) according to ${\overset{―}{x}}_{Fri} - {\overset{―}{x}}_{Sat} \sim N (μ_{Fri} - μ_{Sat}, σ_{diff}^{2})$ . So the z-value for this variable is: $z = \frac{({\overset{―}{x}}_{Fri} - {\overset{―}{x}}_{Sat}) - (μ_{Fri} - μ_{Sat})}{σ_{diff}}$
The variance of the difference, $σ_{diff}^{2} = σ^{2} (\frac{1}{n_{Fri}} + \frac{1}{n_{Sat}})$ , where $σ^{2}$ is the variance of the number of visits to the website on Friday and Saturday. Since we don’t know that value, we can estimate it from pooling the 2 variances of each group. We should first check that these variances are comparable (they are; but you should confirm this yourself).

$\begin{aligned} σ^{2} \approx s_{P}^{2} & = \frac{(n_{Fri} - 1) s_{Fri}^{2} + (n_{Sat} - 1) s_{Sat}^{2}}{n_{Fri} - 1 + n_{Sat} - 1} \\ & = \frac{29 \times 45.56 + 29 \times 48.62}{58} \\ & = 47.09 \end{aligned}$

The z-value calculated from this pooled variance is:

$z = \frac{20.77 - 15.27}{\sqrt{47.09 (\frac{1}{30} + \frac{1}{30})}} = 3.1$

But since we used an estimated variance, we cannot say that $z$ comes from the normal distribution anymore. It now follows the $t$-distribution with 58 degrees of freedom (which is still comparable to the normal distribution - see Question 20 below). The corresponding probability that $z < 3.1$ is 99.85%, using the $t$-distribution with 58 degrees of freedom. This difference is significant; there is a very small probability that this difference is due to chance alone.
The code was modified to generate the matrix of z-value results in the comments below. The largest difference is between Saturday and Wednesday, and the smallest difference is between Monday and Tuesday.

website <- read.csv('http://openmv.net/file/website-traffic.csv')
attach(website)

visits.Mon <- Visits[DayOfWeek=="Monday"]
visits.Tue <- Visits[DayOfWeek=="Tuesday"]
visits.Wed <- Visits[DayOfWeek=="Wednesday"]
visits.Thu <- Visits[DayOfWeek=="Thursday"]
visits.Fri <- Visits[DayOfWeek=="Friday"]
visits.Sat <- Visits[DayOfWeek=="Saturday"]
visits.Sun <- Visits[DayOfWeek=="Sunday"]

# Look at a boxplot of the data from Friday and Saturday
bitmap('website-boxplot.png', type="png256", width=7, height=7, 
    res=250, pointsize=14) 
par(mar=c(4.2, 4.2, 0.2, 0.2))  # (bottom, left, top, right)
boxplot(visits.Fri, visits.Sat, names=c("Friday", "Saturday"), ylab="Number of visits", 
        cex.lab=1.5, cex.main=1.8, cex.sub=1.8, cex.axis=1.8)
dev.off()

# Use the "group_difference" function from question 4
group_difference(visits.Sat, visits.Fri)
# z = 3.104152
# t.critical = 0.9985255 (1-0.001474538)

# All differences: z-values
# ----------------------------
#      Mon        Tue      Wed      Thu      Fri       Sat        Sun
# Mon  0.0000000        NA       NA       NA       NA        NA   NA
# Tue -0.2333225  0.000000       NA       NA       NA        NA   NA
# Wed -0.7431203 -0.496627 0.000000       NA       NA        NA   NA
# Thu  0.8535025  1.070370 1.593312 0.000000       NA        NA   NA
# Fri  2.4971347  2.683246 3.249602 1.619699 0.000000        NA   NA
# Sat  5.4320361  5.552498 6.151868 4.578921 3.104152  0.000000   NA
# Sun  3.9917201  4.141035 4.695493 3.166001 1.691208 -1.258885    0
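The `group_difference` function called above is not shown in this section. A hypothetical version, reconstructed only from the pooled-variance formulas in this solution, could look like the sketch below. The argument order (second group minus first group) and the names of the returned list elements are assumptions, chosen so that `group_difference(visits.Sat, visits.Fri)` gives a positive z-value as in the comments above.

```r
# Hypothetical reconstruction; not the original function from the course material
group_difference <- function(groupA, groupB) {
  nA  <- length(groupA)
  nB  <- length(groupB)
  dof <- nA + nB - 2

  # Pooled variance of the two groups
  s2.pooled <- ((nA - 1) * var(groupA) + (nB - 1) * var(groupB)) / dof

  # z-value for the difference in means (group B minus group A)
  z <- (mean(groupB) - mean(groupA)) / sqrt(s2.pooled * (1/nA + 1/nB))

  # Cumulative probability of z under the t-distribution
  list(z=z, t.critical=pt(z, df=dof))
}

# Small made-up example, only to show the calling convention
group_difference(c(1, 2, 3), c(2, 3, 4))
```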

Question 18

You plan to run a series of 22 experiments to measure the economic advantage, if any, of switching to a corn-based raw material, rather than using your current sugar-based material. You can only run one experiment per day, and there is a high cost to change between raw material dispensing systems. Describe two important precautions you would implement when running these experiments, so you can be certain your results will be accurate.

Question 19

There are two analytical techniques for measuring biochemical oxygen demand (BOD). You wish to evaluate the two testing procedures, so that you can select the test which has lower cost, and fastest turn-around time, but without a compromise in accuracy. The table contains the results of each test, performed on a sample that was split in half.

Is there a statistical difference in accuracy between the two methods?

Review the raw data and answer whether there is a practical difference in accuracy.

| Dilution method | Manometric method |
|---|---|
| 11 | 25 |
| 26 | 3 |
| 18 | 27 |
| 16 | 30 |
| 20 | 33 |
| 12 | 16 |
| 8 | 28 |
| 26 | 27 |
| 12 | 12 |
| 17 | 32 |
| 14 | 16 |
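Because each sample was split in half, one reasonable way to set up the comparison is as a paired test on the row-by-row differences. The sketch below assumes that each row of the table is one split sample and that the differences are independent and roughly normally distributed; the variable names are illustrative.

```r
dilution   <- c(11, 26, 18, 16, 20, 12,  8, 26, 12, 17, 14)
manometric <- c(25,  3, 27, 30, 33, 16, 28, 27, 12, 32, 16)

# Paired differences, one per split sample
w <- dilution - manometric
n <- length(w)

# 95% confidence interval for the mean difference, from the t-distribution
c.t <- qt(0.975, df=n-1)
LB  <- mean(w) - c.t * sd(w) / sqrt(n)
UB  <- mean(w) + c.t * sd(w) / sqrt(n)
c(LB=LB, UB=UB)

# Cross-check against R's built-in paired t-test
t.test(dilution, manometric, paired=TRUE)$conf.int
```

Whether the interval spans zero answers the statistical question; the practical question still requires looking at the individual rows.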

Question 20

Plot the cumulative probability function for the normal distribution and the $t$ -distribution on the same plot.

Use 6 degrees of freedom for the $t$-distribution.

Repeat the plot for a larger number of degrees of freedom.

At which point is the $t$ -distribution indistinguishable from the normal distribution?

What is the practical implication of this result?

Solution

z <- seq(-5, 5, 0.1)
norm <- pnorm(z)

bitmap('normal-t-comparison.png', type="png256", width=12, height=7, 
    res=300, pointsize=14) 
par(mar=c(4.2, 4.2, 2.2, 0.2))

layout(matrix(c(1,2), 1, 2))
plot(z, norm, type="p", pch=".", cex=5, main="Normal and t-distribution (df=6)", 
   ylab="Cumulative probability")
lines(z, pt(z, df=6), type="l", lwd=2)
legend(0.5, y=0.35, legend=c("Normal distribution", "t-distribution (df=6)"), 
   pch=c(".", "-"), pt.cex=c(5, 2))

plot(z, norm, type="p", pch=".", cex=5, main="Normal and t-distribution (df=35)", 
   ylab="Cumulative probability")
lines(z, pt(z, df=35), type="l", lwd=2)
legend(0.5, y=0.35, legend=c("Normal distribution", "t-distribution (df=35)"), 
   pch=c(".", "-"), pt.cex=c(5, 2))
dev.off()

The above source code and figure output show that the $t$-distribution starts being indistinguishable from the normal distribution after about 35 to 40 degrees of freedom. This means that when we deal with large sample sizes (over 40 or 50 samples), then we can use critical values from the normal distribution rather than the $t$-distribution. Furthermore, it indicates that our estimate of the variance is a pretty good estimate of the population variance for largish sample sizes.

Question 21

Explain why tests of differences are insensitive to unit changes. If this were not the case, then one could show a significant difference for a weight-loss supplement when measuring waist size in millimetres, yet show no significant difference when measuring in inches!

Question 22

A food production facility fills bags with potato chips. The advertised bag weight is 35.0 grams. However, the current bagging system is set to fill bags with a mean weight of 37.4 grams, and this is done so that only 1% of bags have a weight of 35.0 grams or less.

Back-calculate the standard deviation of the bag weights, assuming a normal distribution.

Out of 1000 customers, how many are lucky enough to get 40.0 grams or more of potato chips in their bags?
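The back-calculation can be sketched in R by noting that the advertised weight sits at the 1% point of the assumed normal distribution, so $\frac{35.0 - 37.4}{σ}$ equals the corresponding $z$-value. The variable names below are illustrative.

```r
target     <- 37.4   # mean fill weight [g]
advertised <- 35.0   # advertised weight [g]

# z-value that leaves 1% of the area in the lower tail
z.low <- qnorm(0.01)

# Solve (advertised - target) / sigma = z.low for sigma
sigma <- (advertised - target) / z.low

# Fraction of bags at or above 40.0 g, scaled to 1000 customers
p.over <- 1 - pnorm(40.0, mean=target, sd=sigma)
c(sigma=sigma, customers=1000 * p.over)
```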


Question 23

A food production facility fills bags with potato chips with an advertised bag weight of 50.0 grams.

The government’s Weights and Measures Act requires that at most 1.5% of customers may receive a bag containing less than the advertised weight. At what setting should you put the target fill weight to meet this requirement exactly? The check-weigher on the bagging system shows the long-term standard deviation for weight is about 2.8 grams.
Out of 100 customers, how many are lucky enough to get 55.0 grams or more of potato chips in their bags?

Question 24

The following confidence interval is reported by our company for the amount of sulphur dioxide measured in parts per billion (ppb) that we send into the atmosphere.

$123.6 ppb \leq μ \leq 240.2 ppb$

Only $n = 21$ raw data points (one data point measured per day) were used to calculate that 90% confidence interval. A $z$ -value would have been calculated as an intermediate step to get the final confidence interval, where $z = \frac{\overset{―}{x} - μ}{s / \sqrt{n}}$ .

What assumptions were made about those 21 raw data points to compute the above confidence interval?
Which lower and upper critical values would have been used for $z$ ? That is, which critical values are used before unpacking the final confidence interval as shown above.
What is the standard deviation, $s$ , of the raw data?
Today’s sulphur dioxide reading is 460 ppb and your manager wants to know what’s going on; you can quickly calculate the probability of seeing a value of 460 ppb, or greater, to help judge the severity of the pollution. How many days in a 365 calendar-day year are expected to show a sulphur dioxide value of 460 ppb or higher?
Explain clearly why a wide confidence interval is not desirable, from an environmental perspective.

Solution

The 21 data points are independent and come from any distribution of finite variance.
From the $t$ -distribution at 20 degrees of freedom, with 5% in each tail: $c_{t} = 1.72$ = qt(0.95, df=20). The $t$ -distribution is used because the standard deviation is estimated, rather than being a population deviation.
The standard deviation may be calculated from:

$\begin{aligned} U B - L B = 240.2 - 123.6 = 2 \times c_{t} \frac{s}{\sqrt{n}} & = (2) (1.72) \frac{s}{\sqrt{n}} \\ s & = \frac{(116.6) (\sqrt{21})}{(2) (1.72)} \\ s & \approx 155 ppb \end{aligned}$

Note the very large standard deviation relative to the confidence interval range. This is the reason why so many data points were taken (21) to calculate the average: the raw data come from a distribution with such a large variation.

Note also that the estimated standard deviation is so large that, were the raw data normally distributed, it would imply that the distribution produced negative sulphur dioxide concentrations (which is physically impossible). However, when dealing with large samples (21 in this case), the distinction between the normal and the $t$-distribution is minimal. Further, the raw data are not necessarily assumed to be from the normal distribution; they could be from any distribution, including one that is heavy-tailed, such as the F-distribution (see the yellow and green lines in particular).
The probability calculation requires a mean value. Our best guess for the mean is the midpoint of the confidence interval, which is always symmetric about the estimated process mean, $\overset{―}{x} = \frac{240.2 - 123.6}{2} + 123.6 = 181.9$ . Note that this is not the value for $μ$ , since $μ$ is unknown.

$z = \frac{460 - 181.9}{155} \approx 1.79$

Probability is 1 - pt(1.79, df=20) $\approx 0.044$ , or about $0.044 \times 365 \approx 16$ days in the year (some variation is expected, if you have used a statistical table).
A wide confidence interval implies that our sulphur dioxide emissions are extremely variable (the confidence interval bounds are a strong function of the process standard deviation). Some days we are putting more pollution up into the air and balancing it out with lower pollution on other days. Those days with high pollution are more environmentally detrimental.
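The steps of this solution can be reproduced in R as a check. This is a sketch with illustrative variable names; it uses the unrounded critical value from `qt`, so the results may differ slightly from values worked out by hand with a rounded $c_{t}$.

```r
LB <- 123.6
UB <- 240.2
n  <- 21

# Critical value with 5% in each tail (90% confidence interval)
c.t <- qt(0.95, df=n-1)

# Unpack the interval width: UB - LB = 2 * c.t * s / sqrt(n)
s <- (UB - LB) * sqrt(n) / (2 * c.t)

# Midpoint of the interval is the estimated mean
x.bar <- (LB + UB) / 2

# Probability of a reading of 460 ppb or more, and expected days per year
z        <- (460 - x.bar) / s
p.exceed <- 1 - pt(z, df=n-1)
c(s=s, x.bar=x.bar, z=z, days=365 * p.exceed)
```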

Question 25

A common unit operation in the pharmaceutical area is to uniformly blend powders for tablets. One such unit is illustrated below (figure taken from Wikipedia). In this question we consider blending an excipient (an inactive magnesium stearate base), a binder, and the active ingredient. The mixing process is tracked using a wireless near infrared (NIR) probe embedded in a V-blender. The mixer is stopped when the NIR spectra become stable. A new supplier of magnesium stearate is being considered that will save $ 294,000 per year.

The 15 most recent runs with the current magnesium stearate supplier had an average mixing time of 2715 seconds, and a standard deviation of 390 seconds. So far you have run 6 batches from the new supplier, and the average mixing time of these runs is 3115 seconds with a standard deviation of 452 seconds. Your manager is not happy with these results so far - this extra mixing time will actually cost you more money via lost production.

The manager wants to revert back to the original supplier, but is leaving the decision up to you; what would be your advice? Show all calculations and describe any additional assumptions, if required.


Question 26

List an advantage of using a paired test over an unpaired test. Give an example, not from the notes, that illustrates your answer.

Question 27

An unpaired test to distinguish between group A and group B was performed with 18 runs: 9 samples for group A and 9 samples for group B. The pooled variance was 86 units.

Also, a paired test on group A and group B was performed with 9 runs. After calculating the paired differences, the variance of these differences was found to be 79 units.

Discuss, in the context of this example, an advantage of paired tests over unpaired tests. Assume 95% confidence intervals, and that the true result was one of “no significant difference between method A and method B”. Give numeric values from this example to substantiate your answer.

Question 28

You are convinced that a different impeller (mixing blade) shape for your tank will lead to faster, i.e. shorter, mixing times. The choices are either an axial blade or a radial blade, as shown in this figure from Wikipedia.

Before obtaining approval to run some experiments, your team wants you to explain how you will interpret the experimental data. Your reply is that you will calculate the average mixing time from each blade type and then calculate a confidence interval for the difference. A team member asks you what the following 95% confidence intervals would mean:

$- 453 ~seconds \leq μ_{Axial} - μ_{Radial} \leq 390 ~seconds$

$- 21 ~seconds \leq μ_{Axial} - μ_{Radial} \leq 187 ~seconds$

For both cases (a) explain what the confidence interval means in the context of this experiment, and (b) whether the recommendation would be to use radial or axial impellers to get the shortest mixing time.

3. Now assume the result from your experimental test was $- 21 ~seconds \leq μ_{Axial} - μ_{Radial} \leq 187 ~seconds$ ; how can you make the confidence interval narrower?

Question 29

The paper by PJ Rousseeuw, “Tutorial to Robust Statistics”, Journal of Chemometrics, 5, 1-20, 1991 discusses the breakdown point of a statistic.

Describe what the breakdown point is, and give two examples: one with a low breakdown point, and one with a high breakdown point. Use a vector of numbers to help illustrate your answer.
What is an advantage of using robust methods over their “classical” counterparts?

Solution

PJ Rousseeuw defines the breakdown point on page 3 of his paper as “… the smallest fraction of the observations that have to be replaced to make the estimator unbounded. In this definition one can choose which observations are replaced, as well as the magnitude of the outliers, in the least favourable way”.

A statistic with a low breakdown point is the mean: of the $n$ values used to calculate the mean, only 1 needs to be replaced to make the estimator unbounded; i.e. its breakdown point is $1 / n$ . The median, though, has a breakdown point of 50%, as one would have to replace 50% of the $n$ data points in the vector before the estimator becomes unbounded.

Use this vector of data as an example: $[2, 6, 1, 9151616, - 4, 2]$ . The mean is 1525270, while the median is 2.
- Robust methods are insensitive to outliers, which is useful when we need a measure of location or spread that is calculated in an automated way. It is increasingly prevalent to skip the “human” step that might have detected the outlier, because our data sets are getting so large that we can’t possibly visualize or look for outliers manually anymore.
- As described in the above paper by Rousseeuw, robust methods also emphasize outliers. Their “lack of sensitivity to outliers” can also be considered an advantage.
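The example vector can be checked in R with a short sketch. The replacement value of 3 used for the "clean" vector below is hypothetical, chosen only to show how little the median moves compared with the mean.

```r
# Example vector from the solution, with one gross outlier
x <- c(2, 6, 1, 9151616, -4, 2)
c(mean=mean(x), median=median(x), sd=sd(x), MAD=mad(x))

# Replace the outlier with a hypothetical ordinary value
x.clean <- x
x.clean[4] <- 3
c(mean=mean(x.clean), median=median(x.clean), sd=sd(x.clean), MAD=mad(x.clean))
```

The classical statistics (mean, standard deviation) change by orders of magnitude between the two vectors, while the robust ones (median, MAD) barely change.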

Question 30

Why are robust statistics, such as the median or MAD, important in the analysis of modern data sets? Explain, using an example, if necessary.
What is meant by the break-down point of a robust statistic? Give an example to explain your answer.

Solution

Data sets you will have to deal with in the workplace are getting larger and larger (lengthwise), and processing them by trimming outliers (see Question 5 later) manually is almost impossible. Robust statistics are a way to summarize such data sets without point-by-point investigation.

This is especially true for automatic systems that you will build that need to (a) acquire and (b) process the data to then (c) produce meaningful output. These systems have to be capable of dealing with outliers and missing values.
The breakdown point is the number of contaminating data points required before a statistic (estimator) becomes unbounded, i.e. useless. For example, the mean requires only 1 contaminating value, while the median requires 50% + 1 data points before it becomes useless.

Consider the sequence $[2, 6, 1, 91511, - 4, 2]$ . The mean is 15253, while the median is 2, which is a far more useful estimate of the central tendency in the data.

Question 31

Recall that $μ = E (x) = \frac{1}{N} \sum x$ and $V \{x\} = E \{(x - μ)^{2}\} = σ^{2} = \frac{1}{N} \sum (x - μ)^{2}$ .

What is the expected value thrown of a fair, 12-sided die?

What is the expected variance of a fair, 12-sided die?

Simulate 10,000 throws in a software package (R, MATLAB, or Python) from this die and see if your answers match those above. Record the average value from the 10,000 throws, call that average $\overset{―}{x}$ .

Repeat the simulation 10 times, calculating the average value of all the dice throws. Calculate the mean and standard deviation of the 10 $\overset{―}{x}$ values and comment whether the results match the theoretically expected values.

Solution

The objective of this question is to recall basic probability rules.

Each value on the die is equally probable, so the expected value thrown will be:

$E (X) = \sum_{i = 1}^{12} x_{i} P (x_{i}) = P (x) \sum_{i = 1}^{12} x_{i} = \frac{1}{12} (1 + 2 + \dots + 12) = 6.5$

This value is the population mean, $μ$ .
Continuing the notation from the above question we can derive the expected variance as,

$V (X) = \frac{1}{N} \sum_{i}^{12} (x_{i} - μ)^{2} = \frac{1}{12} \cdot [(1 - 6.5)^{2} + (2 - 6.5)^{2} + \dots + (12 - 6.5)^{2}] \approx 11.9167$

Simulating 10,000 throws corresponds to 10,000 independent and mutually exclusive random events, each with an outcome between 1 and 12. The sample mean and variance from my sample, calculated using the R code below, were:

$\begin{aligned} \overset{―}{x} & = 6.5219 \\ s^{2} & = 12.03732 \end{aligned}$

# Set the random seed to a known point, to allow
# us to duplicate pseudorandom results
set.seed(13)

x.data <- as.integer(runif(10000, 1, 13))

# Verify that it is roughly uniformly distributed
# across 12 bins
hist(x.data, breaks=seq(0,12))

x.mean <- mean(x.data)
x.var <- var(x.data)
c(x.mean, x.var)

Repeating the above simulation 10 times (i.e. 10 independent experiments) produces 10 different estimates of $μ$ and $σ^{2}$ . Note, your answer should be slightly different, and different each time you run the simulation.

N <- 10
n <- 10000
x.mean <- numeric(N)
x.var <- numeric(N)
for (i in 1:N) {
  x.data <- as.integer(runif(n, 1, 13))
  x.mean[i] <- mean(x.data)
  x.var[i] <- var(x.data)
}

x.mean
# [1] 6.5527 6.4148 6.4759 6.4967 6.4465 
# [6] 6.5062 6.5171 6.4671 6.5715 6.5485

x.var
# [1] 11.86561 11.84353 12.00102 11.89658 11.82552 
# [6] 11.83147 11.95224 11.88555 11.81589 11.73869

# You should run the code several times and verify whether
# the following values are around their expected, theoretical
# levels.  Some runs should be above, and other runs below 
# the theoretical values.  
# This is the same as increasing "N" in the first line.

# Is it around 6.5?
mean(x.mean)

# Is it around 11.9167?
mean(x.var)

# Is it around \sigma^2 / n = 11.9167/10000 = 0.00119167 ?
var(x.mean)

Note that each $\overset{―}{x} \sim N (μ, σ^{2} / n)$ , where $n = 10000$ . We know what $σ^{2}$ is in this case: it is our theoretical value of 11.92, calculated earlier, and for $n = 10000$ samples, our theoretical expectation is that $\overset{―}{x} \sim N (6.5, 0.00119167)$ .

Calculating the average of those 10 means, let’s call that $\overset{―}{\overset{―}{x}}$ , shows a value close to 6.5, the theoretical mean.

Calculating the variance of those 10 means shows a number around 0.00119167, as expected.

Question 32

Removed. Was a duplicate of a prior question (number 13).

Question 33

At the 95% confidence level, for a sample size of 7, compare and comment on the upper and lower bounds of the confidence interval that you would calculate if:
1. you know the population standard deviation
2. you have to estimate it from the sample.
Assume that the standard deviation calculated from the sample, $s$ , matches the population value, $σ = 4.19$ .
As a follow up, overlay the probability distribution curves for the normal and $t$ -distribution that you would use for a sample of data of size $n = 7$ .
Repeat part of this question, using larger sample sizes. At what sample size do the $t$ - and normal distributions become practically indistinguishable?
What is the implication of this?
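One possible starting point for this question is sketched below. It assumes a two-sided interval and compares only the critical values and interval half-widths; the sample sizes in `n.range` are arbitrary choices for illustration, not values given in the question.

```r
n <- 7          # sample size
s <- 4.19       # standard deviation (assumed equal to the population sigma)
conf <- 0.95    # confidence level
alpha <- 1 - conf

# Critical values: normal (sigma known) and t with n - 1 degrees of freedom
c.z <- qnorm(1 - alpha/2)
c.t <- qt(1 - alpha/2, df = n - 1)

# Half-width of each confidence interval around the sample mean
half.z <- c.z * s / sqrt(n)
half.t <- c.t * s / sqrt(n)
c(half.z, half.t)

# Ratio of t to normal critical values for a range of sample sizes
n.range <- c(7, 15, 30, 60, 120)
ratio <- qt(1 - alpha/2, df = n.range - 1) / c.z
round(cbind(n.range, ratio), 3)
```

The overlay of the two density curves can be drawn with `curve()` together with `dnorm()` and `dt()`.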

Question 34

Engineering data often violate the assumption of independence. In this question you will create (simulate) sequences of autocorrelated data, i.e. data that lack independence, and investigate how lack of independence affects our results.

The simplest form of autocorrelation is what is called lag-1 autocorrelation, when the series of values, $x_{k}$ is correlated with itself only 1 step back in time, $x_{k - 1}$ :

x_{k} = ϕ x_{k - 1} + a_{k}

The $a_{k}$ value is a random error and for this question let $a_{k} \sim N (μ = 0, σ^{2} = 25.0)$ .

Create 3 sequences of autocorrelated data with:

A: $ϕ = + 0.7$ (positively correlated)

B: $ϕ = 0.0$ (uncorrelated data)

C: $ϕ = - 0.6$ (negatively correlated)

For case A, B and C perform the following analysis. Repeat the following 1000 times (let $i = 1, 2, \dots, 1000$ ):

Create a vector of 100 autocorrelated $x$ values using the above formula, using the current level of $ϕ$

Calculate the mean of these 100 values, call it ${\overset{―}{x}}_{i}$ and store the result

At this point you have 1000 ${\overset{―}{x}}_{i}$ values for case A, another 1000 ${\overset{―}{x}}_{i}$ values for case B, and similarly for case C. Now answer these questions:

Assuming independence (which is obviously not correct for 2 of the 3 cases), which population should $\overset{―}{x}$ come from, and what are the 2 parameters of that population?
Now, using your 1000 simulated means, estimate those two population parameters.
Compare your estimates to the theoretical values.

Comment on the results, and the implication of this regarding tests of significance (i.e. statistical tests to see if a significant change occurred or not).

Solution

We expect case B to match the theoretical case most closely, because data from case B are truly independent: the autocorrelation parameter is zero. We expect the case A and C datasets, which violate the assumption of independence, to be biased one way or another. This question aims to show how they are biased.

nsim <- 1000            # Number of simulations
N <- 100                # number of points in autocorrelated sequence
phi <- +0.7             # ** change this line for case A, B and C **
spread <- 5.0           # standard deviation of the random error, a_k
x.mean <- numeric(nsim) # An empty vector to store the results

set.seed(37)            # so that you can reproduce these results
for (i in 1:nsim) {
    x <- numeric(N)
    x[1] <- rnorm(1, mean=0, sd=spread)
    for (k in 2:N) {
        x[k] <- phi*x[k-1] + rnorm(1, mean=0, sd=spread)
    }
    x.mean[i] <- mean(x)
}

# Theoretical standard deviation of the mean, assuming independence
theoretical <- sqrt(spread^2/N)

# Show some output to the user
c(theoretical, mean(x.mean), sd(x.mean))

You should be able to reproduce the results I have below, because the above code uses the set.seed(...) function, which forces R to generate random numbers in the same order on my computer as yours (as long as we all use the same version of R).

Case A: 0.50000000, 0.00428291, 1.65963302
Case B: 0.50000000, 0.001565456, 0.509676562
Case C: 0.50000000, 0.0004381761, 0.3217627596

The first output is the same for all 3 cases: it is the theoretical standard deviation of the distribution from which the ${\overset{―}{x}}_{i}$ values come: ${\overset{―}{x}}_{i} \sim N (μ, σ^{2} / N)$ , where $N = 100$ , the number of points in the autocorrelated sequence. This result comes from the central limit theorem, which tells us that ${\overset{―}{x}}_{i}$ should be normally distributed, with the same mean as our individual $x$ -values, but with smaller variance. That variance is $σ^{2} / N$ , where $σ^{2}$ is the variance of the distribution from which we took the raw $x$ values. The theoretical variance is therefore $25 / 100$ , which gives a theoretical standard deviation of $\sqrt{25 / 100} = 0.5$ .

But, the central limit theorem only has one crucial assumption: that those raw $x$ values are independent. We intentionally violated this assumption for case A and C.

We use the 1000 simulated values of ${\overset{―}{x}}_{i}$ and calculate the average of the 1000 ${\overset{―}{x}}_{i}$ values and the standard deviation of the 1000 ${\overset{―}{x}}_{i}$ values. Those are the second and third values reported above.

We see that in all cases the mean of the 1000 values nearly matches 0.0. If you run the simulations again with a different seed, you will see it sometimes above zero and sometimes below zero for all 3 cases. So we can conclude that lack of independence does not affect the estimated mean.

The major disagreement is in the spread, though. Case B matches the theoretical standard deviation (0.51 against 0.50); data that are positively correlated have an inflated standard deviation, 1.66 when $ϕ = + 0.7$ ; data that are negatively correlated have a deflated standard deviation, 0.32 when $ϕ = - 0.6$ .

This is problematic for the following reason. When doing a test of significance, we construct a confidence interval:

\begin{array}{r} \begin{array}{rcccl} - c_{t} & \leq & \frac{\overset{―}{x} - μ}{s / \sqrt{n}} & \leq & + c_{t} \\ \overset{―}{x} - c_{t} \frac{s}{\sqrt{n}} & \leq & μ & \leq & \overset{―}{x} + c_{t} \frac{s}{\sqrt{n}} \\ LB & \leq & μ & \leq & UB \end{array} \end{array}

We use an estimated standard deviation, $s$ , whether that is found from pooling the variances or found separately (it doesn’t really matter), but the main problem is that $s$ is not accurate when the data are not independent:

For positive correlations (quite common in industrial data): our confidence interval will be too wide, likely spanning zero, indicating no statistical difference, when in fact there might be one.
For negative correlations (less common, but still seen in practice): our confidence interval will be too narrow, more likely to indicate there is a difference.

The main purpose of this question is for you to see how to use simulation to understand what happens when a key assumption is violated. There are cases when an assumption is violated, but it doesn’t affect the result too much.

In this particular example there is a known theoretical relationship between $ϕ$ and the inflated/deflated variance that can be derived (with some difficulty). But in most situations the effect of violating assumptions is too difficult to derive mathematically, so we use computer power to do the work for us. We still have to spend time thinking about and interpreting the results.
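As a rough check on the simulated values, the sketch below uses a standard large-$N$ approximation for a lag-1 autoregressive series: the standard deviation of the mean is approximately $σ_{a} / ((1 - ϕ) \sqrt{N})$ , where $σ_{a}$ is the standard deviation of $a_{k}$ . This approximation is not derived in this text and is stated here as an assumption; it ignores start-up effects from the first point in each sequence, so only approximate agreement with the simulation should be expected.

```r
N <- 100                     # points per sequence, as in the simulation
sigma.a <- 5.0               # standard deviation of the random error a_k
phi <- c(A = 0.7, B = 0.0, C = -0.6)

# Large-N approximation (an assumption, not derived in the text):
# sd(mean) ~ sigma.a / ((1 - phi) * sqrt(N))
sd.approx <- sigma.a / ((1 - phi) * sqrt(N))

# Standard deviations of the 1000 simulated means, reported above
sd.simulated <- c(A = 1.65963302, B = 0.509676562, C = 0.3217627596)

round(rbind(sd.approx, sd.simulated), 3)
```

Setting $ϕ = 0$ recovers the independent case, $σ_{a} / \sqrt{N} = 0.5$ .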

Question 35

Sulphur dioxide is a byproduct from ore smelting, coal-fired power stations, and other sources.

These 11 samples of sulphur dioxide, SO₂, measured in parts per billion [ppb], were taken from our plant. Environmental regulations require us to report the 90% confidence interval for the mean SO₂ value.

$180, 340, 220, 410, 101, 89, 210, 99, 128, 113, 111$

What is the confidence interval that must be reported, given that the sample average of these 11 points is 181.9 ppb and the sample standard deviation is 106.8 ppb?
Why might Environment Canada require you to report the confidence interval instead of the mean?
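A minimal sketch of the interval calculation follows. It assumes the 11 samples are independent and come from a normal distribution, and that the population standard deviation is unknown, so the $t$ -distribution with $n - 1$ degrees of freedom is used; the variable names are illustrative.

```r
so2 <- c(180, 340, 220, 410, 101, 89, 210, 99, 128, 113, 111)  # [ppb]
n <- length(so2)
x.bar <- mean(so2)
s <- sd(so2)

# Two-sided 90% interval: 5% in each tail, n - 1 degrees of freedom
c.t <- qt(0.95, df = n - 1)

LB <- x.bar - c.t * s / sqrt(n)
UB <- x.bar + c.t * s / sqrt(n)
c(LB, x.bar, UB)
```

The same bounds can be cross-checked with `t.test(so2, conf.level = 0.90)$conf.int`.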

Question 36

A concrete slump test is used to test for the fluidity, or workability, of concrete. It’s a crude, but quick test often used to measure the effect of polymer additives that are mixed with the concrete to improve workability.

The concrete mixture is prepared with a polymer additive. The mixture is placed in a mold and filled to the top. The mold is inverted and removed. The height of the mold minus the height of the remaining concrete pile is called the “slump”, as shown in this figure from Wikipedia.

[Figure: the concrete slump test, from Wikipedia (../figures/least-squares/concrete-slump.svg)]

Your company provides the polymer additive, and you are developing an improved polymer formulation, call it B, that hopefully provides the same slump values as your existing polymer, call it A. Formulation B costs less money than A, but you don’t want to upset, or lose, customers by varying the slump value too much.

You have a single day to run your tests (experiments). Preparation, mixing times, measurement and clean up take 1 hour, only allowing you to run 10 experiments. Describe all precautions, and why you take these precautions, when planning and executing your experiment. Be very specific in your answer (use bullet points).
The following slump values were recorded over the course of the day:

Additive	Slump value [cm]
A	5.2
A	3.3
B	5.8
A	4.6
B	6.3
A	5.8
A	4.1
B	6.0
B	5.5
B	4.5

What is your conclusion on the performance of the new polymer formulation (system B)? Your conclusion must either be “send the polymer engineers back to the lab” or “let’s start making formulation B for our customers”. Explain your choice clearly.

To help you, ${\overset{―}{x}}_{A} = 4.6$ and $s_{A} = 0.97$ . For system B: ${\overset{―}{x}}_{B} = 5.62$ and $s_{B} = 0.69$ .

Note: In your answer you must be clear on which assumptions you are using and, where necessary, why you need to make those assumptions.
Describe the circumstances under which you would rather use a paired test for differences between polymer A and B.
What are the advantage(s) of the paired test over the unpaired test?
Clearly explain which assumptions are used for paired tests, and why they are likely to be true in this case.
The slump tests were actually performed in a paired manner, where pairing was performed based on the cement supplier. Five different cement suppliers were used:

Supplier	Slump value [cm] from A	Slump value [cm] from B
1	5.2	5.8
2	3.3	4.5
3	4.6	6.0
4	5.8	5.5
5	4.1	6.2

Use these data, and provide, if necessary, an updated recommendation to your manager.
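The two analyses in this question can be set up along the following lines. This is only a sketch: the 95% confidence level and the pooled-variance option (`var.equal = TRUE`) are assumptions not stated in the question, and the values are entered exactly as printed in the two tables above.

```r
# Unpaired data: values from the first table, grouped by additive
A <- c(5.2, 3.3, 4.6, 5.8, 4.1)
B <- c(5.8, 6.3, 6.0, 5.5, 4.5)

# Unpaired test; var.equal = TRUE pools the two variances (an assumption)
unpaired <- t.test(B, A, var.equal = TRUE, conf.level = 0.95)
unpaired$conf.int

# Paired data: values from the second table, one entry per cement supplier
A.paired <- c(5.2, 3.3, 4.6, 5.8, 4.1)
B.paired <- c(5.8, 4.5, 6.0, 5.5, 6.2)

# Paired test: works on the within-supplier differences, B - A
paired <- t.test(B.paired, A.paired, paired = TRUE, conf.level = 0.95)
paired$conf.int
```

In both cases the interval is for the difference B minus A, so whether it spans zero is what informs the recommendation.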

Question 37

You are planning a series of experiments to test alternative conditions in a store and see which conditions lead to higher sales.

Which practical steps would you take to ensure independence in the experimental data, when investigating:

adjustable halogen lighting: A = soft and dim lighting and B = brighter lighting
alternative shelving: A = solid white metal shelves and B = commercial stainless steel racking

Question 38

This question gives you exposure to analyzing a larger data set than seen in the preceding questions.

Your manager has asked you to describe the flow rate characteristics of the overhead stream leaving the top of the distillation column at your plant. You are able to download one month of data, available from this website, from 1 March to 31 March, taken at one minute intervals to answer this question.
