Assignment 2 - 2011 - Solution

From Statistics for Engineering
Revision as of 17:56, 22 September 2018 by Kevin Dunn (talk | contribs)

<rst>
<rst-options: 'toc' = False/>
<rst-options: 'reset-figures' = False/>

.. rubric:: Assignment objectives

- A review of basic probability, histograms and sample statistics.
- Collect data from multiple sources, consolidate it, and analyze it.
- Deal with issues that are prevalent in real data sets.
- Improve your skills with R (if you are using R for the course).

**Notes**:

- I would normally expect you to spend between 3 and 5 hours outside of class on assignments. This assignment should take about that long. Answer with bullet points, not in full paragraphs.
- **Numbers in bold** next to the question are the grading points. Read more about the `assignment grading system <http://stats4eng.connectmv.com/wiki/Assignment_grading_system>`_.
- 600-level students must complete all the questions; 400-level students may attempt the 600-level question for extra credit. Also, 600-level students must read the paper by PJ Rousseeuw, "`Tutorial to Robust Statistics <http://dx.doi.org/10.1002/cem.1180050103>`_".

Question 1 [1]
==============

Recall from class that :math:`\mu = \mathcal{E}(x) = \frac{1}{N}\sum{x}` and :math:`\mathcal{V}\left\{x\right\} = \mathcal{E}\left\{ (x - \mu )^2\right\} = \sigma^2 = \frac{1}{N}\sum{(x-\mu)^2}`.

#. What is the expected value thrown of a fair 12-sided dice?
#. What is the expected variance of a fair 12-sided dice?
#. Simulate 10,000 throws in R, MATLAB, or Python from this dice and see if your answers match those above. Record the average value from the 10,000 throws.
#. Repeat the simulation for the average value of the dice a total of 10 times. Calculate and report the mean and standard deviation of these 10 simulations and *comment* on the results.

Solution
--------
The objective of this question is to recall basic probability rules.

1. Let :math:`X` represent a discrete random variable for the event of throwing a fair die. Let :math:`x_{i}` for :math:`i=1,\ldots,12` represent the numerical or realized values of the outcome of the random event given by :math:`X`. Now we can define the expected value of :math:`X` as,

   .. math::
       \mathcal{E}(X)=\sum_{i=1}^{12}x_{i}P(x_{i})

   where the probability of obtaining a value of :math:`1,\ldots,12` is :math:`P(x_{i})=1/N=1/12 \;\forall\; i=1,\ldots,12`. So, we have,

   .. math::
       \mathcal{E}(X)=\frac{1}{N}\sum_{i=1}^{12}x_{i}=\frac{1}{12}\left(1+2+\cdots+12\right)=\bf{6.5}
2. Continuing the notation from the above question, we can derive the expected variance as,

   .. math::
       \mathcal{V}(X) &= \mathcal{E}\left\{[X-\mathcal{E}(X)]^{2}\right\}\\
                      &= \mathcal{E}(X^{2})-[\mathcal{E}(X)]^{2}

   where :math:`\mathcal{E}(X^{2})=\sum_{i}x_{i}^{2}P(x_{i})`. So we can now calculate :math:`\mathcal{V}(X)` as,

   .. math::
       \mathcal{V}(X) &= \sum_{i=1}^{12}x_{i}^{2}P(x_{i})-\left[\sum_{i=1}^{12}x_{i}P(x_{i})\right]^{2}\\
                      &= \frac{1}{12}(1^{2}+2^{2}+\cdots+12^{2}) - [6.5]^{2}\approx \bf{11.9167}
3. Simulating 10,000 throws corresponds to 10,000 independent random events, each with an outcome in the set :math:`\mathcal{S}=\{1,2,\ldots,12\}`. The sample mean and variance from my sample were:

   .. math::
       \overline{x} &= 6.4925\\
       s^2 &= 11.77915

.. twocolumncode::
    :code1: ../che4c3/Assignments/Assignment-2/code/q1c.R
    :language1: s
    :header1: R code
    :code2: ../che4c3/Assignments/Assignment-2/code/q1c.m
    :language2: matlab
    :header2: MATLAB code
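Since the question also allows Python, a minimal sketch of the 10,000-throw simulation, using only the standard library (the seed is an arbitrary choice, made only so the run is repeatable), might look like:

```python
import random

random.seed(1)  # arbitrary seed, only for reproducibility

# Simulate 10,000 throws of a fair 12-sided die
throws = [random.randint(1, 12) for _ in range(10000)]

n = len(throws)
mean = sum(throws) / n
variance = sum((x - mean) ** 2 for x in throws) / (n - 1)

print(mean)      # should be close to the theoretical 6.5
print(variance)  # should be close to the theoretical 11.9167
```

Your exact numbers will differ with a different seed, but both values should land close to the theoretical ones.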

4. Repeating the above simulation 10 times (i.e., 10 independent experiments) produces 10 different estimates of :math:`\mu` and :math:`\sigma^2`. Note: everyone's answer should be slightly different, and different each time you run the simulation.

.. twocolumncode::
    :code1: ../che4c3/Assignments/Assignment-2/code/q1d.R
    :language1: s
    :header1: R code
    :code2: ../che4c3/Assignments/Assignment-2/code/q1d.m
    :language2: matlab
    :header2: MATLAB code
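A Python sketch of the repeated experiment (the function name and seed are my own choices) could be:

```python
import random
import statistics

random.seed(1)  # arbitrary seed, only for reproducibility

def mean_of_throws(n=10000):
    """Average of n throws of a fair 12-sided die."""
    return sum(random.randint(1, 12) for _ in range(n)) / n

# 10 independent repeats of the 10,000-throw experiment
means = [mean_of_throws() for _ in range(10)]

print(statistics.mean(means))      # close to mu = 6.5
print(statistics.variance(means))  # close to sigma^2/n = 11.9167/10000
```

The spread of the 10 means is what the next paragraph quantifies: each mean is itself a random variable with variance :math:`\sigma^2/n`.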

Note that each :math:`\overline{x} \sim \mathcal{N}\left(\mu, \sigma^2/n \right)`, where :math:`n = 10000`. We know what :math:`\sigma^2` is in this case: it is our theoretical value of **11.92**, calculated earlier, and for :math:`n=10000` samples, our :math:`\overline{x} \sim \mathcal{N}\left(6.5, 0.00119167\right)`.

Calculating the average of those 10 means, let's call that :math:`\overline{\overline{x}}`, shows values around 6.5, the theoretical mean.

Calculating the variance of those 10 means shows numbers that are around 0.00119167, as expected.

Question 2 [1.5]
================

In the class last week I mentioned an example of independence. I said that if I take the grade for each question in an exam for a student, calculate the grade per question, then the average of those :math:`N` grades will be normally distributed, even if the grades on the individual questions are not. For example: if there are 10 questions, and your grades for each question were 100%, 95%, 26%, 78%, ... 87%, then your average will be as if it came from a normal distribution.

1. This example was faulty: what was wrong with my reasoning?
2. 600-level students: However, when I look at the average grades from any exam, without fail they are always normally distributed. What's going on here?

Solution
--------


1. Unfortunately, I chose my example in class too hastily, without thinking about the details. The grades for every student are not independent, because a student (as long as they are not receiving external help) will likely do well in all questions, or poorly in all questions, or well only in the section(s) they have studied. So each student's grades for the individual questions will be related.

2. **600-level** students: The central limit theorem tells us that for samples from *any distribution with finite variance* (each question in the exam has a different distribution, but each has finite variance), the average of those values (the average grade of each student) will be normally distributed, as long as the samples were taken independently (which we do not have for the grades example).

So we are only breaking the independence assumption of the central limit theorem. That means we should take a look at why we've assumed independence between two sampled values.

To do this, first let's look at the case when we do have independence, and for simplicity, let's assume every question in the exam also had a normal distribution with the same mean, :math:`\mu` and the same variance, :math:`\sigma^2` (really restrictive, but you will see why in a minute). We know that this case leads to:

.. math::

\overline{x}_j \sim \mathcal{N}(\mu, \sigma^2/N)

which says the average grade for student :math:`j`, call it :math:`\overline{x}_j`, comes from a normal distribution with that mean :math:`\mu`, and with variance of :math:`\sigma^2/N`, where :math:`N` is the total number of questions. This is the usual formula we have seen in class; but where did this formula come from? Recall that:

.. math::

\overline{x}_j = \frac{1}{N}x_{j,1} + \frac{1}{N}x_{j,2} + \ldots + \frac{1}{N}x_{j,N}

where each student, :math:`j`, obtained a grade for question, :math:`1, 2, \ldots, n, \ldots N`. Let's call that grade :math:`x_{j,n}`, and recall that we have assumed :math:`x_{j,n} \sim \mathcal{N}(\mu, \sigma^2)`. The mean and standard deviation of :math:`\overline{x}_j`, *crucially assuming independence* between each :math:`x_{j,n}` value, can then be found from:

.. math::

\mathcal{E}(\overline{x}_j) &= \mathcal{E}\left(\frac{1}{N}x_{j,1} + \frac{1}{N}x_{j,2} + \ldots + \frac{1}{N}x_{j,N} \right) \\
&= \frac{1}{N}\mathcal{E}(x_{j,1}) + \frac{1}{N}\mathcal{E}(x_{j,2}) + \ldots + \frac{1}{N}\mathcal{E}(x_{j,N}) \\
&= \frac{1}{N}\mu + \frac{1}{N}\mu + \ldots + \frac{1}{N}\mu \\
&= \mu \qquad\text{(this is expected)}\\
\mathcal{V}(\overline{x}_j) &= \mathcal{V}\left(\frac{1}{N}x_{j,1} + \frac{1}{N}x_{j,2} + \ldots + \frac{1}{N}x_{j,N} \right) \\
&= \frac{1}{N^2}\mathcal{V}(x_{j,1}) + \frac{1}{N^2}\mathcal{V}(x_{j,2}) + \ldots + \frac{1}{N^2}\mathcal{V}(x_{j,N}) \qquad\text{(this is why we require independence)}\\
&= \frac{N}{N^2}\sigma^2 \\
&= \frac{\sigma^2}{N}

This also explains where the :math:`\sigma^2/N` term, used in the :math:`t`-distribution, comes from. The above derivation relies on two properties you should be familiar with (see a good stats textbook, e.g. Box, Hunter and Hunter):

.. math::

\mathcal{V}(x + y) &= \mathcal{V}(x) + \mathcal{V}(y) + 2 \text{Cov}(x, y)\\
&= \mathcal{V}(x) + \mathcal{V}(y) + 2\mathcal{E}\big[(x - \mathcal{E}[x])(y - \mathcal{E}[y])\big]\\
\mathcal{V}(ax) &= a^2\mathcal{V}(x)


and independence implies that :math:`\text{Cov}(x, y) = 0`.
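These two properties are easy to verify numerically. The sketch below checks both identities on a small made-up data set (the numbers are arbitrary), using the divide-by-:math:`N` population forms so the identities hold exactly:

```python
# Numerical check of the two variance properties on a small data set.
# Population (divide-by-N) forms are used so the identities hold exactly.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 3.0]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((vi - m) ** 2 for vi in v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

# V(x + y) = V(x) + V(y) + 2 Cov(x, y)
lhs = var([xi + yi for xi, yi in zip(x, y)])
rhs = var(x) + var(y) + 2 * cov(x, y)
print(lhs, rhs)  # the two values agree

# V(a x) = a^2 V(x)
a = 3.0
print(var([a * xi for xi in x]), a**2 * var(x))  # the two values agree
```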


So relaxing our assumption of independent :math:`x_{j,n}` values shows that we cannot combine the variances in an easy way, but we do see that the correct variance will be a larger number (if the student grades within an exam are positively correlated - the usual case), or a smaller number (if the grades are negatively correlated within each student's exam).
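A quick simulation illustrates this inflation. In the sketch below (all numbers are invented for illustration), each student's question grades share a common "ability" term, which induces positive correlation between questions; the variance of the student averages then exceeds :math:`\sigma^2/N`:

```python
import random
import statistics

random.seed(1)  # arbitrary seed, only for reproducibility

N_questions = 10
N_students = 5000

def student_average(correlated):
    # Each simulated student's "ability" shifts all of their question
    # grades together, creating positive correlation between questions.
    ability = random.gauss(0, 1) if correlated else 0.0
    grades = [70 + 10 * ability + random.gauss(0, 10)
              for _ in range(N_questions)]
    return sum(grades) / N_questions

indep = [student_average(False) for _ in range(N_students)]
corr = [student_average(True) for _ in range(N_students)]

# With independence: V(xbar) = sigma^2/N = 100/10 = 10
print(statistics.variance(indep))  # close to 10
# With positive correlation: V(xbar) is much larger (about 110 here)
print(statistics.variance(corr))
```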

Also, relaxing the assumption that each question has the same variance, we just replace :math:`\sigma^2` with :math:`\sigma^2_n` in the formula for :math:`\mathcal{V}(\overline{x})`. Relaxing the assumption of equal means, :math:`\mu`, for each question requires we use :math:`\mu_n` instead of :math:`\mu`. Note that :math:`\sigma^2_n` and :math:`\mu_n` can come from *any distribution*, not just the normal distribution.

But the central limit theorem tells us the average grade for a student, :math:`\overline{x}_j`, will be as if it came from a normal distribution. However, because we do not have independence, and we don't know the individual :math:`\sigma^2_n` and :math:`\mu_n` values for each question, we cannot *estimate the parameters* of that normal distribution.

So, to conclude: it is correct that the average grades from the exam for every student will be as if they came from a normal distribution, only we can't calculate (estimate) that distribution's parameters. I always find my course grades to be normally distributed when examining the qq-plot.


</rst>