Assignment 4 - 2014

From Statistics for Engineering

Latest revision as of 05:49, 20 September 2018

Due date(s): 27 February 2014, in class
(PDF) Assignment questions

<rst>
<rst-options: 'toc' = False/>
<rst-options: 'reset-figures' = False/>
.. note:: I strongly recommend you submit this `assignment electronically <Electronic_submissions_-_2014>`_ (see instructions on the course website), so that you can practice using the system for the course project.

.. question::
   :grading: 8

The `Paper basis dataset <http://openmv.net/info/paper-basis-weight>`_ contains data, sampled 30 seconds apart, of the basis weight (a measure of paper density) from an industrial source.

Show the autocorrelation plot for the data, and interpret the plot.

600-level students: also confirm that your interpretation is correct by sub-sampling the vector and repeating the autocorrelation test. *Hint*: use the ``seq(start_from, end_at, step_size)`` command in R to subsample a vector.


.. answer::
   :fullinclude: no

The autocorrelation plot shows significant autocorrelation up to lag 2, or even 3. This indicates that data up to 3 entries apart (:math:`3 \times 30 = 90` seconds apart) are correlated.

Subsampling the vector with every 4th or 5th element should therefore yield independent samples. The autocorrelation with every 5th observation confirms this. You could also use every 6th, 7th, *etc.* observation. Using every 30th observation, though, is not very useful, since it would lead to a long delay before the control chart showed any problems.
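
The subsampling check can be sketched as follows. This is only an illustration on synthetic data: the vector ``x`` is a simulated series standing in for the basis weight measurements, and the AR coefficient of 0.5, the 3000 samples, and the names ``x.sub``, ``acf.full`` and ``acf.sub`` are assumptions made for this example, not values from the assignment.

.. code-block:: s

    # Synthetic stand-in for the basis weight vector (NOT the real data):
    # a simulated series whose autocorrelation dies out after a few lags.
    set.seed(42)
    x <- as.numeric(arima.sim(model = list(ar = 0.5), n = 3000))

    # ACF of the full series. With plot = FALSE only the coefficients
    # are returned; leave that argument out to draw the usual ACF plot.
    acf.full <- acf(x, lag.max = 20, plot = FALSE)

    # Keep every 5th observation and repeat the autocorrelation test.
    x.sub <- x[seq(1, length(x), 5)]
    acf.sub <- acf(x.sub, lag.max = 20, plot = FALSE)

    # Element 1 of $acf is lag 0, so the lag 1 coefficient is element 2.
    print(c(full = acf.full$acf[2], subsampled = acf.sub$acf[2]))

If the subsampled series is close to independent, its lag 1 coefficient should be much smaller than that of the full series.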

.. image:: ../figures/least-squares/kappa-number-autocorrelation.png
   :align: center
   :width: 750px
   :scale: 50

The ACF plot also shows that significant correlation reappears around lags 9 to 15. You were not required to explain why for this assignment, but this pattern is usually caused by a recycle stream that re-enters a reactor, or by an oscillation in a control loop.

You can also verify the autocorrelation by plotting scatterplots of the vector against itself. The first plot below shows what an ACF coefficient of 1.0 means, while the second plot shows what it means to use a lag offset of 1 position. The correlation value = :math:`\sqrt{R^2}` is shown on each plot. Compare each value with the corresponding value on the y-axis of the ACF plots.

.. image:: ../figures/least-squares/kappa-number-autocorrelation-scatterplots.png
   :align: center
   :width: 900px
   :scale: 100
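
A minimal sketch of the numbers behind such scatterplots is given here, again on a synthetic vector ``x`` (an assumption for illustration, as are the names ``k``, ``r.lagged`` and ``r.acf``). The correlation between the vector and a lagged copy of itself is not computed in exactly the same way as the ACF coefficient, so the two values should agree closely rather than exactly.

.. code-block:: s

    # Synthetic stand-in for the measured vector (NOT the real data).
    set.seed(1)
    x <- as.numeric(arima.sim(model = list(ar = 0.5), n = 3000))
    n <- length(x)
    k <- 1    # lag offset, in number of positions

    # Correlation of the vector with itself, i.e. a lag offset of 0.
    r.zero <- cor(x, x)

    # Correlation of the vector with a copy shifted by k positions.
    # plot(x[1:(n - k)], x[(k + 1):n]) draws the matching scatterplot.
    r.lagged <- cor(x[1:(n - k)], x[(k + 1):n])

    # ACF coefficient at the same lag (element 1 of $acf is lag 0).
    r.acf <- acf(x, lag.max = k, plot = FALSE)$acf[k + 1]

    print(c(zero = r.zero, lagged = r.lagged, acf = r.acf))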

.. literalinclude:: ../figures/least-squares/kappa-number-autocorrelation.R
   :language: s
   :lines: 1-9,13-15,21-37

.. question::
   :grading: 6, for 600-level students only

Another interesting data set is the `Aeration rate <http://openmv.net/info/aeration-rate>`_ data set, which records the amount of air added to a `sparging tank <http://en.wikipedia.org/wiki/Sparging_(chemistry)>`_.

Use the autocorrelation function on this data set, show the plot, and carefully interpret what the results imply. *Hint*: you will notice there is a missing value in the data set, so use ``na.action=na.omit`` as the second input into the ``acf(...)`` function.


.. question::
   :grading: 16

This question uses two data sets. You may answer the question using either one of the data sets (your choice); however, 600-level students are expected to use both data sets and compare the results side by side (i.e. don't repeat your analysis a second time below the first; analyze both data sets simultaneously and make comparisons between them). Even 400-level students are encouraged to examine both data sets. Your answer may not exceed 4 pages.

#. Data from `CHEM ENG 4M3, 2013 class <http://openmv.net/info/unlimited-time-test-2>`_
#. Data from `CHEM ENG 4N4, 2013 class <http://openmv.net/info/unlimited-time-test-3>`_

The data record how long each student took to write a midterm test, together with the grade the student achieved on that test. One column in the data is labelled ``Grade`` [percentage] and the other ``Time`` [minutes].
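
As a rough sketch of how two such columns might be explored in R, the code below builds a synthetic data frame with ``Grade`` and ``Time`` columns. The values are simulated and are not the class results. The name ``tests`` and the choice of ``Time`` as the model input are assumptions made purely for illustration, not a recommended answer.

.. code-block:: s

    # Synthetic stand-in for the test data (NOT the real class results).
    set.seed(2013)
    Time <- runif(60, min = 60, max = 300)          # minutes
    Grade <- 50 + 0.08 * Time + rnorm(60, sd = 8)   # percentage
    tests <- data.frame(Time = Time, Grade = Grade)

    # Scatterplot with a lowess smoother to show the general trend.
    plot(tests$Time, tests$Grade,
         xlab = "Time [minutes]", ylab = "Grade [percentage]")
    smooth <- lowess(tests$Time, tests$Grade)
    lines(smooth, col = "blue")

    # One possible linear model; choosing the input variable is
    # itself part of the investigation.
    model <- lm(Grade ~ Time, data = tests)
    print(summary(model))

    # Residual plots that are commonly used to check model assumptions.
    qqnorm(resid(model))
    qqline(resid(model))
    plot(fitted(model), resid(model),
         xlab = "Fitted values", ylab = "Residuals")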

This question is of an exploratory nature. You may consider the ideas below, but also feel free to add to these:

* Explore the data: is there a relationship between the grade and time taken to write the test? (Consider learning about and using the ``lowess`` function in R.) How would you describe the relationship?
* Should a regression model use ``Time`` or ``Grade`` as the input variable?
* Build a suitable regression model using these two variables. What conclusions do you draw from the model?
* Investigate whether the assumptions for regression models hold true.
* What advice would you give to students based on these results?
* What result(s) do you learn from these data that is(are) useful for course instructors to know?

</rst>