Software tutorial/Calculating statistics from a data sample

From Statistics for Engineering
Revision as of 03:35, 15 January 2013 by Kevin Dunn


<rst>
<rst-options: 'toc' = False/>
<rst-options: 'reset-figures' = False/>
Load a data set, for example the `Website traffic <http://datasets.connectmv.com/info/website-traffic>`_ data:


.. code-block:: s

    # Over the internet
    website <- read.csv('http://datasets.connectmv.com/file/website-traffic.csv')

    # or from your hard drive
    website <- read.csv('C:/StatsCourse/Data/website-traffic.csv')

    # Take a quick look at the data to make sure it's what we expect ...
    summary(website)
         DayOfWeek      MonthDay        Year          Visits
     Friday   :30   August 1 :  1   Min.   :2009   Min.   : 3.00
     Monday   :31   August 10:  1   1st Qu.:2009   1st Qu.:16.25
     Saturday :30   August 11:  1   Median :2009   Median :22.00
     Sunday   :30   August 12:  1   Mean   :2009   Mean   :22.23
     Thursday :31   August 13:  1   3rd Qu.:2009   3rd Qu.:27.75
     Tuesday  :31   August 14:  1   Max.   :2009   Max.   :48.00
     Wednesday:31   (Other)  :208

    # Calculate the mean of the "Visits" column:
    visits <- website$Visits
    visits.mean <- mean(visits)
    visits.mean
    [1] 22.23364

    # The standard deviation: use sd(...)
    visits.sd <- sd(visits)
    visits.sd
    [1] 8.331826

    # How do the robust equivalents compare?
    visits.median <- median(visits)
    visits.mad <- mad(visits)
    c(visits.median, visits.mad)
    [1] 22.0000  8.8956

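The loading and checking steps can also be tried without a network connection or a local file. The sketch below is only an illustration: the seven rows are made-up values, not the website traffic data, and the names ``csv.text`` and ``traffic`` are hypothetical. It assumes ``stringsAsFactors = TRUE`` is wanted so that ``summary(...)`` reports counts per category, since R 4.0 and later no longer convert text columns to factors by default.

```r
# Hypothetical miniature data set: the values are made up for illustration
csv.text <- "DayOfWeek,Visits
Monday,27
Tuesday,31
Wednesday,25
Thursday,22
Friday,20
Saturday,9
Sunday,13"

# read.csv(text = ...) parses a character string instead of a file or URL.
# stringsAsFactors = TRUE makes DayOfWeek a factor (not the default in R >= 4.0).
traffic <- read.csv(text = csv.text, stringsAsFactors = TRUE)

# Check the structure and a per-column summary before computing anything
str(traffic)
summary(traffic)

# The mean is the sum of the values divided by the number of observations
visits.demo <- traffic$Visits
visits.demo.mean <- mean(visits.demo)
visits.demo.byhand <- sum(visits.demo) / length(visits.demo)
c(visits.demo.mean, visits.demo.byhand)
```

Swapping ``text = csv.text`` for a file path or URL gives the same call used for the real data set.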

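To see why the median and MAD are called robust, it can help to add a single bad value to a small sample and recompute all four statistics. This is a minimal sketch with made-up numbers (``x`` and ``x.outlier`` are hypothetical names, not part of the tutorial's data). It also spells out what ``mad(...)`` computes by default: the median of the absolute deviations from the median, scaled by 1.4826.

```r
# Made-up sample, plus the same sample with one gross error appended
x <- c(18, 20, 21, 22, 22, 23, 25, 27)
x.outlier <- c(x, 480)   # e.g. a 48 that was mistyped as 480

# Classical location and spread: both react strongly to the outlier
classical <- rbind(clean   = c(mean = mean(x),         sd = sd(x)),
                   outlier = c(mean = mean(x.outlier), sd = sd(x.outlier)))

# Robust location and spread: barely affected by the outlier
robust <- rbind(clean   = c(median = median(x),         mad = mad(x)),
                outlier = c(median = median(x.outlier), mad = mad(x.outlier)))

print(classical)
print(robust)

# mad(...) by hand, using the default scaling constant of 1.4826
mad.byhand <- 1.4826 * median(abs(x - median(x)))
c(mad(x), mad.byhand)
```

The scaling constant makes the MAD comparable to the standard deviation when the data are normally distributed, which is why the two values for the website data (8.33 and 8.90) are of similar size.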
You can use these additional R commands to compute other summaries of interest for a sequence of data:

.. code-block:: s

    # The sum
    sum(visits)
    [1] 4758

    # The minimum and maximum
    c(min(visits), max(visits))
    [1]  3 48

    # Or just use the range(...) command to get the same result
    range(visits)
    [1]  3 48

    # The summary(...) command we saw earlier gives all this, as well as the
    # 1st and 3rd quartiles. Here's another way to summarize a variable:
    quantile(visits)
       0%   25%   50%   75%  100%
     3.00 16.25 22.00 27.75 48.00

    # By default it gives the sample quantiles at probabilities 0, 0.25, 0.50,
    # 0.75 and 1.00. If you want to specify your own probability:
    quantile(visits, 0.32)
    32%
     18

    # So roughly 32% of the observations in this data set recorded 18 or
    # fewer visits to the website.

    # Recall the interquartile range is the distance from the 1st to the 3rd quartile:
    visits.iqr <- quantile(visits, 0.75) - quantile(visits, 0.25)   # 11.5

    # or, you can calculate it more directly using the IQR(...) function:
    visits.iqr <- IQR(visits)   # 11.5

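The quantile and IQR calculations can be checked on simulated data, where the true spread is known. The sketch below is an illustration under stated assumptions: ``y`` is drawn from a normal distribution with an arbitrarily chosen mean of 22 and standard deviation of 8, and the seed is arbitrary. It uses the fact, noted in R's documentation for ``IQR``, that dividing the IQR by 1.349 gives an estimate of the standard deviation for normal data.

```r
set.seed(42)                          # arbitrary seed, for reproducibility only
y <- rnorm(1000, mean = 22, sd = 8)   # simulated values, not the website data

# The IQR is the 3rd quartile minus the 1st quartile
q <- quantile(y, c(0.25, 0.75))
iqr.manual <- unname(q[2] - q[1])
c(iqr.manual, IQR(y))

# The fraction of observations at or below the 32% quantile is close to 0.32
q32 <- quantile(y, 0.32)
frac.below <- mean(y <= q32)
frac.below

# Three estimates of spread on a common scale; for normal data all three
# should land near the true standard deviation (8 in this simulation)
spread <- c(sd = sd(y), mad = mad(y), iqr.scaled = IQR(y) / 1.349)
spread
```

On data with outliers or heavy tails the three estimates drift apart, with the standard deviation usually inflated the most.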
Type ``help(IQR)`` to see how the IQR compares to the two other measures of spread: the standard deviation and the median absolute deviation (MAD).

</rst>