Software tutorial/Calculating statistics from a data sample

From Statistics for Engineering
← Dealing with factors (categorical variables) (previous step) Tutorial index Next step: Dealing with distributions →


Load a data set, for example the Website traffic data:

# Over the internet
website <- read.csv('http://openmv.net/file/website-traffic.csv')

# or from your hard drive
website <- read.csv('C:/StatsCourse/Data/website-traffic.csv')


# Take a quick look at the data to make sure it's what we expect ...
summary(website)
     DayOfWeek        MonthDay        Year          Visits
 Friday   :30    August 1 :  1   Min.   :2009   Min.   : 3.00
 Monday   :31    August 10:  1   1st Qu.:2009   1st Qu.:16.25
 Saturday :30    August 11:  1   Median :2009   Median :22.00
 Sunday   :30    August 12:  1   Mean   :2009   Mean   :22.23
 Thursday :31    August 13:  1   3rd Qu.:2009   3rd Qu.:27.75
 Tuesday  :31    August 14:  1   Max.   :2009   Max.   :48.00
 Wednesday:31   (Other)   :208

# Calculate the mean of the "Visits" column:
visits <- website$Visits
visits.mean <- mean(visits)
visits.mean
[1] 22.23364

# The standard deviation: use sd(...)
visits.sd <- sd(visits)
visits.sd
[1] 8.331826
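
To see what mean(...) and sd(...) compute, the two statistics can be rebuilt from their definitions. The sketch below uses a small made-up vector x (an assumption for illustration, not the website data) and checks the hand calculation against the built-in functions. Note that sd(...) divides by n - 1, not n.

```r
# Hypothetical sample, used only for illustration (not the website data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)

# Mean: the sum of the values divided by the number of values
x.mean <- sum(x) / n

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by (n - 1)
x.sd <- sqrt(sum((x - x.mean)^2) / (n - 1))

# Each pair should show the same number twice
c(x.mean, mean(x))
c(x.sd, sd(x))
```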

# How do the robust equivalents compare?
visits.median <- median(visits)
visits.mad <- mad(visits)
c(visits.median, visits.mad)
[1] 22.0000  8.8956
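
The following sketch shows why the median and MAD are called robust. It uses a small made-up vector x and one artificial outlier (both are assumptions for illustration, not the website data). By default, mad(...) scales the median of the absolute deviations from the median by the constant 1.4826, so that it is comparable to the standard deviation for normally distributed data.

```r
# Hypothetical sample, used only for illustration (not the website data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# MAD by hand: median absolute deviation from the median,
# scaled by the default constant used by mad(...)
x.mad <- 1.4826 * median(abs(x - median(x)))
c(x.mad, mad(x))        # the two values should agree

# Add a single artificial outlier and recompute everything
x.out <- c(x, 900)

# Location: the mean moves a lot, the median barely changes
c(mean(x), mean(x.out))
c(median(x), median(x.out))

# Spread: the standard deviation inflates, the MAD stays small
c(sd(x), sd(x.out))
c(mad(x), mad(x.out))
```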

You can use these additional R commands to compute other summaries of interest for a sequence of data:

# The sum
sum(visits)
[1] 4758

# The minimum and maximum
c(min(visits), max(visits))
[1]  3 48

# Or just use the range(...) command to get the same result
range(visits)
[1]  3 48

# The summary(...) command we saw earlier gives all this, as well as the
# 1st and 3rd quartiles.  Here's another way to summarize a variable:
quantile(visits)
   0%   25%   50%   75%  100%
 3.00 16.25 22.00 27.75 48.00

# It gives the sample quantiles at probabilities 0, 0.25, 0.50, 0.75
# and 1.00.  If you want to specify your own probability:
quantile(visits, 0.32)
32%
 18

# So about 32% of the observations in this data set recorded 18 or
# fewer visits to the website.
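
The interpretation of a quantile can be checked by counting the fraction of observations at or below it. The sketch below uses a small made-up vector x (an assumption for illustration, not the website data). With only a few observations the fraction will not match the requested probability exactly, because quantile(...) interpolates between neighbouring data points by default.

```r
# Hypothetical sample, used only for illustration (not the website data)
x <- c(3, 7, 8, 12, 15, 18, 18, 21, 25, 30)

# The sample quantile at probability 0.32
q <- quantile(x, 0.32)
q

# Fraction of observations at or below that quantile:
# close to 0.32, but not exact for such a small sample
frac <- mean(x <= q)
frac

# Several probabilities can be requested in one call
quantile(x, c(0.10, 0.50, 0.90))
```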

# Recall the interquartile range is the distance from the 3rd to the 1st quartile:
visits.iqr <- quantile(visits, 0.75) - quantile(visits, 0.25)  # 11.5

# or, you can calculate it more directly using the IQR(...) function:
visits.iqr <- IQR(visits)  # 11.5

# Type help(IQR) to see how the IQR compares to the two other measures of spread:
# the standard deviation and the median absolute deviation (MAD)
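
As a rough illustration of that comparison, the sketch below simulates normally distributed data (the seed, sample size, mean and standard deviation are arbitrary assumptions, not taken from the website data). For normal data, the IQR divided by 1.349 and the MAD with its default scaling should both land near the standard deviation.

```r
# Simulated normal data; all settings here are arbitrary choices
set.seed(42)
z <- rnorm(10000, mean = 20, sd = 8)

z.sd <- sd(z)                  # classical estimate of spread
z.mad <- mad(z)                # robust, already scaled for normal data
z.iqr.scaled <- IQR(z) / 1.349 # robust, rescaled to match sd for normal data

# All three should be close to the true value of 8
c(z.sd, z.mad, z.iqr.scaled)
```

On data with outliers or heavy tails the three estimates can differ noticeably, with the standard deviation usually the largest.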
