Difference between revisions of "Software tutorial/Calculating statistics from a data sample"
Kevin Dunn (talk | contribs)
Latest revision as of 09:41, 13 January 2016
<rst>
<rst-options: 'toc' = False/>
<rst-options: 'reset-figures' = False/>
Load a data set, for example the `Website traffic <http://openmv.net/info/website-traffic>`_ data:

.. code-block:: s

    # Over the internet
    website <- read.csv('http://openmv.net/file/website-traffic.csv')

    # or from your hard drive
    website <- read.csv('C:/StatsCourse/Data/website-traffic.csv')

    # Take a quick look at the data to make sure it's what we expect ...
    summary(website)
         DayOfWeek      MonthDay        Year          Visits
     Friday   :30   August 1 :  1   Min.   :2009   Min.   : 3.00
     Monday   :31   August 10:  1   1st Qu.:2009   1st Qu.:16.25
     Saturday :30   August 11:  1   Median :2009   Median :22.00
     Sunday   :30   August 12:  1   Mean   :2009   Mean   :22.23
     Thursday :31   August 13:  1   3rd Qu.:2009   3rd Qu.:27.75
     Tuesday  :31   August 14:  1   Max.   :2009   Max.   :48.00
     Wednesday:31   (Other)  :208

    # Calculate the mean of the "Visits" column:
    visits <- website$Visits
    visits.mean <- mean(visits)
    visits.mean
    [1] 22.23364

    # The standard deviation: use sd(...)
    visits.sd <- sd(visits)
    visits.sd
    [1] 8.331826

    # How do the robust equivalents compare?
    visits.median <- median(visits)
    visits.mad <- mad(visits)
    c(visits.median, visits.mad)
    [1] 22.0000  8.8956

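
The median and MAD are called robust because a few unusual values barely move them, while the mean and standard deviation can change a lot. A minimal sketch of that behaviour, using a small made-up sample rather than the website data (the vectors ``clean`` and ``contaminated`` and the helper ``center.spread`` are hypothetical names introduced only for this illustration):

.. code-block:: s

    # A small made-up sample, and the same sample with one wildly wrong entry
    clean <- c(18, 20, 21, 22, 23, 24, 26)
    contaminated <- c(clean, 400)

    # Hypothetical helper: the four estimates used above, returned side by side
    center.spread <- function(x) {
        c(mean = mean(x), sd = sd(x), median = median(x), mad = mad(x))
    }

    # One row per sample, so the columns can be compared directly
    rbind(clean = center.spread(clean),
          contaminated = center.spread(contaminated))

In the printed table the ``mean`` and ``sd`` columns differ greatly between the two rows, while the ``median`` and ``mad`` columns are nearly unchanged.
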
You can use these additional R commands to compute other summaries of interest for a sequence of data:

.. code-block:: s

    # The sum
    sum(visits)
    [1] 4758

    # The minimum and maximum
    c(min(visits), max(visits))
    [1]  3 48

    # Or just use the range(...) command to get the same result
    range(visits)
    [1]  3 48

    # The summary(...) command we saw earlier gives all this, as well as the
    # 1st and 3rd quartiles. Here's another way to summarize a variable:
    quantile(visits)
       0%   25%   50%   75%  100%
     3.00 16.25 22.00 27.75 48.00

    # It gives the sample quantiles at probabilities 0, 0.25, 0.50, 0.75 and 1.00.
    # If you want to specify your own probability:
    quantile(visits, 0.32)
    32%
     18

    # So about 32% of the observations in this data set recorded 18 or
    # fewer visits to the website.
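
    # A sketch for checking that reading: count the fraction of observations
    # at or below a given value. (frac.at.or.below is a hypothetical helper
    # written for this illustration; it is not an R built-in.)
    frac.at.or.below <- function(x, value) mean(x <= value)
    # Usage: frac.at.or.below(visits, 18) should come out near 0.32;
    # several days with exactly 18 visits can push it somewhat higher.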

    # Recall the interquartile range is the distance from the 3rd to the 1st quartile:
    visits.iqr <- quantile(visits, 0.75) - quantile(visits, 0.25)    # 11.5

    # or, you can calculate it more directly using the IQR(...) function:
    visits.iqr <- IQR(visits)    # 11.5

    # Type help(IQR) to see how to compare the IQR to the 2 other measures of spread:
    # the standard deviation and the median absolute deviation (MAD)
</rst>