Software tutorial/Basic data manipulation in R

From Statistics for Engineering
← Reading data into R (previous step) Tutorial index Next step: Basic plots in R →

Continuing the previous example: when you loaded the website data you saw there were 4 columns (DayOfWeek, MonthDay, Year, Visits) and 214 rows. You can get this information more quickly:

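# In R 4.0 and later, stringsAsFactors = TRUE is needed to reproduce the factor-based output shown on this page
website <- read.csv('http://openmv.net/file/website-traffic.csv', stringsAsFactors = TRUE)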
ncol(website)
[1] 4
nrow(website)
[1] 214
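If you want both numbers at once, base R's dim function returns the row and column counts together, and str prints a compact overview of every column. The sketch below uses a small made-up data frame called toy (an assumption for illustration, not part of the website data) so that it runs without downloading anything:

```r
# Hypothetical stand-in for the website data; the values are for illustration only
toy <- data.frame(
  DayOfWeek = c("Monday", "Tuesday", "Wednesday"),
  Visits    = c(27, 31, 38)
)

dims <- dim(toy)   # number of rows first, then number of columns
dims
names(toy)         # the column names
str(toy)           # type and first few values of each column
```

The same three calls should work unchanged on website.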

To get a summary of each column in the data frame (R's term for a table of data in which each column is a named variable):

summary(website)

    DayOfWeek        MonthDay        Year          Visits
Friday   :30    August 1 :  1   Min.   :2009   Min.   : 3.00
Monday   :31    August 10:  1   1st Qu.:2009   1st Qu.:16.25
Saturday :30    August 11:  1   Median :2009   Median :22.00
Sunday   :30    August 12:  1   Mean   :2009   Mean   :22.23
Thursday :31    August 13:  1   3rd Qu.:2009   3rd Qu.:27.75
Tuesday  :31    August 14:  1   Max.   :2009   Max.   :48.00
Wednesday:31   (Other)   :208

Compare the summary printout above with the actual data and make sure you understand what every line means.

Let's say you are interested only in one column from the data, e.g. Visits. You can access just that column by using the $ symbol. This next code snippet shows how to calculate a summary just for the Visits variable:

summary(website$Visits)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.00   16.25   22.00   22.23   27.75   48.00
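Each number in that summary can also be computed on its own with a dedicated base R function. As a minimal sketch, the vector first.visits below holds only the first 12 values of the Visits column (copied by hand from the printout of that column), so its results differ from the full-column summary:

```r
# First 12 values of the Visits column, typed in by hand for illustration
first.visits <- c(27, 31, 38, 38, 31, 24, 21, 29, 30, 22, 24, 17)

lowest  <- min(first.visits)      # smallest value
highest <- max(first.visits)      # largest value
middle  <- median(first.visits)   # middle value of the sorted data
average <- mean(first.visits)     # arithmetic mean
quartiles <- quantile(first.visits, c(0.25, 0.75))   # 1st and 3rd quartiles
```

On the real data you would pass website$Visits to these functions instead of first.visits.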

Another way to access all the data from the Visits column (column 4 in the table) is:

web.visits <- website[,4]

You can read this command as "give me all rows of the website data set, but only the values in column 4". Leaving the position before the comma empty selects every row.
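Counting columns by hand is error-prone, so it can be safer to name the column inside the square brackets instead. A short sketch on a made-up data frame called toy (hypothetical values with the same four column names as the website data):

```r
# Hypothetical stand-in for the website data; the values are for illustration only
toy <- data.frame(
  DayOfWeek = c("Monday", "Tuesday", "Wednesday"),
  MonthDay  = c("June 1", "June 2", "June 3"),
  Year      = c(2009, 2009, 2009),
  Visits    = c(27, 31, 38)
)

by.position <- toy[, 4]          # column selected by its position
by.name     <- toy[, "Visits"]   # the same column selected by its name
identical(by.position, by.name)  # both approaches give the same vector
```

Selecting by name keeps working even if the order of the columns changes later.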

Take a look at this new variable (note that R variable names may contain periods):

web.visits
  [1] 27 31 38 38 31 24 21 29 30 22 24 17  7 13 20 17 11 19 15  3 12 25
 [23] 17 24 30 22 15 14 29 10 19 34 12  5 14 26  8 16 11 10 12 11 14 23
 [45] 30 19 21 14 18 27 26 27 23 16  5 18 29 35 22 22 10  7 12 23 38 43
 [67] 26 19 18 10 19 19 38 22 25 18 24 21 28 30 21 26 11 12 20 21 23 25
 [89] 19 14 17 21 38 27 21 18 19 20 18 26 28 30 28 29 16 30 23 24 44 28
[111] 20 20 16 22 31 31 30 30 29 27 37 35 22 28 23 48 46 35 40 22 26 14
[133] 19 26 25 21 29 34 15 16 19 29 32 25 24 17 23 42 28 23 27 26 22 15
[155] 32 22 29 25 15 18 28 27 35 26 26 20 22 13 22 25 29 20 12 14 13 38
[177] 35 25 24 17 22 21 32 26 30 21 27 13 14 21 19 30 16 20  8 10 13 31
[199] 24 18 17  7 13 22 22 22 13 10 12 15 24 18 10  7

What if we want to access the number in the first row and fourth column of website?

website[1, 4]
[1] 27

Or in the second row and first column?

website[2, 1]
[1] Tuesday
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
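The same row-and-column notation accepts several indices at once. The following sketch, again on a made-up data frame called toy rather than the real data, picks out a block of rows and columns and then one complete row:

```r
# Hypothetical stand-in for the website data; the values are for illustration only
toy <- data.frame(
  DayOfWeek = c("Monday", "Tuesday", "Wednesday"),
  MonthDay  = c("June 1", "June 2", "June 3"),
  Year      = c(2009, 2009, 2009),
  Visits    = c(27, 31, 38)
)

first.two <- toy[1:2, c(1, 4)]   # rows 1 and 2, columns 1 and 4
first.two
one.row <- toy[3, ]              # leaving the column position empty keeps every column
one.row
```

Here 1:2 is shorthand for the sequence 1, 2 and c(1, 4) combines the two column numbers into one vector.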

Now let's say you want all rows from website where the column value for DayOfWeek is Monday.

We do this in two steps. First, we introduce the "==" operator, which tests "is equal to" for every entry in the column:

website$DayOfWeek == "Monday"

  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [25] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [85]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[109] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[121] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[133] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
[145] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[157] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[169]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[181] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[193] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[205] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

It returns a logical (TRUE/FALSE) vector, with TRUE wherever the condition is met. Now we can use this vector to select all rows where the condition holds:

Mondays.rows <- website[website$DayOfWeek == "Monday", ]
Mondays.rows

    DayOfWeek      MonthDay Year Visits
1      Monday        June 1 2009     27
8      Monday        June 8 2009     29
15     Monday       June 15 2009     20
...
204    Monday   December 21 2009     22
211    Monday   December 28 2009     24
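The logical vector produced by a comparison is useful on its own as well. As a rough sketch, using a short made-up vector called days in place of website$DayOfWeek, you can count the matches with sum and find their positions with which:

```r
# Hypothetical stand-in for the DayOfWeek column
days <- c("Monday", "Tuesday", "Wednesday", "Monday", "Friday")

is.monday   <- days == "Monday"   # one TRUE or FALSE per entry
n.mondays   <- sum(is.monday)     # TRUE counts as 1 and FALSE as 0
monday.rows <- which(is.monday)   # positions of the TRUE entries
```

On the real data, which(website$DayOfWeek == "Monday") should return the row numbers printed in the left margin of the Mondays.rows table.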

The above command gives you all the data recorded on Mondays. What if you want to narrow that down further and keep only the number of visits on each Monday? Then ask for column 4 only:

Mondays.visits <- website[website$DayOfWeek == "Monday", 4]

Mondays.visits
[1] 27 29 20 25 29 26 14 27 29 23 19 21 20 21 18 30 16 27 46 26 19 42 32 27 22 38 32 21 13 22 24
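Once the filtered values are in a vector, they can be passed straight to functions such as mean. The sketch below works on a made-up data frame called toy (hypothetical values), selects the column by name rather than by number, and also shows base R's subset function as one possible alternative way to write the same filter:

```r
# Hypothetical stand-in for the website data; the values are for illustration only
toy <- data.frame(
  DayOfWeek = c("Monday", "Tuesday", "Monday", "Friday"),
  Visits    = c(27, 31, 29, 24)
)

# Row filter and column name in a single step
monday.visits <- toy[toy$DayOfWeek == "Monday", "Visits"]
monday.mean   <- mean(monday.visits)   # average of 27 and 29

# subset() expresses the same filter; it returns a data frame with one column
via.subset <- subset(toy, DayOfWeek == "Monday", select = Visits)
```

Note that the bracket form returns a plain vector here, while subset keeps the result as a data frame.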
