# Univariate data analysis (2013)

Class date(s): 15 to 29 January 2013
(PDF) Course slides

## Software source code

Please follow the software tutorial to install and run the course software. Here is the example used in class:

```
# Read data from a web address
```

Code used to illustrate how the q-q plot is constructed:

```
N <- 10

# What are the quantiles from the theoretical normal distribution?
index <- seq(1, N)
P <- (index - 0.5) / N
theoretical.quantity <- qnorm(P)

# Our sampled data:
yields <- c(86.2, 85.7, 71.9, 95.3, 77.1, 71.4, 68.9, 78.9, 86.9, 78.4)
mean.yield <- mean(yields)       # 80.07
sd.yield <- sd(yields)           # 8.35

# What are the quantiles for the sampled data?
yields.z <- (yields - mean.yield)/sd.yield
yields.z

yields.z.sorted <- sort(yields.z)

# Compare the values in text:
yields.z.sorted
theoretical.quantity

# Compare them graphically:
plot(theoretical.quantity, yields.z.sorted, asp=1)
abline(a=0, b=1)

# Built-in R function to do all the above for you:
qqnorm(yields)
qqline(yields)

# A better function: see http://learnche.mcmaster.ca/4C3/Software_tutorial/Extending_R_with_packages
library(car)
qqPlot(yields)
```
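One detail is worth knowing when comparing the manual construction with `qqnorm`: the built-in function takes its plotting positions from `ppoints`, which for ten or fewer points uses (i - 3/8)/(n + 1/4) rather than (i - 0.5)/n, and it plots the raw values instead of z-scores. The sketch below reuses the `yields` data to make the difference visible; the names `manual.q`, `builtin.q` and `qq` are introduced here only for illustration.

```r
yields <- c(86.2, 85.7, 71.9, 95.3, 77.1, 71.4, 68.9, 78.9, 86.9, 78.4)
N <- length(yields)

# Theoretical quantiles from the manual construction: P = (i - 0.5)/N
manual.q <- qnorm((seq(1, N) - 0.5) / N)

# Theoretical quantiles as qqnorm computes them: ppoints() uses a = 3/8 when N <= 10
builtin.q <- qnorm(ppoints(N))

# qqnorm can return its coordinates without drawing anything
qq <- qqnorm(yields, plot.it = FALSE)

# Once sorted, the x-coordinates match the ppoints-based quantiles
all.equal(sort(qq$x), builtin.q)

# Side-by-side comparison: the two sets differ slightly, mostly in the tails
round(cbind(manual.q, builtin.q), 3)
```

The two conventions give nearly the same picture, so either is fine for judging whether the points fall on a straight line.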

Code used to illustrate the central limit theorem's reduction in variance:

```
# Show the 3 plots side by side
layout(matrix(c(1,2,3), 1, 3))

# Sample the population:
N <- 100
x <- rnorm(N, mean=80, sd=5)
mean(x)
sd(x)

# Plot the raw data
x.range <- range(x)
plot(x, ylim=x.range, main='Raw data')

# Subgroups of 2
subsize <- 2
x.2 <- numeric(N/subsize)
for (i in 1:(N/subsize)) {
    x.2[i] <- mean(x[((i-1)*subsize+1):(i*subsize)])
}
plot(x.2, ylim=x.range, main='Subgroups of 2')

# Subgroups of 4
subsize <- 4
x.4 <- numeric(N/subsize)
for (i in 1:(N/subsize)) {
    x.4[i] <- mean(x[((i-1)*subsize+1):(i*subsize)])
}
plot(x.4, ylim=x.range, main='Subgroups of 4')
```
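The subgroup averaging can also be written without explicit indexing, and the spread of the averages can be checked against σ/√n, the standard deviation that theory predicts for the average of n independent values. This is a minimal sketch: the helper name `subgroup.means` is made up for this illustration, the seed is arbitrary, and a larger sample than the one used in class is assumed so that the estimated standard deviations are stable.

```r
# Hypothetical helper: average consecutive, non-overlapping subgroups of x.
# matrix() fills column by column, so each column holds one subgroup.
subgroup.means <- function(x, subsize) {
  n.groups <- length(x) %/% subsize
  colMeans(matrix(x[1:(n.groups * subsize)], nrow = subsize))
}

set.seed(42)   # arbitrary seed, only to make the run repeatable
N <- 10000
x <- rnorm(N, mean = 80, sd = 5)

# Compare the observed spread of the averages with sigma/sqrt(subsize)
for (subsize in c(1, 2, 4)) {
  observed <- sd(subgroup.means(x, subsize))
  expected <- 5 / sqrt(subsize)
  cat("subsize =", subsize,
      " observed sd =", round(observed, 2),
      " expected sd =", round(expected, 2), "\n")
}
```

Averaging in groups of 4 should roughly halve the standard deviation, which is the narrowing seen in the third panel of the plot.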

Code to show how to superimpose plots:

```
data <- read.csv('http://openmv.net/file/raw-material-properties.csv')
summary(data)

# Single plot
plot(data$density1)

# Connect the dots
plot(data$density1, type='b')

# Another variable
plot(data$density2, type='b', col="red")

# Superimpose them?
plot(data$density1, type='b', col="blue")
lines(data$density2, type='b', col="red")  # where's density2?

# Superimpose them: set the y-axis limits so both series fit
plot(data$density1, type='b', col="blue", ylim=c(10, 45))
lines(data$density2, type='b', col="red")  # now density2 shows up
```
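Rather than hard-coding `ylim=c(10, 45)`, the limits can be computed from the data so that both series are guaranteed to fit. The sketch below uses two short made-up vectors, `series.a` and `series.b` (one with a missing value), in place of the `density1` and `density2` columns so that it runs without a network connection; the values are illustrative and are not taken from the course data set.

```r
# Made-up stand-ins for data$density1 and data$density2
series.a <- c(12.1, 13.4, NA, 11.8, 12.9, 13.0)
series.b <- c(40.2, 41.7, 39.5, 42.3, 40.9, 41.1)

# Limits that cover both series; na.rm=TRUE skips the missing value
y.limits <- range(series.a, series.b, na.rm = TRUE)
y.limits

# The first call sets up the axes, the second adds to the same plot
plot(series.a, type = 'b', col = "blue", ylim = y.limits, ylab = "value")
lines(series.b, type = 'b', col = "red")
legend("right", legend = c("series.a", "series.b"),
       col = c("blue", "red"), lty = 1, pch = 1)
```

Because `plot` fixes the axes and `lines` only draws inside them, the limits have to be known before the first call; computing them with `range` avoids guessing.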

Code to show how to deal with missing values:

```
data <- read.csv('http://openmv.net/file/raw-material-properties.csv')
summary(data)  # notice the NAs in the columns: these mark missing values (Not Available)

sd(data$density1)  # why NA as the answer?
help(sd)
sd(data$density1, na.rm=TRUE)  # no NA answer anymore!