1.5 Data Subsetting and Manipulation

Bootcamp Programming R

Data Subsetting and Manipulation (Use sumo on data)

SCRIPT 1.5 Use this to follow along on this page and for the Practice exercises below.

Overview

With a good basic set of moves for subsetting and manipulating data, you can overpower any dataset no matter how large and powerful they may be. Then, you will have strong data Sumo.

Subsetting and manipuating data is probably the commonest activity for anyone who works with data. This is a core activity for exploratory data analysis, but is also extensively used in simple data aquisition, analysis and graphing, while also being related to more general data manipulating activities, for example database queries. This page is an introduction to the core syntax and some of the tools for manipulating and subsetting data in R.

Contents

1.5.1 Indexing concept

1.5.2 Using which() and subsetting

1.5.3 Selection on data.frame objects

1.5.4 Using aggregate()

1.5.5 Practice exercises

1.5.1 Indexing concept

If you would like to slice and dice your data, you will need to learn all about indexing!

The basics of the indexing concept in R syntax is very simple, where data storage objects like vectors (1 dimension), matrices (2 dimensions) and arrays (3 or more dimensions) store individual data values that can be accessed by the “address” of the dimension(s).

How indexing works

Say you have a numeric vector called my_vector that has 10 values. The index values will be 1 to 10, with each value corresponding consecutively to the data value at that position.

> my_vector
 [1] 11.3 11.2 10.4 10.4  8.7 10.8 10.5 10.3  9.7 11.2

Notice the [1] in the R console output? This indicates the index of value right next to it and the R system will provide an index value for longer vectors as the wrap in the console. If we could see the actual index values it would look something like this:

> my_vector
 [1] 11.3 11.2 10.4 10.4  8.7 10.8 10.5 10.3  9.7 11.2
#     ^    ^    ^    ^    ^    ^    ^    ^    ^    ^
#     1    2    3    4    5    6    7    8    9    10

Vectors

You can create vector subsets by manipulating the index. Vector objects have indices in 1 dimension. For example, my_vector[1:i], where i is the length of the vector.

## Vectors ####
# Try this

my_vector <- c(11.3, 11.2, 10.4, 10.4, 8.7, 10.8, 10.5, 10.3, 9.7, 11.2)

# Return all values
my_vector        # Typical way
my_vector[ ]     # Square brackets with blank index implies all index values
my_vector[ 1:10] # Returns all index values explicitly

# Return the first 3 values
1:3 # Reminder of the function of the colon operator ":"
my_vector[ 1:3] # Notice consecutive indices can use the ":" operator

# Return 5th and 9th values
my_vector[ c(5, 9)] # Notice we have to place non-consecutive index values in the c() function

Matrices

Matrix objects have 2 dimensions denoted as my_matrix[1:i, 1:j], where i is the number of rows and j is the number of columns.

## Matrices ####
# Try this

my_matrix <- matrix(data = c(2,3,4,5,6,6,6,6),
                    nrow = 2, byrow = T)

my_matrix # notice how the arguments arranged the data

# Flash challenge: make a matrix with the same data vector above to look like...
#      [,1] [,2]
# [1,]    2    6
# [2,]    3    6
# [3,]    4    6
# [4,]    5    6

# "Slicing" out a row or column
my_matrix[1,  ] # Slice out row 1
my_matric[ , 3] # Slice out column 3

# Matrix columns and rows often have names
names(my_matrix) # No names yet

nrow(my_matrix) # Returns number of rows (useful for large matrices)
rownames(my_matrix) # No row names; 2 rows, need two names

rownames(my_matrix) <- c("dogs", "cats")
my_matrix # Now the rows have names!
rownames(my_matrix) # Get them this way too!

# Flash challenge: Name the columns of my_matrix "a", "b", "c", "d" with colnames()

my_matrix

# Should look like this:
#      a b c d
# dogs 2 3 4 5
# cats 6 6 6 6

# You can also slice out matrix portions by name
my_matrix["dogs", c("b", "d")]

# Finally, functions act on values, not the index
mean(my_matrix["dogs", c("b", "d")])

Arrays

Arrays are data objects with more than 2 dimensions (well, technically a matrix with 2 dimensions is also an array, but let’s ignore that for now). Array dimensions are denoted as my_array[1:i, 1:j, 1:k], where i is the number of rows and j the columns and k the “depth” of i * j.

my_array

## Arrays ####
# Try this

# help(runif)
# help(round)
# Try it to see what it does... 
my_vec <- round(runif(n = 27, min = 0, max = 100), 0)
my_vec # See what we did there?

length(my_vec) # Just checking

my_array <- array(data = my_vec,
                  dim = c(3, 3, 3))
my_array

# Flash challenge: 
# Specify and print the 1st and 3rd  slice of the k dimension of my_array
# Assuming my_array has dimensions i, j, k like my_array[i,j,k]

1.5.2 Using which() and subsetting

Subsetting data objects is done by exploiting the index system. We usually do this by either specifying index values explicitly (effective, but it requires that you know A LOT about the data object), or by contructing queries that choose subsect of data based on particular values. The which() function is a powerful way to construct queries.

# Try this
help(which) # Notice how the x argument is required to be a LOGICAL vector?

# Make a NUMERIC vector
vector_a <- c(3, 4, 5, 4, 3, 4, 5, 6, 6, 7)

# Use a boolean phrase to ask which elements of vector_a are greater than 5
vector_a > 5 # Interesting... it is a LOGICAL vector!

# The which function will return the index values of TRUE values
# In other words, WHICH values in vector_a are greater than 5?
which(vector_a > 5)

# What is the point of all this?
# THE POINT is to be able to use expressions to obtain indices 
# and values in data structures...

# What VALUES in vector_a are > 5?
vector_a[which(vector_a > 5)]

# This also works on vectors of other types
# Consider a character vector
char_vec <- c("wheat", "maize", "wheat", "maize", "wheat", "wheat")

# Which elements are equivalent to "wheat"?
char_vec == "wheat"
which(char_vec == "wheat")

char_vec[ which(char_vec == "wheat")] # This works
char_vec[ char_vec == "wheat"]        # Same output

# Flash challenge: Explain in your own words why 
# the previous 2 lines of code have identical output?

We are just beginning to scratch the surface of possibilties with the which() function - keep it in mind and practice it when you can.

1.5.3 Selection on data.frame objects

Data frames are the ultimate data object for getting, storing, organizing and analyzing data. A good scientist must learn to communicate the subtlety of data. A good statistician must learn not to underestimate the subtletly of data. A good student must learn that subtlety may exist, even in simple data.

There are a couple of data object types that have a special characteristic in that they store data of different types, where vectors, matrices and arrays can only store one type of data (e.g., numeric, character, logical, etc.). The special data objects that can contain multiple data types are list objects, and data frames - here we will focus on data frames. Data frames can have different vector types arranged by column but there is a constraint that each vector must be the same length, that is, each ROW is considerend an observation for each variable value (though there may be missing data coded by NA).

There are a few ways to think about selecting values in a data frame. The first is simply to access values through the variable names, which can either be done by using the data frame name with the $ operator and the variable name, or by using the [ , ] syntax with either the variable name or the column number of the variable of interest see here.

A second powerful way to access variables in a data frame is by selecting particular rows of a data frame. This may be done by selecting the rows of a data based on values of one or more variables. We will practice doing this using which(), the [ , ] syntax, and boolean phrases is the following code block.

For the following section, we will use the OrchardSprays dataset that exists as a data frame in the in-built {datasets} package. You can use help(OrchardSprays) to see the help page (the help page is characteristically terse, so some description is given here.

The OrchardSprays experiment

This is a classic dataset based on an experiment looking at how a chemical additive could be used to deter honeybees from being attracted to crops and subsequently killed by pesticides.

The experiment involved having a treatment consisting of adding a “lime sulphur emulsion” (honeybee deterrent) in increasing concentrations to a sucrose solution. The treatment variable had 8 levels including a control (no deterrent) and 7 other levels with increasing concentration of the deterrent. The treatment levels were named A (the highest amount of deterrent), B (second highest deterrent) through to G (lowest deterrent) and H (control - no deterrent) The decrease variable was a measure of the quantity of sucrose solution that was taken by honeybees (the prediction here is that higher concentrations of the deterrent should result in a lower decrease in the sucrose solution).

The experiment involved a Latin Square design, with the order of the 8 treatments arranged randomly in an array of 8 columns (the purpose of this design is to randomize any effect of the treatment ORDER or POSITION on the response variable). This resulted in an 8 row by 8 column experiment. The response was measured after placing 100 honeybees into an experimental chamber with the 64 containers of sucrose solution.

## OrchardSprays ####
## Understand the data - an important step
# Try this

# Load the OrchardSpray data using the data() function
data(OrchardSprays) # Should see OrchardSprays <promise> in the Global Env.

# Look at the data head()
head(OrchardSprays) # First 6 rows

# Look at variable types with str()
help(str) # Good function to see info about data object
str(OrchardSprays)

# First let's just look at the data
# Don't worry too much about the code for these graphs if you have not encountered it before
boxplot(decrease ~ treatment, data = OrchardSprays, 
        main = "The pattern fits the prediction",
        ylab = "Amount of sucrose consumed",
        xlab = "Lime sulpher treatment amount in decreasing order (H = control)")

# This is the experimental design
# Latin Square is kind of like Sudoku
# No treatment can be in row or column more than once
plot(x = OrchardSprays$colpos,  # NB use of $ syntax to access data
     y = OrchardSprays$rowpos, 
     pch = as.character(OrchardSprays$treatment),
     xlim = c(0,9), ylim = c(0,9),
     main = "The Latin Square design of treatments",
     xlab = "\"Column\" position",
     ylab = "\"Row\" position")

Practice selecting parts a data frame

Selecting particular parts of a data frame based on the values of one variable is a common and extremely useful task.

## Practice selecting parts a data frame ####

# Select the rows of the dataset for treatment "D"
# (Pseudocode steps to solve) 
# Break it down to make small steps easy to read
# 01 Boolean phrase to identify rows where treatment value is "D"
# 02 which() to obtain index of TRUE in boolean vector
# 03 Exploit [ , ] syntax with data frame object to slice out rows

# 01 Boolean phrase
OrchardSprays$treatment # Just print variable to compare visually to boolean
OrchardSprays$treatment == "D" # logical vector - TRUE in "D" positions

# 02 which()
which(OrchardSprays$treatment == "D") # Index of TRUE values
my_selection <- which(OrchardSprays$treatment == "D") # Place index in a variable
my_selection # Just checking

# 03 Exploit [ , ] syntax with data frame object to slice out rows
OrchardSprays[my_selection, ]

# Flash challenge: Select and print all rows at "colpos" values of 2

Selection based on more than one variable value

Using the basic building blocks of boolean selection, more complex rules for selecting data can be made.

## Compound boolean for selection ####

# Select all rows of the data frame where 
# rowpos equals 4 OR 6 AND treatment equals "A" OR "H"
# What we expect is exactly 2 values (A or H) for each powpos (4 or 6)

# rowpos 4 and 6
OrchardSprays$rowpos == 4 # The 4s
OrchardSprays$rowpos == 6 # The 6s

OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6 # All together

# now with which()
which(OrchardSprays$rowpos == 4) # The 4s
which(OrchardSprays$rowpos == 6) # The 6s

which(OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6) # All together

# treatment A and H
which(OrchardSprays$treatment == "A" | OrchardSprays$treatment == "H") # All together

# Now we need the intersection of value that are in both our which() vectors

which((OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6) &  # It works
        (OrchardSprays$treatment == "A" | OrchardSprays$treatment == "H") ) 
  
# NB this is a long way of spelling out our selection, 
# but trying to be very explicit with what is going on

my_selec2 <- which((OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6) &  
                     (OrchardSprays$treatment == "A" | OrchardSprays$treatment == "H") ) 

OrchardSprays[my_selec2, ] # Double check it works and is similar to expectation...

# Flash challenge: Calculate the mean of decrease for treatment "A" 
# and the mean of decrease for treatment "H"

1.5.4 Using aggregate()

The aggregate() function

We often may wish to summarise parts of a data set according to some index of variable values. A very convenient tool for the is the aggregate() function, which we will practice here.

help(aggregate)

# A few important things to note about how this function works:
# The "x" argument is a data object you input, but should only contain numeric values usually
# If a data.frame object is input as x, a data.frame object is the output
# The "by" argument must be a list() object and can be one or more indices
# The FUN argument is the name of the function that will act on the "x" argument data

# Let's try a few examples

## 1 calculate the mean of decrease by treatment in OrchardSprays

aggregate(x = OrchardSprays$decrease,
          # NB use of list() and naming it "treatment"
          by = list(treatment = OrchardSprays$treatment), 
          FUN = mean)

# we can "recycle" the code above to apply different functions
# standard deviation with sd()
aggregate(x = OrchardSprays$decrease,
          # NB use of list() and naming it "treatment"
          by = list(treatment = OrchardSprays$treatment), 
          FUN = sd)

# Range with range()
aggregate(x = OrchardSprays$decrease,
          # NB use of list() and naming it "treatment"
          by = list(treatment = OrchardSprays$treatment), 
          FUN = range)

# What if we want several summary statistics?

aggregate(x = OrchardSprays$decrease,
          # NB use of list() and naming it "treatment"
          by = list(treatment = OrchardSprays$treatment), 
          # NB use of function() 
          FUN = function(x) c(mean = mean(x), # Add naming
                              sd = sd(x), 
                              range = range(x)))

## Example of use of aggregate object
# Say you would like to graph a barplot of the MEAN of decrease by treatment
# and you would like to show STANDARD DEVIATION error bars

# Make data frame with summary values using aggregate()
my_mean <- aggregate(x = OrchardSprays$decrease,
                        by = list(treatment = OrchardSprays$treatment), 
                        FUN = mean)

my_sd <- aggregate(x = OrchardSprays$decrease,
                   by = list(treatment = OrchardSprays$treatment), 
                   FUN = sd)

my_mean
my_sd

# Tidy things up in a new data frame using data.frame()
# Take care of naming variables for clarity
help(data.frame) # Continue using help() as a good habit
new_data <- data.frame(treatment = my_mean$treatment,
                       mean = my_mean$x,
                       sd = my_sd$x)
new_data # Looks good

# There is a lot going on in the following code
# The point is to show what is possible

(bar_centers <- barplot(new_data$mean,#   use mean for barheight
                        ylim = c(0, 115),
                        ylab = "Mean solution decrease (+- 1 SD)",
                        xlab = "Treatment"))
# NB bar_centers holds the numerical position value of the bars...

help(arrows)  # Use to draw error bars
arrows(x0 = bar_centers, 
       x1 = bar_centers,
       y0 = new_data$mean , # start error bar at top of the bar!
       y1 = new_data$mean + new_data$sd, # end error bar here!
       angle = 90,
       length = 0.1)

# Last step: label the x axis
axis(side = 1,
     at = bar_centers,
     labels = new_data$treatment)
     
# Flash challenge: Draw a new barplot by recycling the code above
# This time, add error bars showing on both the top and the bottom of the mean values

1.5.5 Practice exercises

For the following exercises, use the trees dataset, which has Girth, Height and Volume variables for 31 Black Cherry trees.

# Examine the data
help(trees)
data(trees)
str(trees)

1 Show code to calculate the mean Girth of Black Cherry trees with Height less than 75 ft.

2 Use help(cut) and then use the cut() function to create a new factor variable based on the Height numeric variable in the trees dataset. Try setting the breaks argument to 2 or 3. Rename the levels of your new factor to something meaningful. Show the code.

3 Using the new factor from question 2, use aggregate() to calculate the mean and standard deviation of all three variables in the trees data. Show your code and report the results to 2 decimal points of accuracy.

4 Show the code using which() and boolean phrases as appropriate to find the rows in the trees dataset where Girth is higher than 11 and Height is lower than 75.

5 Run the following code:

data_1 <- data.frame(volume = c(4,5,6,5,6,7,6,5,6,8,7,3,8,7,NA,10),
           population = c("A","A","A","A","A","A","A","A",
           "B","B","B","B","B","B","B","B"))

Use aggregate() to calculate the mean of volume for each population (hint: you may need to use help for the functions involved and pay close attention to your data frame…).

6 Write a plausible practice question involving any aspect of using which(), boolean phrases and/or aggregate() involving the in-built R dataset iris.