Data Frames (Like your room, data should be tidy)
SCRIPT 1.4 Use this to follow along on this page and for the Practice exercises below.
NB for this page we assume you have access to Microsoft Excel; however, similar spreadsheet software (like Libre Office Calc) will work fine.
Overview
The first step for getting your data into R is making your data tidy. The first step in using R for data analysis is getting your data into R.
The commonest question we have experienced for new users of R who want to perform analysis on their data is how to get data into R. There is good news and bad news. The good news is that it is exceedingly easy to get data into R for analysis, in almost any format. The bad news is that a step most new users find very difficult is taking responsibility for their own data, in terms of basic aspects of data management, like choosing appropriate variable names, error checking, providing information about variables and data collection, proliferation of excessive dataset versions, and, worst of all, embedding superfluous information into Excel spreadsheets with valuable data.
A concept to streamline data preparation for analysis is Tidy Data. The basic idea is to format data for analysis in a way that 1) archives data for reproducibility of results, and 2) makes the data transparent to colleagues or researchers by documenting a data dictionary. This page is all about the tidy data concept and a simple recipe for best practice to prepare data for analysis and to get data into R.
Contents
1.4.3 Excel, data setup, and the Data Dictionary
1.4.5 Manipulating variables in the Data Frame
1.4.1 Tidy data concept
The tidy Data concept is generally attributed to Wickham (2014), and is based on the idea that with a few simple rules, data can be archived for complete reproducibility of results. This practice benefits any user, because it facilitates collaboration at the same time as documenting both data and analysis methods for value to future use.
The essentials of Tidy Data are:
-
Each variable should be in a column
-
Each observation should be in a row
-
A Data Dictionary should be assocatied with the dataset, such that completely reproducible analysis is possible
1.4.2 Common data file types
The best file type for the majority of people to archive data for analysis is in a plain text Comma Separated Values (CSV or .csv) file, or just an Excel Spreadsheet. Proprietary data formats should be avoided, like those produced by SPSS, Genstat, Minitab or other programs. The reason for this is that data stored in those formats is not necessarily useful to people who do not have access to the software, and that for archiving purposes, such software file formats tend to change over time. While Excel is a proprietary format, we find that it is is easy to use, (almost completely) ubiquitous, and relatively resilient to backwards compatibility issues. Thus, sticking to CSV or Excel is a rule you should have a very good reason if you choose to break from it.
We recommend using Excel to store data with the goal (for simple datasets and analyses) of having one table for the actual data, in Tidy Data format, and a second tab consisting of a Data Dictionary where each variable is described in enough detail to completely reproduce any analysis. Generally, no formatting or results should ever be embedded in an Excel sreadsheet that is used to store data.
1.4.3 Excel, data setup, and the Data Dictionary
Tidy Data and Excel
For this section, you should download the following files in Excel (.xlsx) format:
The following account describes an experiment.
You are contacted by someone who wants help with data analysis and they give you some information about their experiment. They are interested in how diet affects the production of an important metabolite in pest aphids. They designed an experiment with a control treatment where aphids were allowed to feed on plain plants, another treatment where their diet was supplemented with one additive, “AD”, and a third treatment where their diet was supplemented with two additives, “AD” and “SA”. Another factor was aphid Genus, where individuals from the genera Brevicoryne and Myzus were tested. Three replicates of each treatment combination were performed (i.e. aphid genus [2 levels] X food treatment [3 levels including the control]).
The metabolite of interest was measured with a spectrometer using three individual aphids from each replicate. The spectrometer peak area (an arbitrary scale) represents the total amount of the metabolite, which was converted to a real scale of metabolite total concentration. Finally, this total concentration was divided by 3 to estimate the concentration of the metabolite in each individual aphid.
Have a look at the file 1.4-untidy.xlsx in Excel.
The dataset is fairly small and it is readable by humans, but in its current form it is not useable for analysis in R or other statistical software and there are a few ambiguous aspects which we will explore and try to improve.
Untidy
-
The file contains embedded figures and summary tables
-
There is empty white space in the file (Row 1 and Column A)
-
The variable names violate several naming conventions (spaces, special characters)
-
Missing data is coded incorrectly (Row 13 was a failed reading, but records zeros for the actual measurements)
-
Conversion information accessory to the data is present (Row 3)
-
There is no Data Dictionary (i.e. explanation of the variables)
-
The Aphid and Diet treatments are “confounded” in their coding
-
What the heck is the “RT” column (most of the values are identical)
Now, have a look at the Tidy Data version of the data file.
-
The embedded figures have been removed
-
The white space rows and columns have been removed
-
The variable names have been edited but still are equally informative
-
Missing data is coded correctly with “NA”
-
The conversion info has been removed and placed in the Data Dictionary
-
A complete Data Dictionary on a new tab (“dictionary”) was added, explaining each variable
-
The aphid and food treatment variables were made separate
Tidy Data version Data Dictionary tab
Notice in the Data Dictionary how there is a row for each variable with the name of the variable and an explanation for each variable.
Once your data is tidy, it is very easy to read in Excel data files, or they can be exported into a text file format like CSV to read straight into R or other programs.
CSV
Have a look at the Tidy Data dataset in .csv file format - open it with a plain text editor (e.g. Notepad in Windows, or similar). You will notice that each column entry is separated from others with a comma ,
, hence the name Comma Separated Values!
Tidy csv
1.4.4 Getting data into R
We still need to actually read data into R from external files. There are a very large number of ways to do this and most people enventually find the own workflow. For beginners, we think it is best to begin with Excel or CSV files in Tidy Data format.
The basics of reading external files from a script is to to use the read.xlsx()
function in the {openxlsx}
package (you will oprobably need to install this with the install.packages()
function), or else to use read.csv()
that comes standard in base R. We will briefly try both.
Best practice when working with files is to formally set your “working directory”. Basically, this tells R where your input (i.e. data) and output (like scripts or figures) files should be.
Working directory
There are several viable ways to set your working directory in R, e.g. via the Session menu:
However, the best way to do this this is to set your working directory using code with the setwd()
function. Here we should a workflow for Windows, which is similar on other computer systems. We consider the step of setting a working directory essential for best practice.
If you are unfamiliar with how to obtain the path to your working directory, open windows explorer, navigate to the folder you wish to save your script, data files and other inpouts and outputs. You can think of this folder as one that contains all related files for e.g. a data analysis project, or perhaps this bootcamp module!
Your directory window might look similar to this
Notice the folder “view” is set to “Details”, and also notice that the folder options are set to “Show file extensions. We recommend setting your own settings like this (if using Windows Explorer).
The pointer is indicated in the circle marked A in the picture above.
Left click the area to the right of the folder text once (where the pointer is in the picture above) and you should see something similar to the figure below, where the folder path is displayed and the text is automatically selected.
Your selected file path might look similar to this
Assuming you have opened the File Explorer in your working directory or navigated there, the selected path is the working directory path which you can copy (Ctrl + c
in Windows). I your script, you can now use getwd()
to get and priont your working directory path, and setwd()
, which takes a single character string of the path for your working directory for the argument dir
, to set it.
R file paths use the forward slash symbol “/” to separate file names. A very important step for Windows users when setting the working directory in R is to change the Windows default “" for forward slashes…
You need this for the following code: .xlsx data file
# Try this
getwd() # Prints working directory in Console
setwd("D:/Dropbox/git/DSgarage/static/scripts")
# NB the quotes
# NB the use of "/"
# NB this is MY directory - change the string to YOUR directory :)
getwd() # Check that change worked
## Read in Excel data file
install.packages(openxlsx, dep = T) # Run if needed
library(openxlsx) # Load package needed to read Excel files
# Make sure the data file "1.4-tidy.xlsx" is in your working directory
my_data <- read.xlsx("1.4-tidy.xlsx")
All being well, you should see the following data object in your Global Environement. Note the small blue button (A, circled below) you can press to espand the view of the variables in your data frame.
Note that the same procedure works with Comma Separated Values data files, and other kinds of files that you want to read into R, except that the R function used will be specific to the file type. E.g., read.csv()
for CSV files, read.delim
for TAB delimited files, or read.table()
as a generic function to tailor to many types of plain text data files (there are many others, but this is enough for now).
1.4.5 Manipulating variables in the Data Frame
Now that there is a data frame in your working environment, we can start working with the variables. This is a good time to think about the “R Space” metaphor. You are floating in R Space and you can see a data frame called “my_data”. You cannot see inside the container, so we will look at methods of accessing the data inside by name…
Some important topics:
-
The
names()
function -
The use of the
$
operator for data frames -
The use of the
str()
function for data frames -
The use of the index operator
[ , ]
-
The use of the
attach()
function
# Try this
class(my_data) # data.frame, a generic class for holding data
## The ```names()``` function ####
# The names() function returns the name of attributes in R objects
# When used on a data frame it returns the names of the variables
names(my_data)
# Note the conc.ind variable is classed numeric
# Note the treatment variable is classed as character (not a factor)
## The use of the ```$``` operator for data frames ####
# The $ operator allows us to access variable names inside R objects
# Use it like: data_object$variable_name
my_data$conc.ind
# Try this
conc.ind # Gives an error because the variable conc.ind is INSIDE my_data
## The use of the ```str()``` function for data frames ####
# The str() function returns the STRUCTURE of a data frame
# This includes variable names, classes, and the first few values
str(my_data) # This is similar to the graphical Global Environment view in RStudio
## The use of the index operator ```[ , ]``` ####
# The index operator allows us to access specified rows and colums in data frames
# (this works exactly the same in matrices and otehr indexed objects)
my_data$conc.tot # The conc.tot variable with $
my_data$conc.tot[1:6] # each variable is a vector - 1st 6 values
help(dim)
dim(my_data) # my_data has 18 rows, 6 columns
my_data[ , ] # Leaving the entries blank means return all rows and columns
names(my_data) # Note conc.tot is the 6th variable
names(my_data)[6] # Returns the name of the 6th variable
my_data[ , 6] # Returns all rows of the 6th variable in my_data
# We can explicitly specify all rows (there are 18 remember)
my_data[1:18 , 6] # ALSO returns all rows of the 6th variable in my_data
# We can specify the variable names with a character
my_data[ , "conc.tot"]
my_data[ , "conc.ind"]
# Specify more than 1 by name with c() in the column slot of [ , ]
my_data[ , c("conc.tot", "conc.ind")]
## The use of the ```attach()``` function ####
# The attach() function makes variable names available
# for a data object in R space
conc.ind # Error; the Passive-Aggressive Butler doesn't understand...
attach(my_data)
conc.ind # Now that my_data is "attached", the Butler can find variables inside
help(detach) # Undo attach()
detach(my_data)
conc.ind # Is Sir feeling well, Sir?
1.4.6 Practice exercises
1 Download the data file above and place it in a working directory. Set your working directory. Read in the data file and place it in a data frame object named “data1”. After examining the data, use mean()
to calculate the mean of the variable “length” and report the results in a comment to 2 decimal points accuracy. Show your R code.
2 Show the code to convert the diet variable to an ordinal factor with the order “control” > “enhanced”, and the sex variable to a plain categorical factor.
3 Show code for 2 different variations of using only the “[ , ]” operator with your data frame to show the following output:
diet length
8 control 6
9 control 7
10 control 6
11 enhanced 8
12 enhanced 7
13 enhanced 9
4 Show code to read in a comma separated values data file that does not have a header (first row containing variable names).
5 Describe in your own words what the attach() function does.
6 Write a plausible practice question involving any aspect of manipuation of a data frame.