This guide is suitable for new R users or advanced R users looking for information on specific topics. It was created using R version 3.2.2. You can find the most recent copy of R on the R website: Download R. If your computer already has R, make sure that it is R 3.0.2 or higher in order to run some packages, especially Swirl, which we use later in the guide. The data used in most of the sections is a modified version of the 2014 Toronto Municipal Political Poll, and the two other data sets used are fabricated data sets created for the purpose of this guide.
Table of Contents
Learning R Online
Basics of R
When you first open R, it looks like this. The console window is how all commands are executed in R. As you enter commands in red, R will respond in blue. First, you should open the new script window, which is found under File and then New Script. For clarity as you work, next you should organize the windows, which can be done through Windows and then either Tile Horizontally or Tile Vertically.
Your screen should look like this, with your script editor on one side and the console on the other. Now, you are ready to begin using R!
Because R is a script based program (meaning that you have to type what you want it to do rather than clicking buttons), the editor window allows you to carefully write the script that you want and then enter it into the console window. Scripts can be saved so you can use them again and again. There are three options to run command lines or selections (you select by highlighting with your cursor) from the editor window:
Right click to bring up a menu. “Run line or selection” is the first option.
Ctrl + R will automatically run the line or selection
The Run line or selection button is the centre button beneath Edit and Packages (the button does not appear if you’re in the console window).
If you are writing your commands directly into the console window, you just have to hit enter to have R run them.
The basic format for commands is the combination of functions and information. Functions are a word or a few words, sometimes just abbreviations of words, followed by brackets, for example: read.table() or plot(). The words give you a sense of what the command is supposed to do, and inside the brackets is where you place the useful information like which table you want read, what the x-range is, or what data you want plotted. Commands can be just one function or multiple functions strung together. Here are some examples:
This gives you the sum of 4 and 5.
This gives you the mean of the sums of the two different sets of numbers, thus combining the sum function and the mean function. Once you know more about R, you can even write your own functions. More on scripting will come later.
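The two commands described above are not shown in this copy of the guide. A minimal sketch of what they could look like follows; the numbers in the second command are assumed purely for illustration.

```r
# Sum of 4 and 5
total <- sum(4, 5)
total

# Mean of the sums of two sets of numbers (the sets are illustrative)
avg <- mean(c(sum(1, 2, 3), sum(4, 5, 6)))
avg
```

The second command nests sum() inside c() inside mean(), which is what "stringing functions together" means in practice.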
Commands are case-sensitive, i.e., load() will run and Load() will not
Always check spelling – some commands are pluralized and some are not, i.e., names() will run and name() will not
Check for spaces and misplaced symbols
R is a bare bones program. Its utility comes from the 3500+ packages available for installation. You may think of these packages as similar to extensions or apps, which some other programs use, like Google Chrome or Apple products. Each package serves a purpose and has specific commands you can use. Some are good for creating graphics; some are good for statistical analysis. Not to worry, you won’t need to download all of them. Packages can be found on the Comprehensive R Archive Network (CRAN), and there are many blog posts about some of the best R packages to use. The CRAN webpage for the package is important to find because it will, typically, list all of the commands associated with the package and their functions.
Using packages is a two-step process. First you must install the package you want, and then you must load it into R. To begin, you set your CRAN mirror, which is the second option under the Packages button on the menu bar. You want the CRAN mirror closest to your location because this will affect the time it takes to download. Also, not every package is available on all mirrors, so if you don’t see what you want, set your mirror to 0-Cloud [https] to view every single package. Because packages are open source, you will want to check for updates regularly.
Packages -> Install Packages -> Select [package]
Packages -> Load package -> Select [package]
Check that your desired package is loaded into R
Packages are built on each other. When installing a new package, it should install the ones it depends on too. If a package is not working, double-check that its dependent packages are also installed.
Because packages are open source, be sure you check for updates frequently; this does not occur automatically (“Update packages” is an option under Packages).
When all else fails, there are, usually, comprehensive entries on CRAN on each package.
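The same install-then-load steps can also be typed as commands instead of using the menus. The sketch below wraps them in a small helper; the name ensure_package is hypothetical, and the built-in tools package is used only so that the example runs without downloading anything.

```r
# Hypothetical helper: install a package only if it is missing, then load it
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # step 1: install (needs a CRAN mirror)
  }
  library(pkg, character.only = TRUE)   # step 2: load into the session
  pkg %in% loadedNamespaces()           # TRUE once the package is loaded
}

loaded <- ensure_package("tools")       # "tools" ships with R
loaded
```

Replacing "tools" with the name of a CRAN package would trigger a download the first time the helper is called.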
Learning R Online
Since you’re new (or even if you’re trying to re-familiarize yourself or learn new things), the best package to download is Swirl. Swirl is like a game which will take you through every aspect of R. There are many courses and lessons to choose from depending on your familiarity with the program and what you want to do with R. The lessons preloaded in Swirl from its R Programming course are:
Basic Building Blocks
Workspace and Files
Sequences of Numbers
Matrices and Data Frames
lapply and sapply
vapply and tapply
Looking at Data
Dates and Times
Going through those will give you a solid foundation for you to begin using R all on your own. More advanced courses made by Swirl are Data Analysis, Mathematical Biostatistics Boot Camp, Open Intro, Regression Models, Getting and Cleaning Data, and Statistical Inference. All of this can be found at the following weblink:
The remainder of this guide will walk you through R if you don’t want to use Swirl or if you want to quickly familiarize yourself with a function or process.
Basics of R
Beginning note: it is highly recommended that you have R open and try things out as you learn them in this guide. The data files used to create this guide can be found in the folder, so you can follow along.
The R environment can be used to compute calculations and assign variables. As a new R-user, you might want to practice these simple exercises by typing them into the console window:
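The exercises themselves are missing from this copy of the guide, so the following is only a sketch of the kind of calculation and assignment you could try; the variable names x and y are assumptions.

```r
4 + 5        # addition
10 / 4       # division
2^3          # exponentiation

x <- 12      # assign the value 12 to x
y <- x * 2   # use x in a calculation and store the result in y
y            # typing a name prints its value
```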
There are different data types in R. These data types can be numeric, integer, logical/boolean, character/string, vector, matrix, array, list, data frame etc. It is useful to know the data type in order to know what functions can be performed on the object. To determine the type of data, you can use the class() function. The following commands create different variables and check their type using the class() function. It is possible to convert from one data type to another by using functions such as as.integer().
As you can see, pi has been converted from a number to an integer to a vector to a matrix using various R functions.
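The commands for this conversion are not shown in this copy. A minimal sketch of one way to take pi from a number to an integer to a vector to a matrix is given below; the object names and the extra integers are assumptions.

```r
num <- pi                      # a number
int <- as.integer(num)         # converted to an integer: the decimals are dropped
vec <- c(int, 4L, 5L, 6L)      # combined with other integers into a vector
mat <- matrix(vec, nrow = 2)   # arranged as a 2 x 2 matrix

class(num)       # "numeric"
class(int)       # "integer"
is.vector(vec)   # TRUE
is.matrix(mat)   # TRUE
```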
Before you take a deep dive into learning R, here are some other basic tips, including some initial steps to take in case you encounter an error.
Environment: The R environment is the current workspace. You can think of it like the memory of all assigned values, which you can draw on as you work. For example, if 33 is assigned to black (“black<-33”), then “black” now exists within the R environment with a value of 33, and if the environment is cleared, then “black” will no longer have that value. You can view the names of all assigned objects using the ls() function.
Functions: As previously described, functions perform various tasks in R, and as simple as they may seem on the surface, each function has a number of arguments and other details, which can be found on the CRAN website. Arguments are the most important because they tell you what information to add to improve the output of the function (type “?read.csv” into R, without the quotation marks, for a good example).
Printing: Print is R jargon for viewing something; it gets ‘printed’ to the console. In order to print a dataset, you only need to type its name. For example, if 67 is assigned to blue (“blue<-67”), then blue must be printed to see its assigned value.
R does not automatically display the inputted information because you may be running multiple commands before you want to view the information; therefore, you must print if you want to immediately see the data.
There is no undo. Every time a command or script is run there is no way of undoing it. For this reason, you may choose to write your script in the editor window before running it.
Errors: when facing an error, always double-check your script for mistakes, and check that R is set to the correct working directory and that all necessary libraries are loaded. If all of that is fine, there are several other options for troubleshooting: 1) search the error on Google and the R internet forums, 2) ensure that the functions you are using are the best ones for what you want to accomplish, and 3) check that all of the arguments associated with the functions have been properly defined. If none of this solves your error, the best option is to write a detailed description of what you were trying to do and post it on an R forum.
Here are some operators you may encounter.
?: One of the most important operators in R because it is the equivalent of a help button. When used in combination with a function, such as “?read.csv”, a help page will open explaining what the function does, what its arguments are, and what some associated functions are.
<-: This operator assigns the value on the right side to the name on the left, so “five<-5” means that the numerical value 5 is assigned to the name five. Therefore, inputting “five+5” would result in 10.
$: To access one variable in a dataset, use the dollar sign “$”. For example, ont14$vote1 returns the vote1 variable (the vote1 column).
" ": All information put between quotation marks must be literal because R will search for those exact characters.
#: You might find this operator at the start of a command line in CRAN files or any descriptions of R functions. This tells R not to run that line and to move on to the next, so it is a way to provide line-by-line commentary without interrupting R’s ability to run the script.
c(): As seen in the vector example on the previous page, c(), which stands for concatenate, will combine its arguments, both numbers and words, into a vector.
+: Another form of concatenation, typically seen at the start of a line when a long command spans multiple lines, so rather than R interpreting each new line as a new command, it reads it all as one single command.
==: A boolean (meaning there are two possible values, true or false) operator which checks whether the values on the left side are exactly the same as the values on the right, i.e., 5==5 would be true and 5==9 would be false.
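To see several of these operators working together, here is a short sketch. The data frame people and its values are made up for illustration and are not part of the guide's data.

```r
five <- 5                       # <- assigns the value 5 to the name five
five + 5                        # 10

colours <- c("red", "blue")     # c() combines values into a vector
people <- data.frame(name = c("Ann", "Bo"), age = c(31, 45))
people$age                      # $ pulls one column out of a data frame

five == 5                       # TRUE
five == 9                       # FALSE

# A line starting with # is a comment and is not run.
# ?read.csv                     # remove the leading # to open the help page
```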
Setting and exploring directory
The directory is the place on your computer that is the home for R; therefore, this is where R is saving files and also where R is looking for files. Because R might be using a folder buried deep in your computer’s hard drive, there are two ways of finding and setting your working directory. First, you can use getwd() to find the current directory and setwd() to set the directory to a different path. NOTE: you need to use forward slashes (/) instead of backslashes (\) for directory paths in R.
The second way is by going under File on the menu bar and going down to Change dir. This way is best if you do not know the exact name and location of where you want to set your new working directory because it allows you to go through all of the files on your computer.
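A minimal sketch of the first approach follows. A temporary folder stands in for your own path here, which is an assumption made only so that the example can run on any computer.

```r
old_dir <- getwd()     # remember the current working directory
new_dir <- tempdir()   # stand-in for a path such as "C:/Users/me/Documents/data"

setwd(new_dir)         # change the working directory
current <- getwd()     # confirm where R is now looking for files

setwd(old_dir)         # switch back when finished
```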
Reading CSV Files
Use read.csv() to read csv files. Set the header option to TRUE if the file has column titles and FALSE otherwise. For read.csv(), the default is TRUE.
csv <- read.csv("ont14.csv", header=TRUE)
Reading Excel Files
There are different methods to read Excel worksheets. Method 1 involves copying the data set from Excel and using read.table() to read the data set that is in the clipboard. This is the best option for R versions 3.0.2 or older. Methods 2 and 3, which use read.xlsx() from the xlsx package and loadWorkbook() with readWorksheet() from the XLConnect package respectively, require R version 3.0.3 or later.
excel1 <- read.table("clipboard") #Method 1: reads data copied from Excel
library(xlsx) #Method 2
excel2 <- read.xlsx("excelfile.xlsx", sheetName="Sheet1", header=FALSE)
library(XLConnect) #Method 3
workbook <- loadWorkbook("excelfile.xlsx")
excel3 <- readWorksheet(workbook, sheet="Sheet1", header=TRUE)
Reading Fixed Files
Use the read.fwf() function to read fixed format files; it is part of the utils package that ships with R, so no extra library is needed. Include the width of every variable in the widths option.
fixedfile <- read.fwf("ontfixed.fix", widths=c(14, 24, 2, 15, 2, 2, 10))
* Data: ontfixed.fix
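If you do not have the guide's file at hand, the sketch below shows how the widths option works on a tiny fixed format file written to a temporary location. The file contents, the widths, and the column names are all made up for illustration.

```r
# Write a small fixed-width file: 5 characters for name, 2 for age, 1 for gender
fixed_path <- tempfile(fileext = ".fix")
writeLines(c("John 10M", "Xu   15F", "Aisha24F"), fixed_path)

# widths gives the number of characters taken up by each variable, in order
people <- read.fwf(fixed_path, widths = c(5, 2, 1),
                   col.names = c("name", "age", "gender"),
                   strip.white = TRUE)   # drop the padding spaces
people
```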
Reading Stata Files
Use the read.dta() function from the foreign library to read Stata data sets. To read Stata 13 data sets, use the read.dta13() function from the readstata13 library.
dates <- read.dta("dates.dta")
* Data: dates.dta
Reading Other Types
The read.table() function is useful to read data sets that are in table format and create a data frame. It has options for header, delimiter (sep), skip (lines to skip before reading data) etc. The scan() function is also available, and it reads data into a vector or a list from a file.
Entering Data Manually
To enter data manually, create a vector of data for each variable or observation. Combine the vectors as columns of variables using cbind() or rows of observations using rbind() into a matrix. The data can be changed into a data frame using the as.data.frame() function or kept as a matrix. Note that we create one vector per variable here.
name = c("John", "Xu", "Aisha")
age = c(10, 15, 24)
gender = c("male", "female", "female")
matrixdata <- cbind(name, age, gender)
Gives us this:
Now that the information has been combined in a more easily interpreted way, it could be helpful to change it into a data frame. What you want to do determines whether you leave it as a matrix or change it to a data frame. This is how to change to data frame:
data <- as.data.frame(matrixdata)
As you can see, a data frame is a visually cleaner way of working with the information.
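One caveat, offered as a sketch rather than as part of the original guide: cbind() on vectors of mixed types produces a character matrix, so age is stored as text. Building the data frame directly with data.frame() is one way to keep each column's own type; the name people is an assumption.

```r
name <- c("John", "Xu", "Aisha")
age <- c(10, 15, 24)
gender <- c("male", "female", "female")

matrixdata <- cbind(name, age, gender)
is.character(matrixdata[, "age"])   # TRUE: age became text in the matrix

people <- data.frame(name, age, gender)
is.numeric(people$age)              # TRUE: age is still a number
mean(people$age)                    # so calculations work as expected
```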
Check that the file you want is in your working directory; if it is not, either move the file to that folder or change your working directory
Double-check the arguments associated with the function you are using to load the file; you may need to provide additional information in order to help R load the file
Ensure that you are using the correct R function to load that file type, i.e., if you are loading Stata data, make sure you are using the read.dta() function
To begin, open the corresponding data files for this section. You can do this using the read.csv() method discussed above; you will also want to create a name for this data file in R.
ont14<- read.csv("ont14.csv", header=TRUE)
Unless you opened the .csv file beforehand, you don’t know much about the information you just loaded into R. To find out more, use the dim() function to find out the dimensions of a data set. To view the contents of a data set, use the ls.str() function. The class() function returns the data type. For example, our data set ont14 is a data frame. Alternatively, the str() function does all three.
There are a number of useful functions when it comes to exploring data. Here are some other commands you can use to get a better sense of the data you are working with, whether it’s the file you just loaded or anything in the future (for a better understanding of each function, test them out on the ont14 file you just opened).
ls() lists all of the objects available in the workspace, i.e., all the data and variables you have defined
str() displays the structure of an object in a compact way
ls.str() combines the above functions to list all objects with details about data type and content
summary(), similar to the previous functions, lists the structure and summarizes the variables rather than displaying all of them
class() tells you the data type, i.e., vector, matrix, character, etc.
names() can either change the names in an object or tell you all of the defined names in the object, i.e., all of the column titles in an Excel file
object.size() tells you how much memory is taken up on your computer by the object
dim() tells you the dimensions of the object
length() tells you the number of elements in the object (for a data frame, this is the number of columns)
ncol() tells you the number of columns
nrow() tells you the number of rows
head() shows you the first six lines of a vector, matrix, table, data frame, or function
tail() shows you the last six lines of a vector, matrix, table, data frame, or function
If you tried all of those on the ont14 dataset, you will notice that some of the functions produce overlapping information, so using each of these would be highly repetitive. However, some of these functions can be used to modify the data, so that is where the true utility of the functions comes into play. For more detailed information, you can always check the help page, which is done by placing a ‘?’ before the name of the function.
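If you do not have ont14 loaded, the functions listed above can be tried on any data frame. The sketch below uses a small made-up data frame called polls; its name, columns, and values are assumptions.

```r
polls <- data.frame(id = 1:8,
                    income = c(3, 5, 2, 7, 4, 6, 1, 5),
                    vote1 = c("Tory", "Chow", "Ford", "Tory",
                              "Chow", "Tory", "Ford", "Chow"))

dim(polls)       # number of rows, then number of columns
nrow(polls)      # rows only
ncol(polls)      # columns only
names(polls)     # column titles
class(polls)     # "data.frame"
str(polls)       # compact structure: type and first values of each column
summary(polls)   # summary of every variable
first_rows <- head(polls)   # first six rows
last_rows <- tail(polls)    # last six rows
```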
To open or, in R terminology, print the content of a data set or a variable in the R console, simply type its name. A data set has 2 dimensions. The first dimension is the row number and the second dimension is the column number. The two dimensions are separated by a comma. For example, ont14[1,2] prints the value in the first row and second column of the ont14 data set, ont14[1, ] prints the first row and ont14[,2] prints the second column. As you may notice, we use square brackets when isolating data and round brackets when working with functions. To print a range of rows or a range of columns, indicate the range separated by a colon.
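The square-bracket patterns described above can be sketched on a small made-up data frame (the name polls and its values are assumptions, not the guide's data):

```r
polls <- data.frame(id = 1:4,
                    income = c(3, 5, 2, 7),
                    vote1 = c("Tory", "Chow", "Ford", "Tory"))

polls[1, 2]     # value in the first row, second column
polls[1, ]      # the whole first row
polls[, 2]      # the whole second column
polls[2:3, ]    # rows 2 to 3
polls[, 1:2]    # columns 1 to 2
```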
Remember to access one variable, use the dollar sign “$”. For example, ont14$vote1 returns the vote1 variable. The attach() function makes the variables of a data set available in the workspace. Therefore, one can access the variable vote1 by simply writing it instead of using the dollar sign. To release the variables from the workspace, use the detach() function.
attach(ont14) #makes the variables of ont14 available by name
vote1[1:3] #The first votes of first three subjects
detach(ont14) #releases the names
vote1[1:3] #Variable not available
The summary() function gives a summary of the object. If the object is a data set, it gives a summary of all the variables in the data set, and if it is a variable, it gives a summary of that variable. Other descriptions can be obtained using the fivenum(), min(), mean(), max(), var(), and quantile() functions.
Note: always remember to use the attach() function at the start of each section to ensure that the variables are in the workspace.
There are different methods to obtain frequency tables. Remember to install and load a package before using a function from a specific external library. Every table displays the information differently; therefore, each has a different time and place for which it is best suited. Try each out and think about which you liked best and what the positives and negatives are to each table.
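The individual methods are not shown in this copy of the guide. As a starting point that needs no extra package, here is a sketch using base R's table() and prop.table(); the vote1 values are made up for illustration.

```r
vote1 <- c("Tory", "Chow", "Ford", "Tory", "Chow", "Tory")

counts <- table(vote1)         # frequency of each category
counts

props <- prop.table(counts)    # the same table as proportions
round(100 * props, 1)          # shown as percentages
```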
The following codes can be used to make bar charts, pie charts, boxplots and scatterplots. These are just a few of the many data visualizations you can produce using R. Each example shows you how to add more information to better develop the data visualization, so the below images are made using the last code in the entry.
barplot(table(income), col="lightblue", main="Income Distribution", xlab="Income", ylab="Frequency")
pie(table(vote1), main="First Vote distribution")
pie(table(vote1), main="First Vote distribution", col=rainbow(6))
boxplot(income~fordapp, col="cornflowerblue", main="Income boxplots for each approval category of Rob Ford", xlab="Approval Category of Rob Ford", ylab="Income", notch=TRUE, names=c("No", "Yes"))
plot(ageavg, income, bg="lightblue")
plot(ageavg, income, pch=1, cin=0.01)
plot(jitter(ageavg, 20), jitter(income, 20))
plot(jitter(ageavg, 20), jitter(income, 20), pch=18, col="red")
plot(jitter(ageavg, 20), jitter(income, 20), pch=18, col="red", main="Scatterplot of Average Age by Income", xlab="Average Age", ylab="Income")
Dropping & Adding Variables
This ability is useful when the dataset includes information you do not need or if you are trying to reorder the variables. The functions ls() and ls.str() list the objects that are available in the workspace. To remove a variable from the workspace, use rm() or the equivalent remove() function.
To remove a variable from the workspace:
rm(gender2) #or remove() can also remove datasets
To remove all objects from the workspace:
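The command for this step is missing from this copy. In the console, the usual idiom is rm(list = ls()). The sketch below applies the same idea to a separate environment (hypothetically named scratch) so that running it does not wipe your own workspace.

```r
scratch <- new.env()                          # a stand-in for the workspace
assign("gender2", c(1, 2), envir = scratch)
assign("female", c(0, 1), envir = scratch)

# ls() lists every object; rm(list = ...) removes them all at once
rm(list = ls(scratch), envir = scratch)

remaining <- ls(scratch)
length(remaining)                             # nothing is left
```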
To remove a variable from a data frame, use the methods below. Method 1 excludes the variable in question in the creation of a new data set. Method 2 deletes the variable in the same data frame. As you can see, there are now 28 variables in the new data sets instead of 29.
Method 1: Exclude
Method 2: Delete
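The code for the two methods is not shown in this copy. One common way to write each of them is sketched below on a small made-up data frame; the names polls and polls_new and the choice of the gender column are assumptions.

```r
polls <- data.frame(id = 1:3, income = c(3, 5, 2), gender = c(1, 2, 2))

# Method 1: Exclude - build a new data frame without the variable
polls_new <- polls[, names(polls) != "gender"]
ncol(polls_new)

# Method 2: Delete - remove the variable from the same data frame
polls$gender <- NULL
names(polls)
```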
To remove an existing label, use the factor() function to replace the label:
vote2010a<-factor(vote2010, levels=c("Yes", "No"), labels=c("Voted in 2010 Municipal Election", "Did not vote in 2010 Municipal Election"))
vote2010b<-factor(vote2010a, levels=c("Voted in 2010 Municipal Election", "Did not vote in 2010 Municipal Election"), labels=c("Yes", "No"))
To add a variable to a data frame:
For clarification, you must first assign information to the variable (in this case, the numbers 1 through to 889) and then add that variable to the existing dataset. If you do not create the variable first, then this command script will return an error. There are other methods of adding variables to a data frame, but this is the simplest.
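The command itself is missing from this copy. A minimal sketch of the two steps described above follows, using a small made-up data frame named polls in place of ont14 (with the guide's data, the sequence would run from 1 to 889).

```r
polls <- data.frame(income = c(3, 5, 2, 7))

id <- 1:nrow(polls)   # step 1: create the variable
polls$id <- id        # step 2: add it to the existing data frame
polls
```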
To reorder the variables in a data set, create a new data frame with the variables in the order of your choice. This method can also be used to select variables.
neworder <- as.data.frame(cbind(vote1, vote2, votenot, fordapp, voteford,
fordleave, fordres, votefordafterrehab))
The following shows how to label variable values in examples 1 and 2 and variables in example 3. To label the values of a categorical variable, use the factor() function. To label variables or data sets, use the label() function from the Hmisc package.
library(Hmisc) #provides the label() function
fordapp2<-factor(fordapp, levels=c(0,1), labels=c("No", "Yes"))
label(fordapp2) <- "Approval of Rob Ford"
vote2010a<-factor(vote2010, levels=c("Yes","No"), labels=c("Voted in 2010 Municipal Election", "Did not vote in 2010 Municipal Election"))
partyfactor <-factor(party2011, levels=c(1, 2, 3, 4, 5), labels=c("Progressive Conservative", "Liberal", "NDP", "Green", "Other parties"))
The following examples show how to recode from factor to numeric, from numeric to factor and between numeric variables. To recode, use the recode() function from the car package. Now, you may encounter an interesting problem. Because recode() is also the name of a function in the Hmisc package previously used in the Labeling Variables section, it may not do what it is intended for in this section. In this case, calling car::recode() tells R to use the version from the car package; the car package also provides Recode() as an alias for the same function. This is a good reminder to always look up a new function to learn both its arguments and its idiosyncrasies.
votefordnum <- recode(voteford, '"Don\'t know"=2; "No"=0; "Yes"=1;', as.factor.result=FALSE)
#Use backslash to escape the quotation marks
gender2 <- recode(gender, '1="Male"; 2="Female";', as.factor.result=TRUE)
female <- recode(gender,'1=0; 2=1;', as.factor.result=FALSE)
Generating New Variables
To generate a new variable that is a combination of other variables, assign the combination to a new variable name. For example, below, totalapp is the average of candidate approval rates. In the code below, the average is assigned to “totalapp”. This variable has been recategorized as an ordinal categorical variable “votepnum” and a nominal categorical variable “votep” using the recode() function. The standardized values for totalapp can be found using the scale() function or by calculating them directly. Note that the summaries of the standardized values from the two methods, ztotalapp and ztotalapp2, are the same.
totalapp <- (as.numeric(fordapp) + as.numeric(chowapp) + as.numeric(toryapp) + as.numeric(stinapp) + as.numeric(sokapp))/5
votepnum <- recode(totalapp,'0:0.33=0; 0.331:0.66=1; 0.661:1=2', as.factor.result=TRUE)
label(votepnum) <-"Voting Personality (numeric)"
votep<- recode(totalapp,'0:0.33="Strict"; 0.331:0.66="Average"; 0.661:1="Lenient"', as.factor.result=TRUE)
label(votep) <-"Voting Personality"
ztotalapp <- scale(totalapp, center=TRUE, scale=TRUE)
ztotalapp2 <-(totalapp - mean(totalapp, na.rm=TRUE))/sd(totalapp, na.rm=TRUE)
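To check that the two standardization methods agree, you can compare them directly. The sketch below uses a short made-up vector in place of the guide's totalapp; the object names z1 and z2 are assumptions.

```r
totalapp <- c(0.2, 0.4, NA, 0.8, 1.0)   # illustrative values, including a missing one

# Method 1: scale() returns a one-column matrix, so convert it to a plain vector
z1 <- as.vector(scale(totalapp, center = TRUE, scale = TRUE))

# Method 2: subtract the mean and divide by the standard deviation
z2 <- (totalapp - mean(totalapp, na.rm = TRUE)) / sd(totalapp, na.rm = TRUE)

all.equal(z1, z2)   # TRUE when the two methods match
```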
The by() function can be used to view the average candidate approval rates by another category. For example, below, we can see the average approval rates by income category.
by(ont14$totalapp, ont14$income, function(object) mean(x=object, na.rm=TRUE))
activity<-(as.numeric(fordapp) + as.numeric(chowapp) + as.numeric(toryapp)
+ as.numeric(stinapp) + as.numeric(sokapp) + as.numeric(chowheard) + as.numeric(toryheard)
+ as.numeric(stinheard) + as.numeric(sokheard))/9
Generating ID Variable
To subset, use the subset() function. We want to subset, for example, the female voters with the highest income on one hand and the male voters with the lowest education on the other hand. These data sets are saved as subset1 and subset2 respectively.
subset1<-subset(ont14, income==7 & gender==2)
subset2<-subset(ont14, education==4 & gender==1)
To append, use the rbind() function, which stacks data sets with matching columns on top of each other. The number and order of variables in the two data sets must match in order for the rbind() function to be effective. Below, we append subset1 and subset2 to create the new data frame appenddata.
appenddata<- rbind(subset1, subset2)
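The same subset-then-append steps can be tried without the guide's file. The sketch below uses a small made-up data frame named polls in place of ont14, keeping the guide's coding (income 7 as the highest category, education 4 as the lowest, gender 2 for female) as an assumption.

```r
polls <- data.frame(id = 1:6,
                    income = c(7, 3, 7, 1, 7, 2),
                    education = c(2, 4, 1, 4, 3, 4),
                    gender = c(2, 1, 2, 1, 1, 1))

subset1 <- subset(polls, income == 7 & gender == 2)      # both conditions must hold
subset2 <- subset(polls, education == 4 & gender == 1)

appenddata <- rbind(subset1, subset2)   # stack the rows of the two subsets
appenddata
```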
Keeping & Dropping Variables
Managing data can involve keeping or dropping specific variables. For example, in example 3 below, only ford related variables are used to create a new data frame “subset3”. In example 4, we drop the variables in the first columns until column “votefordafterrehab” in order to keep the variables that provide sample information in “subset4”.
subset3<- as.data.frame(cbind(id, fordapp, voteford, fordleave, fordres, votefordafterrehab))
Because R does not include an undo ability, if you wish to combine data you have previously subsetted or specific data you want in a combined data frame, the merge() function will accomplish this.
mergedata<-merge(subset3, subset4, by="id")
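To see what merge() does with the by argument, here is a sketch with two small made-up data frames that share an id column; the names left and right and all values are assumptions.

```r
left <- data.frame(id = c(1, 2, 3), fordapp = c(0, 1, 1))
right <- data.frame(id = c(2, 3, 4), age = c(34, 51, 28))

# By default only ids found in both data frames are kept
merged <- merge(left, right, by = "id")
merged

# all = TRUE keeps every id and fills the gaps with NA
merged_all <- merge(left, right, by = "id", all = TRUE)
merged_all
```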
If you need to reshape your data, you can either edit it in your original format (be it SPSS, Stata, Excel, etc.) or, if you have already worked on it in R and do not want to redo that work, then there are ways to reshape datasets using R. There are a variety of other ways of massaging data, which are not covered in this section, including base R's t() function and the melt() and cast() functions from the reshape package.
wide <- read.csv("wide.csv", header=T)
Wide to Long
long<-reshape(wide, varying=c("drug1", "drug2", "drug3"), v.names="drug_yn", timevar="drugtype", times=c("1", "2", "3"), new.row.names = 1:12, direction="long")
by(long$drug_yn, long$drugtype, function(object) mean(x=object))
by(long$drug_yn, long$drugtype, function(object) sd(x=object))
cbind(as.vector(by(longsort, list(longsort$drugtype), function(x) mean(x$drug_yn))), #mean within each drug type
as.vector(by(longsort, list(longsort$drugtype), function(x) sd(x$drug_yn)))) #standard deviation within each drug type
Long to Wide
wide2<-reshape(longsort, timevar="drugtype", idvar=c("id", "age", "score1", "score2"), direction="wide")
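If you do not have wide.csv, the round trip from wide to long and back can be sketched on a small made-up data set. The values below are invented for illustration, and idvar is set explicitly, which is a choice of this sketch rather than the guide's.

```r
wide <- data.frame(id = 1:4,
                   age = c(30, 41, 25, 37),
                   drug1 = c(1, 0, 1, 1),
                   drug2 = c(0, 0, 1, 1),
                   drug3 = c(1, 1, 0, 1))

# Wide to long: the three drug columns become one drug_yn column,
# with drugtype recording which column each value came from
long <- reshape(wide, varying = c("drug1", "drug2", "drug3"),
                v.names = "drug_yn", timevar = "drugtype",
                times = c("1", "2", "3"), idvar = "id", direction = "long")
nrow(long)

# Mean of drug_yn within each drug type
by_mean <- tapply(long$drug_yn, long$drugtype, mean)
by_mean

# Long to wide: one row per subject again
wide2 <- reshape(long, timevar = "drugtype", idvar = c("id", "age"),
                 direction = "wide")
wide2
```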
Highest and Lowest Values in a Group
Here, we want to find the voters with the lowest and highest education level for each city by age group. First, we sort by all three variables (city, age, and education) from lowest to highest. In this case, we also sort by id because some groups with the same city, age and education values have more than one subject in them. Then, we use the aggregate() function to find the first individual with the highest and lowest education level in each city by age group.
Generating ID variable
Sorting the groups
ont14o<-ont14[order(ont14$city, ont14$age, ont14$education, ont14$id),]
Creating variables sorted by id
Separating out the lowest and highest list in each group
lowest<-aggregate(ont14o, by=list(ont14o$age, ont14o$city), head, 1)
highest<-aggregate(ont14o, by=list(ont14o$age, ont14o$city), tail, 1)
Checking for duplicates, this will return TRUE for no duplicates
length(unique(ont14$id)) == nrow(ont14)
Voilà, here are the relevant variables.
lowest[1:10 ,c(1:5, 28,32,33)]
highest[1:10 ,c(1:5, 28,32,33)]
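The sort-then-aggregate pattern above can be tried on a small made-up data set. In this sketch, the name voters, the single grouping variable, and all values are assumptions; the guide's version groups by both age and city.

```r
voters <- data.frame(id = 1:6,
                     city = c("A", "A", "A", "B", "B", "B"),
                     education = c(2, 1, 3, 4, 2, 3),
                     stringsAsFactors = FALSE)

# Sort within each city from lowest to highest education, then by id
sorted <- voters[order(voters$city, voters$education, voters$id), ]

# head(x, 1) keeps the first row of each group, tail(x, 1) the last
lowest <- aggregate(sorted[, c("id", "education")],
                    by = list(city = sorted$city), FUN = head, 1)
highest <- aggregate(sorted[, c("id", "education")],
                     by = list(city = sorted$city), FUN = tail, 1)
lowest
highest
```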
In order to work with the dataset for this section, you will need to use the foreign package, which reads and writes data created by some versions of Epi Info, Minitab, S, SAS, SPSS, Stata, Systat, and Weka.
ls.str(dates) #which one are numeric vs char
Time in R is counted from January 1, 1970. This means that the number of seconds, hours, days, months, etc. following that date will be expressed as a positive number and that the same information preceding that date will be expressed as a negative number. Since dates are often read into R as character strings or plain numbers, some modifications will need to be made to present them in the day/month/year format. The following data consists of two dates: one is in string format and the other one is numeric, as can be seen with the ls.str() function, as done above.
The lubridate package provides functions to identify date-time data, extract components, perform accurate math on date-time variables, handle time zones etc. Below, we use the mdy() and ymd() functions to convert character-type dates that are in month-day-year and year-month-day formats. Since the variables dmon, dday and dyear are numeric and come separately, to create a date variable, we concatenate the three variables with the paste() function before converting the result into a date format with the ymd() function.
Converting into elapsed days since
event<-ymd(paste(dyear, "-", dmon, "-", dday, sep=""))
The format() function can be used to control the display of elapsed dates.
fbirthday<-format(birthday, "%d %b %Y")
fevent<-format(event, "%d %b %Y")
Since the third birthday is from year 72CE, we can assume that this is a mistake and requires fixing. This can be done by manually replacing the value of birthday for the 3rd observation.
fbirthday[3]<-"21 Jan 1972" #replaces only the 3rd observation
The number of days elapsed can be converted into weeks, months and years as follows:
weeks <- diff_days/7
months <- diff_days/30.5
years <- diff_days/365.25
as.data.frame(cbind(fbirthday, fevent, diff_days, weeks, months, years))
Note: R has a built-in function called as.Date([date], [format]) that can be used to convert character-type dates to actual dates. However, the lubridate package is more popular because of its ease of use.
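For comparison, the same kind of conversion can be sketched with as.Date() alone, without loading lubridate. The dates and object values below are made up for illustration and are not taken from dates.dta.

```r
# Character date in month/day/year order
birthday <- as.Date("01/21/1972", format = "%m/%d/%Y")

# Separate numeric parts pasted together, then converted
dyear <- 2014
dmon <- 10
dday <- 27
event <- as.Date(paste(dyear, dmon, dday, sep = "-"), format = "%Y-%m-%d")

# Dates are stored as days since January 1, 1970
as.numeric(as.Date("1970-01-02"))   # 1

# Days elapsed between the two dates, and the same span in years
diff_days <- as.numeric(event - birthday)
years <- diff_days / 365.25

format(birthday, "%d/%m/%Y")        # "21/01/1972"
```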