Introduction to R

This guide is suitable for new R users and for advanced R users looking for information on specific topics. The topics covered are importing, exploring, modifying and managing data.

The main dataset used is the flights dataset. It contains the US domestic flights in January 2020 [1]. The other two datasets used are fabricated datasets created for the purpose of this guide.
The link to the recording of this workshop can be found here. Additional resources are listed below. If you need assistance, fill out the support request form.

Importing Data
Exploring Data
New Variables
Managing Data

Importing Data

Working directory
The working directory is the folder on your computer that R treats as its home: it is where R saves files and where R looks for files. Because R might be using a folder buried deep in your computer’s hard drive, there are two ways of finding and setting your working directory. First, you can use getwd() to find the current directory and setwd() to set the directory to a different path. NOTE: write directory paths in R with forward slashes (/) rather than single backslashes (\). A backslash is an escape character in R strings, so it would have to be doubled (\\).
The second way is by going under File on the menu bar and going down to Change dir. This way is best if you do not know the exact name and location of where you want to set your new working directory because it allows you to go through all of the files on your computer.  
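As a minimal sketch of the first approach, the lines below save the current working directory, switch to another folder and then switch back. The temporary folder returned by tempdir() is only a stand-in for this illustration; in practice you would use a path of your own.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()

# tempdir() is a placeholder; a real call might look like
# setwd("C:/Users/me/Documents")  (hypothetical path, note the forward slashes)
setwd(tempdir())
current_dir <- getwd()   # now points at the temporary folder

# Go back to where we started
setwd(old_dir)
```

Restoring the original directory at the end keeps the rest of your session unaffected.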
Importing Data

DATA: flights.csv

Use read.csv() to read CSV files. Set the header option to TRUE if the file has column titles and FALSE otherwise. The default for read.csv() is TRUE.
flights<-read.csv("January 2020 Flights.csv")
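To see what the header option does, the sketch below writes a tiny made-up CSV file to a temporary location and reads it both ways. The file contents and object names are invented for this illustration and are not part of the flights data.

```r
# Create a small CSV file with a title row (made-up data)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("carrier,distance", "AA,100", "DL,250"), csv_file)

# header = TRUE (the default): the first line becomes the column names
with_header <- read.csv(csv_file, header = TRUE)

# header = FALSE: the first line is treated as data and R invents names V1, V2
no_header <- read.csv(csv_file, header = FALSE)

names(with_header)
names(no_header)
```

With header = FALSE the title row is counted as an ordinary row, so the second data frame has one more row than the first.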


Exploring Data

Unless you opened the .csv file beforehand, you don’t know much about the information you just loaded into R. To find out more, use the dim() function to find the dimensions of a dataset (the number of rows and columns). To view the structure of a dataset (the type and first few values of each variable), use the str() function. The head() function displays the first six rows of the dataset and tail() displays the last six rows.
str(flights) result

head(flights) result

tail(flights) result
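The same functions can be tried on any data frame. The sketch below uses a small made-up data frame called toy (not the flights data) so that the results are easy to check by eye.

```r
# A made-up data frame with 10 rows and 2 columns
toy <- data.frame(
  carrier  = rep(c("AA", "DL"), times = 5),
  distance = seq(100, 1000, by = 100)
)

dim(toy)                 # number of rows, then number of columns
str(toy)                 # type and first values of each variable

first_rows <- head(toy)  # first six rows
last_rows  <- tail(toy)  # last six rows
head(toy, 3)             # an optional second argument changes how many rows are shown
```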


To open or, in R terminology, print the contents of a dataset or a variable in the R console, simply type the name of the dataset or the variable. A dataset has two dimensions: the first is the row number and the second is the column number. Inside the square brackets, the two are separated by a comma. For example, flights[1,2] prints the value in the first row and second column of the flights dataset, flights[1, ] prints the first row and flights[,2] prints the second column. As you may notice, we use square brackets when isolating data and round brackets when working with functions. To print a range of rows or a range of columns, give the first and last position separated by a colon.
flights[1, 1:5]
flights[1000:1005,c("originstate", "deptime", "depdelay")]
example code results
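The indexing rules above can be checked on a small made-up data frame. The object toy and its values are invented for this sketch.

```r
toy <- data.frame(
  originstate = c("Texas", "Ohio", "Hawaii", "Utah"),
  deptime     = c(700, 915, 1230, 1805),
  depdelay    = c(0, 1, 0, 1)
)

single_value <- toy[1, 2]   # row 1, column 2
first_row    <- toy[1, ]    # the whole first row (still a data frame)
second_col   <- toy[, 2]    # the whole second column (returned as a vector)

# Rows 2 to 3, with columns picked by name instead of position
block <- toy[2:3, c("originstate", "depdelay")]
```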
When we type a variable name by itself, such as originstate, R gives an error message because the variable exists only inside the dataset. To access a variable, use the dollar sign “$”. For example, flights$originstate returns the originstate variable.
Error for originstate
Variable data
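A short sketch of the difference, using a made-up data frame called toy. It assumes that no other object named originstate exists in your R session.

```r
toy <- data.frame(originstate = c("Texas", "Ohio", "Hawaii"))

# The column lives inside the data frame, so R cannot find the bare name
# (assumption: nothing else called originstate has been created)
bare_name_found <- exists("originstate")

# The dollar sign reaches into the data frame and returns the column
states <- toy$originstate
```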
Frequency Tables
The table() function can be used to make a frequency table. You will also find functions from external packages that can be used to make frequency tables. 
table(flights$depdelay, exclude=NULL)
frequency tables
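The exclude=NULL option tells table() not to drop any value, so missing values (NA) get their own count. The sketch below shows the difference on a made-up vector; prop.table() is a base R function that turns the counts into proportions.

```r
# Made-up delay indicator with one missing value
depdelay <- c(0, 1, 1, NA, 0, 1)

counts_without_na <- table(depdelay)                  # NA is dropped by default
counts_with_na    <- table(depdelay, exclude = NULL)  # NA gets its own cell

# Proportions instead of counts
shares <- prop.table(counts_without_na)
```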
You can use the CrossTable() function from the gmodels package to make a two-way frequency table. First, you need to install the gmodels package using the install.packages() function and load it using the library() function.
library(gmodels)  # install first with install.packages("gmodels") if needed
CrossTable(flights$depdelay, flights$arrdelay)
cross table
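If gmodels is not available, base R's table() also accepts two variables and gives the raw counts of a two-way table. The sketch below uses made-up vectors and only calls CrossTable() when the package is installed.

```r
# Made-up delay indicators for five flights
dep <- c(0, 0, 1, 1, 1)
arr <- c(0, 1, 0, 1, 1)

# Base R two-way frequency table: rows are dep, columns are arr
two_way <- table(depdelay = dep, arrdelay = arr)
with_totals <- addmargins(two_way)   # adds row and column totals

# The gmodels version, run only if the package is installed
if (requireNamespace("gmodels", quietly = TRUE)) {
  gmodels::CrossTable(dep, arr)
}
```

CrossTable() prints row, column and table proportions along with the counts, which is why the guide uses it for the two-way case.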


Descriptive Statistics
The summary() function gives a summary of an object. If the object is a dataset, it summarizes every variable in the dataset; if it is a single variable, it summarizes that variable. Other descriptive statistics can be obtained using the fivenum(), min(), mean(), max(), var() and quantile() functions.
descriptive statistics
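Here is a brief sketch of these functions on a made-up numeric vector, small enough that the results can be verified by hand.

```r
x <- c(2, 4, 4, 6, 9)   # made-up values

summary(x)               # minimum, quartiles, median, mean and maximum in one call

x_min   <- min(x)
x_mean  <- mean(x)
x_max   <- max(x)
x_var   <- var(x)        # sample variance (divides by n - 1)
x_five  <- fivenum(x)    # minimum, lower hinge, median, upper hinge, maximum
x_quart <- quantile(x, c(0.25, 0.75))   # first and third quartiles
```

If a variable contains missing values, most of these functions return NA unless you add na.rm = TRUE, for example mean(x, na.rm = TRUE).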
The by() function applies a function to a variable separately within each category of another variable. For example, below, we can see a summary of flight distance for each day of the week.
by(flights$distance, flights$dayofweek, summary)
by function
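The same pattern works with any summary function. The sketch below uses a made-up data frame and mean() instead of summary(), so each group is reduced to a single number.

```r
toy <- data.frame(
  dayofweek = c(1, 1, 2, 2),
  distance  = c(100, 300, 200, 400)
)

# Arguments: the variable to summarise, the grouping variable, the function to apply
dist_by_day <- by(toy$distance, toy$dayofweek, mean)
dist_by_day
```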


The following code can be used to make charts such as bar charts, pie charts, boxplots and scatterplots. These are just a few of the many data visualizations you can produce using R. Where an example has more than one line of plotting code, each line adds more information to the plot, and the image shown is produced by the last line.
Bar chart
# Bar chart
dayofweektable <- table(flights$dayofweek)
barplot(dayofweektable, main="Frequency of Flights by Day of the Week",  xlab="Day of Week", ylab="Frequency")


# Histogram
hist(flights$deptime, main="Histogram of Departure Time",  xlab="Departure Time", col="lightblue")

# Scatterplot
plot(flights$dayofweek, flights$deptime)
plot(flights$dayofweek, flights$deptime, pch=3, cex=3, col="darkred")  # pch sets the symbol, cex its size, col its colour
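Pie charts and boxplots follow the same pattern as the plots above. The sketch below uses base R's pie() and boxplot() on a made-up data frame; the titles and labels are illustrative choices, not taken from the workshop.

```r
toy <- data.frame(
  dayofweek = c(1, 1, 1, 2, 2, 3),
  deptime   = c(600, 900, 1500, 700, 2100, 1200)
)

# Pie chart: like barplot(), pie() takes a frequency table
day_counts <- table(toy$dayofweek)
pie(day_counts, main = "Share of Flights by Day of the Week")

# Boxplot: the formula deptime ~ dayofweek draws one box per day
box_stats <- boxplot(deptime ~ dayofweek, data = toy,
                     main = "Departure Time by Day of the Week",
                     xlab = "Day of Week", ylab = "Departure Time")
```

boxplot() also returns the statistics it used to draw the boxes, which can be handy if you want the numbers as well as the picture.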


New Variables

To generate a new variable that is a combination of other variables, assign the combination to a new variable name.
# Example 1
flights$distancemiles <- flights$distance*0.621
New variables example 1

# Example 2
flights$instate[flights$originstate==flights$deststate] <- 1
flights$instate[flights$originstate!=flights$deststate] <- 0
New variables example 2
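A 0/1 indicator like the one in Example 2 can also be built in one step with base R's ifelse(), which checks a condition for every row. This is an alternative sketch on made-up data, not the code used in the workshop.

```r
toy <- data.frame(
  originstate = c("Texas", "Ohio", "Hawaii"),
  deststate   = c("Texas", "Utah", "Hawaii"),
  stringsAsFactors = FALSE
)

# 1 when the flight stays within one state, 0 otherwise
toy$instate <- ifelse(toy$originstate == toy$deststate, 1, 0)
```

Both approaches leave the new variable as NA for rows where either state is missing.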

# Example 3
flights$delay <- ""
flights$delay[flights$depdelay==0 & flights$arrdelay==0] <- "not delayed"
flights$delay[flights$depdelay==1 & flights$arrdelay==0] <- "delayed at departure"
flights$delay[flights$depdelay==0 & flights$arrdelay==1] <- "delayed at arrival"
flights$delay[flights$depdelay==1 & flights$arrdelay==1] <- "delayed at both"
flights$delay <- factor(flights$delay,
                        levels = c("", "not delayed", "delayed at departure", "delayed at arrival", "delayed at both"))
New variables example 3
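Listing the levels in factor() fixes the order in which categories appear in tables and plots, and keeps categories that happen to have no observations. A small sketch with made-up values:

```r
delay <- c("delayed at both", "not delayed", "not delayed")   # made-up values

delay_f <- factor(delay,
                  levels = c("not delayed", "delayed at departure",
                             "delayed at arrival", "delayed at both"))

levels(delay_f)                  # categories in the order given above
delay_counts <- table(delay_f)   # empty categories are kept with a count of 0
```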

The save() function can be used to save a dataset as a native R .RData data file.
save(flights, file="Flights 2020.RData")
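A saved .RData file is read back with load(), which restores the object under its original name. The sketch below uses a made-up data frame and a temporary file so it can be run anywhere.

```r
toy <- data.frame(carrier = c("AA", "DL"), distance = c(100, 250))

rdata_file <- tempfile(fileext = ".RData")   # temporary file for this illustration
save(toy, file = rdata_file)

rm(toy)                            # remove the object from the session
loaded_names <- load(rdata_file)   # restores toy and returns the names it loaded
```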


Managing Data

Subsetting Data
To subset, use the subset() function.
Example 1

departure<-subset(flights, select=c(originstate, origin, deptime, depdelay, deststate))
managing data example 1

Example 2
hawaii<-subset(flights, deststate=="Hawaii" & dayofmonth==1)
managing data example 2
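The two uses of subset() can be combined: a condition chooses the rows and select chooses the columns. The sketch below uses a made-up data frame; inside subset(), column names can be written without the dataset name in front.

```r
toy <- data.frame(
  deststate  = c("Hawaii", "Ohio", "Hawaii", "Hawaii"),
  dayofmonth = c(1, 1, 2, 1),
  depdelay   = c(0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Keep rows that meet both conditions
hawaii_day1 <- subset(toy, deststate == "Hawaii" & dayofmonth == 1)

# Keep delayed flights and only two of the columns
delays_only <- subset(toy, depdelay == 1, select = c(deststate, depdelay))
```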
Merging Data
To combine two datasets, you can use the merge() function.

DATA: airlinecodes.csv

airlinecodes <- read.csv("airlinecodes.csv")
flightsmerged <- merge(airlinecodes, flights, by="carrier")
CrossTable(flightsmerged$airline, flightsmerged$diverted)
merging data
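By default merge() keeps only the rows whose key appears in both datasets. The sketch below shows this on two made-up data frames, and how all=TRUE keeps the unmatched rows as well.

```r
# Made-up lookup table and made-up trips
codes <- data.frame(carrier = c("AA", "DL", "UA"),
                    airline = c("American", "Delta", "United"))
trips <- data.frame(carrier  = c("AA", "AA", "DL", "ZZ"),
                    distance = c(100, 200, 300, 400))

# Default: only carriers found in both data frames (AA twice, DL once)
matched <- merge(codes, trips, by = "carrier")

# all = TRUE: unmatched rows are kept and filled with NA (UA has no trip, ZZ has no airline)
everything <- merge(codes, trips, by = "carrier", all = TRUE)
```

Comparing the row counts before and after a merge is a quick way to spot keys that failed to match.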


Converting to Data Frame
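A data frame is R's standard structure for a dataset. To convert output from a function to a data frame, you can use a conversion function such as as.data.frame(). The code below assumes that origin and dest are two-column data frames created this way; it renames their columns and merges them by code, with all=TRUE keeping codes that appear in only one of the two.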
names(origin)<-c("code", "origin")
names(dest)<-c("code", "dest")
airports <-merge(origin, dest, by="code", all=TRUE)
converting data
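The guide does not show how origin and dest were created. One plausible route, shown here purely as an assumption, is to convert frequency tables of the origin and destination airports with as.data.frame(). The toy data and the dest column name are invented for this sketch.

```r
toy <- data.frame(origin = c("JFK", "LAX", "JFK"),
                  dest   = c("LAX", "ORD", "ORD"),
                  stringsAsFactors = FALSE)

# table() output converted to data frames with columns Var1 and Freq
# (assumption: this mirrors how origin and dest were built in the workshop)
origin <- as.data.frame(table(toy$origin))
dest   <- as.data.frame(table(toy$dest))

names(origin) <- c("code", "origin")
names(dest)   <- c("code", "dest")

# One row per airport code, with counts as origin and as destination
airports <- merge(origin, dest, by = "code", all = TRUE)
```

Airports that appear only as an origin or only as a destination get NA in the other count.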


Exporting Data
To export a dataset to a CSV file, you can use the write.csv() function.
write.csv(airports, file="Airports.csv", row.names=FALSE)
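The row.names=FALSE option stops R from writing its row numbers as an extra first column. The sketch below writes a made-up data frame to a temporary file and reads it back to confirm the columns are unchanged.

```r
airports_toy <- data.frame(code = c("JFK", "LAX"), origin = c(2, 1))   # made-up data

csv_file <- tempfile(fileext = ".csv")   # temporary file for this illustration
write.csv(airports_toy, file = csv_file, row.names = FALSE)

file_lines <- readLines(csv_file)   # one title line plus one line per row
reloaded   <- read.csv(csv_file)
```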


[1] The flights dataset is modified from the original version on the Kaggle website.
[2] The tutorial code and the workshop Powerpoint presentation can be found here.
[3] You can enroll in the introductory R Quercus course here.
