Before we can get into the details of loading pre-existing data sets into R, we need to cover a prerequisite: pacakges.
Because R is an open-source program, it is easy to write extensions, and thousands of people have. These extensions come in the form of “packages,” and these packages contain mostly functions that do stuff with data.
For example, I’ve written an R package that does plotting. (I’ve since deferred to ggplot2.) I’ve written another package that estimates a statistical model.
Hadley Wickam has written many R packages, some of which we’ll use. He’s written haven, which we can use to read proprietary data formats, such as Stata (.dta
) files. He’s also written readr, which we can use to quickly read in better formats, such as comma-separated values (.csv
), which I’ll encourage you to use throughout the course. He’s also written a package that is helpful for creating plots called ggplot2. We see that later.
In order to use a package, it must be installed once per workspace and then loaded in each session (i.e., after each restart). Some packages come pre-installed in R (e.g., stats, MASS). Some of these pre-installed packages are automatically loaded in each session (e.g., stats), while others must be loaded manually in each session (e.g., MASS) if you want to use them. Other packages that do not come pre-installed with R need to be installed manually (but just once per workspace) and loaded in each session.
If you click the “Packages” tab in RStudio (positioned in the upper-right panel by default), it will show you a list packages that are currently installed. I’ve attached a screenshot of my installed packages below, but your’s might look slightly different–I’ve probably installed a lot more than you.
If you want to use a function from a package and it is not installed, you must first install it.
You only need to install a package once per workspace. Once you’ve installed it, you have it in that workspace forever.
Review Exercises
In order to use a function from a package, though, the package must be installed and loaded. In the packages list, the check box (beside the package name) indicates whether the package is loaded. In the screenshot above, you can see that I only had one package loaded at the time I took the screenshot–blme, a package useful for estimating Bayesian linear mixed-effects models.
Let’s see how to access functions that are in packages. First, let’s create some data to plot.
x <- c(1, 3, 2, 5, 4)
y <- c(2, 1, 4, 3, 5)
Now let’s try to use the qplot()
function in the ggplot2 package to create a scatterplot of x
and y
. I’m assuming that you’ve already installed the ggplot2 package (it’s included in the tidyverse package).
Now, let’s try to use qplot()
.
qplot(x, y)
## Error in qplot(x, y): could not find function "qplot"
You’ll notice that the qplot()
function cannot be found. That’s because while ggplot2 is installed (i.e., present in the library), it is not loaded (i.e., off the shelf). If you want to use a function from the ggplot2 package, you need to load the package using the library()
function. You need to load the package each time you start a new session (e.g., restart RStudio), so be sure to include this in your script. Notice that you do not need to use quotes around the name of the package when using the library()
function.
library(ggplot2)
qplot(x, y)
When we use library()
, it loads the entire set of functions in the package into our workspace, making them all accessible to us. There are hundreds of functions in the ggplot2 package. If we prefer to avoid loading all the functions, we can use the syntax package::function()
to tell R where to find the function in the library without loading all the functions into the workspace.
ggplot2::qplot(x, y)
ggplot2::qplot(x, y)
will work whether or not ggplot2 is loaded.
I don’t have strong feelings about which approach is better–it depends on the context. If you only need to use one function from package one time, then perhaps it makes more sense to use the package::function()
approach. If you’ll be using many functions many times, then it makes sense to use library()
. Use whichever makes most sense to you. However, I’ll tend to use package::function()
a lot, because it makes it clear where the function is coming from. For example, if I write readr::read_csv()
, you know that I’m referring to the read_csv()
function in the readr
package.
Review Exercises
library()
function to load the tidyverse, haven, and readxl packages.Rather than manually entering data using c()
or something else, we’ll want to load data in stored in a data file. For this class, these will usually be one of three types:
.rds
files. This is the easiest format because it stores factors as factors and all the related information. Read with readr::read_rds()
. When you load the tidyverse package, it automatically loads readr as well..csv
files. This is a common data format. However, it stores factors as character strings, so the data file does not contain the information about factors. Read with readr::read_csv()
..dta
files. Another common data format because many political scientists use Stata. Read these files into R with haven::read_dta()
..xlsx
files. Read these files into R with readxl::read_excel()
..csv
)Data can be stored in a wide range of formats. One popular format, for example, is Stata’s proprietary .dta
format. I typically use (and encourage you to use) the comma-separated values .csv
format. The .csv
format is excellent because it is open and simple. This means that anyone can use it without acess to proprietary software. It will also be useble by anyone into the foreseeable future. We can see why .csv
files are easy to work with if we open it up the file nominate.csv
with a text editor. You’ll see that you can read the file directly. You don’t really need software, you can do it with your eyes!
I tried the same thing for a similar .dta
file. You can see that it looks like nonsense. You’ll definitely need Stata (or other speciallized software) to work with this file.
Also, .csv
files are easy to support, so they work in almost all data analysis software. For example, we can open up nominate.csv
in Excel. You can see that we have six variables in the columns and many cases in the rows (we don’t know how many because they overflow the screen). In this case, each row represents a particular Congressperson from a particular Congress (and Presidents are also included). The second row, for example, is for Rep. Callahan (R) from the 1st Congressional District of Alabama. During the 100th Congress, Rep. Calahan has a ideology score of 0.358, which means he’s conservative, but not as conservative as Pres. Reagan, who has a score of 0.747. We’ll work with these data a lot thoughout the semester, so we’ll have plenty of time for closer examination.
First, you must get your data file into the project directory. First, download the data from the data page. Notice where your computer saves the file. Since you are using the cloud, then select the Files tab in the lower-right panel of the Rstudio window. Then click Upload > Choose file. Navigate to and select the data file. Click Choose. Click Okay. Notice that the data file appears in your list of files.
Now we simply need to run readr::read_csv("cool-data.csv")
to load the data set (cool-data.csv
is just a placeholder for the actual name of the data set).
Review Exercises
To practice, download the data set nominate.rds
from the data page. Open RStudio and set the working directory to your class folder. Read those data into R. Now try it with the other file formats. See the code below for examples. Note that these datasets contain the same information, so you only need to choose one of the four formats. I show you how to load all four formats here for examples.
# Before this will work, upload the data files.
# load packages (install if needed)
library(tidyverse) # for read_csv() and read_rds() functions
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(haven) # for read_dta() function
library(readxl) # for read_excel() function
# read data
nominate <- read_csv("nominate.csv") # read csv file
## Rows: 7080 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): chamber, state, party, name
## dbl (3): congress, district, ideology
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nominate2 <- read_dta("nominate.dta") # read dta file
nominate3 <- read_rds("nominate.rds") # read rds file
nominate4 <- read_xlsx("nominate.xlsx") # read xlsx file
# check that data read properly
glimpse(nominate) # quick look at the data
## Rows: 7,080
## Columns: 7
## $ congress <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 1…
## $ chamber <chr> "House", "House", "House", "House", "House", "House", "House"…
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AK", "AZ", "AZ", "…
## $ district <dbl> 2, 4, 3, 5, 6, 1, 7, 1, 2, 3, 5, 4, 1, 3, 1, 4, 2, 36, 10, 29…
## $ party <chr> "Republican", "Democrat", "Democrat", "Democrat", "Democrat",…
## $ name <chr> "DICKINSON, William Louis", "BEVILL, Tom", "NICHOLS, William …
## $ ideology <dbl> 0.398, -0.213, -0.042, -0.175, -0.060, 0.373, -0.085, 0.279, …
Almost the statistical computation we do in this class revolves around datasets. In R, it usually makes sense to store data sets as specific objects known as data frames. Data frames are simply a collection of vectors that all contain the same number of elements. These might be numeric, character, factor, or logical vectors, or some mixture of types.
When you read a data set into R using readr::read_csv
, readr::read_rds()
, haven::read_dta()
, or some other method, it creates a data frame. A data frame is a special R object that holds a set of vectors that all have the name number of elements. If you think of the dataset as an Excel spreadsheet, then you can think of the columns of the spreadsheet as the vectors held by the data frame. These vectors (or “variables”) can be numeric, character, factor, or logical. As a reminder, here are the variable types:
numeric
: numbers, such as 1.1, 2.4, and 3.4. Sometimes numeric variables are subdivided into integer
(whole numbers, e.g., 1, 2, 3, etc.) and double
(fractions, e.g., 1.47, 3.35462, etc.).character
: text strings, such as "Republican"
or "Argentina (2001)"
.factor
: cateogories, such as "Very Liberal"
, "Weak Republican"
, or "Female"
. Similar to character
, except the entire set of possible levels is defined. A factor
variable may be ordered or unordered.logical
: true or false, such as TRUE
or FALSE
.For the .csv
files we will sometimes use, R cannot distinguish between character
and factor
variables. By default, readr::read_csv()
will load these as character
variables–there’s no way for R to know the entire set of levels from the .csv
file anyway. Sometimes, though, it will be useful to work with factor
variables. This is straightforward to change. To save you a bit of frustration, we will usually use the .rds
format, where the factors are already set up properly.
A data frame holds the variables, but it also hides the vectors. For example, the data frame nominate
, which we loaded above, has a numeric variable ideology
, but if we try to sum it, we get an error.
sum(ideology) # fails because the variable ideology is hidden in a data frame
## Error in eval(expr, envir, enclos): object 'ideology' not found
We’ve loaded the data set, but R can’t seem to find the variable. That’s because the variable ideology
is hidden in the data frame nominate
.
In order to access variables in data frames, we need to do one of two things.
$
operator.data
argument.Some functions, such as exp()
are designed to work with vectors, not data frames. This will be the case for most functions we use (with the notable exceptions of plotting in with ggplot()
and estimating liner model with lm()
). To use the functions on variables stored in data frames, we need to use the $
operator.
Suppose we have a data set loaded and given to the object my_data
. If my_data
contains the variable of interest my_variable
, then we can access my_variable
using the syntax my_data$my_varible
. That is, the syntax data$var
means “get the variable var
from the data set data
.” We’ll use this often, so make sure it’s clear.
sum(nominate$ideology) # example of the $ operator
## [1] NA
sum(nominate$ideology, na.rm = TRUE) # example of the $ operator
## [1] 164.457
But some functions are designed to work with data frames. For example, the qplot()
function in the ggplot2 package is designed to work with data sets. If you open the help file for qplot()
(i.e., help(qplot)
after library(ggplot2)
), you’ll see that one of the arguments is data
. If you use this argument to point qplot()
to the data frame, it will know where to find your variables.
# load ggplot2 package, which contains the qplot() function
library(ggplot2)
# example of a function with a data argument
qplot(ideology, data = nominate) # using the data argument
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1 rows containing non-finite values (stat_bin).
Many of the functions we use take a data argument. If they do not, though, we’ll need to use the $
operator. Because we’ll almost always use data stored in data frames, you need to be sure to use one approach or the other. If the function has a data argument, use it. In other cases, use the $
operator.
To load a data set in R, you must…
read_csv("nominate.csv")
). Be sure to load any necessary packages. Be sure to assigned the data to a well-named object (i.e., using <-
) so you can use it later.Review Exercises
glimpse()
(part of tidyverse) to get a quick look at each data set. Do the data sets seem identical?mean()
, sum()
, etc.) on the data frames or variables in the data frames. Using #
comments, explain what each function is doing, if you can. Just experiment a bit.