Chapter 10 R in the wild
10.1 Today’s lab as a reference
Unlike the previous chapters, I am not going to ask you to start a practice document and follow along. Just read the chapter and know that this is a resource that you have available as your R skills grow and you need more advanced guidance.
10.2 Importing data into R
10.2.1 Importing CSV files
For this workbook, you have been able to access our three dataframes (world, states2010, and anes2020) as .Rda files. In other words, these files were already formatted for use with R. However, if you keep using R, you will regularly run into data that has not been formatted for use with R.
For example, when I wanted to generate the ANES dataframe, I went to the ANES website: https://electionstudies.org/data-center/2020-time-series-study/. I had to create a login to download data, and once I did that, I was taken to a page that offered the data in a few different formats. I chose the CSV version, which is often a good choice. “CSV” stands for comma separated values, and it is a simple form of data storage that many different software packages can understand.
I next created a folder called anes2020, and in that folder, I made three new folders: one called rscripts, one called data, and one called notes (where I keep my codebooks and other reference materials related to that data). I then moved the csv file that I had downloaded (which downloaded as anes_timeseries_2020_csv_20210719.csv) into the data folder.
After that, I made a new R project which I set to be located in that anes2020 folder. Next, I imported the data with this command:
anes2020<-read.csv("data/anes_timeseries_2020_csv_20210719.csv")
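If you want to try read.csv() without downloading the ANES file, here is a minimal sketch. It writes a tiny made-up CSV file to a temporary location and reads it back in; the column names and values are invented for illustration and are not from ANES.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# (The column names and values here are made up for illustration.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,state",
             "1,34,TX",
             "2,-9,CA",
             "3,51,NY"),
           csv_path)

# Read it back in the same way the ANES file was read above.
toy <- read.csv(csv_path)

dim(toy)    # number of rows, then number of columns
names(toy)  # the column names, taken from the first line of the file
str(toy)    # the type that R assigned to each column
```

Checking dim(), names(), and str() right after an import is a quick way to confirm that R read the file the way you expected.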
That dataset is huge: 1,771 variables! I read over the codebook (which I had also downloaded from the same webpage) and chose the variables that I wanted to keep for this class. I used this command to make a simplified version of that dataset (you need to have the tidyverse package loaded to use this command).
anes2020.simplified <- anes2020 %>%
  select(V201005, V201103, V201115, V201117, V201151, V201152,
         V201156, V201157, V201200, V201217, V201225x, V201231x,
         V201233, V201234, V201237, V201262, V201324, V201336,
         V201337, V201345x, V201351, V201356x, V201366, V201368,
         V201369, V201377, V201392x, V201401, V201405x, V201414x,
         V201433, V201434, V201435, V201453, V201507x, V201508,
         V201510, V201516, V201544, V201549x, V201567, V201600,
         V201601, V201617x, V201628, V201644, V202159, V202160,
         V202161, V202162, V202163, V202164, V202165, V202166,
         V202167, V202168, V202169, V202170, V202171, V202172,
         V202173, V202174, V202175, V202178, V202181, V202182,
         V202183, V202187, V202073)
That command looks like a lot, but all it is doing is telling R to make a new dataframe called anes2020.simplified that keeps only the variables in that list (instead of all 1,771 variables in the original dataframe).
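The same idea can be tried on a small scale. The sketch below uses a made-up five-column dataframe (toy, extra1, and extra2 are hypothetical names) and stores the names to keep in a vector, which is one way to keep a long select() call readable. It assumes you have the dplyr package installed, which comes with the tidyverse.

```r
library(dplyr)

# A toy dataframe standing in for the full ANES file (made-up values).
toy <- data.frame(V201005 = c(1, 2, 3),
                  V201103 = c(1, 1, 2),
                  V201115 = c(4, 5, 3),
                  extra1  = c("a", "b", "c"),
                  extra2  = c(TRUE, FALSE, TRUE))

# Storing the names in a character vector keeps the select() call short.
keep <- c("V201005", "V201103", "V201115")

toy.simplified <- toy %>%
  select(all_of(keep))

ncol(toy)             # 5 columns before
ncol(toy.simplified)  # 3 columns after
```

all_of() tells select() to treat keep as a list of column names, and it will stop with an error if one of those names is missing from the dataframe.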
10.2.2 Importing Excel files
Sometimes you will find an Excel file that you want to import. There is a very useful package that you can install to help with that called “readxl.” Once you have installed it, you can use the “read_excel” command to bring Excel files into R.
After I downloaded the Excel version of the “DEMOCRACY CROSS-NATIONAL DATA, RELEASE 4.0 FALL 2015” from Dr. Pippa Norris’s webpage, https://www.pippanorris.com/data, I read it into R with this command:
world <- read_excel("data/Democracy Cross-National Data V4.1 09092015.xlsx")
#remember, this command only works if you have installed and called
#up the readxl library
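If you would like to practice before you have an Excel file of your own, the readxl package ships with a few small demo workbooks. Here is a minimal sketch that uses one of them; it assumes readxl is installed, and the object name iris_xl is just something I made up for this example.

```r
library(readxl)

# readxl_example() returns the path to a small demo workbook that ships
# with the readxl package, so no download is needed.
path <- readxl_example("datasets.xlsx")

excel_sheets(path)   # list the sheet names in the workbook

# Read one sheet by name; by default read_excel() reads the first sheet.
iris_xl <- read_excel(path, sheet = "iris")
head(iris_xl)
```

Excel files often have several sheets, so it is worth running excel_sheets() first to see which one holds the data that you want.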
10.2.3 Importing other files
When dealing with data that is not formatted as an .Rda, Excel, or CSV file, the package “foreign” is often helpful. Here is an example of how I have used the read.spss command from the package “foreign” to read in an SPSS-generated .sav file:
df <- read.spss("data/previous semester data.sav", use.value.labels=TRUE, to.data.frame=TRUE)
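The “foreign” package also includes a small SPSS file that you can use for practice. The sketch below reads that file; the object names sav_path and electric are my own choices for this example, and the argument use.value.labels is spelled out in full.

```r
library(foreign)

# electric.sav is a small SPSS file that ships with the foreign package.
sav_path <- system.file("files", "electric.sav", package = "foreign")

# to.data.frame = TRUE returns a dataframe rather than a list;
# use.value.labels = TRUE turns labelled SPSS codes into R factors.
electric <- read.spss(sav_path,
                      use.value.labels = TRUE,
                      to.data.frame = TRUE)

class(electric)
head(names(electric))
```

If you leave out to.data.frame=TRUE, read.spss returns a list, which is much less convenient to work with than a dataframe.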
10.3 Cleaning data in R
Once you have imported data, there are a number of things that you will want to do before you begin your analysis. It always helps to glimpse() your new dataframe and look at the codebook, if you can find one. When I was first looking at the ANES data for this lab workbook, I noticed that many of the variables had names that were difficult to interpret, like this: V201005. I renamed them with this tidyverse command (note that the real command was longer because it included all the variables in the dataframe):
anes2020.simplified <- anes2020.simplified %>%
  rename(attention = V201005,
         v2016 = V201103,
         hope = V201115,
         outrage = V201117,
         ft_biden = V201151,
         ft_trump = V201152,
         ft_dems = V201156,
         ft_reps = V201157)
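Here is a small version of the same renaming step that you can run on its own. The dataframe toy and its values are made up; only the variable names come from the command above. It assumes the dplyr package is installed.

```r
library(dplyr)

# A made-up dataframe that uses three of the original ANES names.
toy <- data.frame(V201005 = c(1, 2, 3),
                  V201151 = c(85, 40, 60),
                  V201152 = c(15, 70, 50))

# rename() takes new_name = old_name pairs.
toy <- toy %>%
  rename(attention = V201005,
         ft_biden  = V201151,
         ft_trump  = V201152)

glimpse(toy)   # shows each column's name, type, and first few values
```

The order inside rename() is easy to reverse by accident: the new name goes on the left of the equals sign and the old name goes on the right.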
You might find that there are ordinal or nominal variables coded as numbers that you want to recode as factors. You can do that with the as.factor command, like this:
df$variable<-as.factor(df$variable)
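To see what as.factor does, here is a minimal sketch with a made-up variable. The dataframe df, the column party, and the three labels are all hypothetical and are not taken from any of our datasets.

```r
# Made-up example: party identification stored as the numbers 1, 2, 3.
df <- data.frame(party = c(1, 3, 2, 1, 3))

# as.factor() keeps the numbers themselves as the category names.
df$party_f <- as.factor(df$party)
levels(df$party_f)

# factor() with levels and labels attaches readable names instead.
df$party_named <- factor(df$party,
                         levels = c(1, 2, 3),
                         labels = c("Democrat", "Independent", "Republican"))
table(df$party_named)
```

When the codebook tells you what each number means, factor() with labels saves you from having to look the codes up again every time you make a table or a graph.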
You can also use the codebook and/or the table() command to figure out how data is coded. Often missing data is coded as 8, 9, 99, or negative numbers. You want to make sure to recode those values as NAs, or R will treat them as real numbers and mess up your analysis. I used the tidyverse command mutate(na_if()) to do that. Here is how I accomplished that with the age variable in ANES:
anes2020.simplified<-anes2020.simplified %>%
mutate(age=na_if(age, -9)) #-9 is unquoted because age is stored as a number; recent versions of dplyr require the types to match
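To see why this matters, here is a small sketch in base R with six made-up ages, two of which use the missing-data code -9. It shows what happens to the mean before and after the code is replaced with NA.

```r
# Made-up ages in which -9 stands for "refused to answer".
age <- c(34, -9, 51, 27, -9, 68)

table(age)   # the -9 values show up as if they were real ages
mean(age)    # 27: the mean is dragged down by the two -9 codes

# Base R alternative to na_if(): overwrite the code with NA.
age[age == -9] <- NA
mean(age, na.rm = TRUE)   # 45: the mean of the four real ages
```

Nothing in the first mean() call warns you that anything is wrong, which is why it is worth running table() on a new variable before you analyze it.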
And here is how I recoded the marital variable to turn its missing-data codes into NAs and to replace the numeric codes with labels that make sense:
anes2020.simplified<-anes2020.simplified %>%
mutate(marital=na_if(marital, -9)) %>%
mutate(marital=na_if(marital, -8)) %>%
mutate(marital=recode(marital, '1'="married",'2'="married",'3'="widowed",
'4'="divorced",'5'="separated",'6'="never married"))
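Here is a version of that recoding that you can run on a made-up dataframe. It is a sketch and not the exact command that I used: it assumes the dplyr package is installed, it uses %in% with if_else() to handle both missing-data codes in one step, and the values in toy are invented.

```r
library(dplyr)

# Made-up marital status codes, including two missing-data codes.
toy <- data.frame(marital = c(1, 2, -9, 4, 6, -8, 3))

toy <- toy %>%
  # %in% handles several missing-data codes in one step.
  mutate(marital = if_else(marital %in% c(-9, -8), NA_real_, marital)) %>%
  # Replace each remaining numeric code with a readable label.
  mutate(marital = recode(marital,
                          `1` = "married", `2` = "married", `3` = "widowed",
                          `4` = "divorced", `5` = "separated",
                          `6` = "never married"))

table(toy$marital, useNA = "ifany")
```

Adding useNA = "ifany" to table() makes R count the NAs too, so you can confirm that the missing-data codes ended up where you wanted them.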
10.4 Online R resources
There are excellent free resources for using R online. Here are some of my favorites:
https://stackoverflow.com/ This is a community of people that ask and answer questions about lots of different software packages including R. You can search all of the questions for free as a non-member, or you can join (also free) and ask and answer questions. I have figured out how to do many things in R thanks to the community at this site.
https://bookdown.org/ndphillips/YaRrr/ This is a very fun, free, online introduction to R, called YaRrr! The Pirate’s Guide to R, by Nathaniel D. Phillips. Actually, Dr. Phillips claims that he discovered the book and translated it from “pirate-speak” into English.
https://suzanbaert.netlify.app/2018/01/dplyr-tutorial-1/ This is part one of a four-part series of blog posts on “Data Wrangling,” by Suzan Baert. All four parts are very helpful in learning to manage data in R.
https://r-graphics.org/ This is the R Graphics Cookbook, by Winston Chang. It is an extremely useful resource when using ggplot. It shows you how to make many, many kinds of graphs, and also how to modify the graphs that you have already made.
https://www.apreshill.com/project/ohsu-dataviz/ This is a really nicely done, free, asynchronous online class called “Principles & Practice of Data Visualization” taught by Alison Hill. This class teaches you about data visualization in R with ggplot.
https://r4ds.had.co.nz/ This textbook, R for Data Science, by Hadley Wickham and Garrett Grolemund is a terrific resource, available for free online.
https://isabella-b.com/blog/ggplot2-theme-elements-reference/ This is a really helpful guide to theme elements in ggplot2 by Isabella Benabaye.
https://henrywang.nl/ggplot2-theme-elements-demonstration/ This is a nice ggplot2 theme elements demonstration by Henry Wang.
http://mattwaite.github.io/sports/ This textbook, Sports Data Analysis and Visualization by Matt Waite is a great resource for using R to analyze sports.
https://bookdown.org/yihui/bookdown/ I used this textbook, bookdown: Authoring Books and Technical Documents with R Markdown, by Yihui Xie, to learn how to put this workbook online.
10.5 The last command: how to get help
As you continue with R, you will often find yourself in a situation where you want to know what a command is for, or the specifics of how to use it. When you are in this situation, you can type a “?” followed by that command, like this:
?table
R will pull up documentation on the command. R’s documentation is sometimes hard to read (and reading it takes practice) but it can be a good place to start, before looking at the online resources in the previous section.
If you are not sure of a command’s exact name, or the command is from a package that you have installed but have not loaded with library(), you can type two question marks. R will then search the documentation of every package installed on your computer, like this:
??cut2
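The question mark has a few relatives that are also worth knowing. The sketch below uses only commands that come with R itself; the search phrase "cross tabulation" is just an example that I picked.

```r
# help() is the long form of the question mark.
help("table")

# ?? is shorthand for help.search(), which looks through the help pages
# of the packages installed on your computer, loaded or not.
help.search("cross tabulation")

# apropos() lists objects on the search path whose names match a pattern.
apropos("^read\\.csv")

# example() runs the examples at the bottom of a help page.
example("mean")

# args() shows a function's arguments without opening the help page.
args(read.csv)
```

If you know which package a command comes from, you can also ask for its help page directly, in the form ?package::command.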
10.6 The Strausz method for improving in R
After taking this class and working through this workbook, you may decide that you want to keep using and improving at R for the reasons that I laid out in section 1.1: because R is free, powerful, professional-grade software and skills in R will really impress potential employers (and your friends and family!). Below is the method that I used to learn R.
Every day, I set a timer for 15 minutes, and spent that time doing something in R. I began by working through Nathaniel D. Phillips’ Pirate’s Guide, and then I worked my way through the other resources in section 10.4. I also spent that time figuring out how to use R to solve problems in my own research; I made a vow to myself that from now on, I would do all analysis for my own research in R.
You can also use that 15 minutes to do anything in R. Maybe you want to use R to analyze your favorite sport. Maybe you want to use R to learn about knitting patterns, or to keep track of data regarding your other hobbies or interests. There is even a package available called sourrr that is useful when generating recipes for sourdough bread (although I have found my favorite sourdough bread recipes in these books). Maybe you want to use R to gather and analyze data relating to games that you play or debates that you have with your friends. Or, maybe you want to use R to help you with research projects for other classes. Just choose something to do in those 15 minutes, and do it. If you do that for a while, you will surprise yourself with how much R you can do.