Chapter 5 Making controlled comparisons
5.1 Getting started with this chapter
To get started in today’s chapter, open the project that you made in lab 1. If you forgot how to do this, see the instructions in section 2.2.
Now, open a new script file and save it in your scripts folder as “chapter 5 practice.” Copy and paste this onto the page (updating the text so that it is about you):
####################################
# Your name
# 20093 Chapter 5, Practice exercises
# Date started : Date last modified
####################################
#libraries------------------------------------------
library(tidyverse)
library(Hmisc) #the cut2 command helps us simplify interval variables
library(scales) #this is useful when the labels on our axes overlap
library(tigerstats) #colPerc can also be useful with crosstabs
library(flextable) #this makes tables that we can easily export into a word processor
Now select all the text on this page, run it, and save it.
5.2 Representing crosstabs with stacked bar graphs
In chapter 4 we learned how to generate crosstabs (and, crucially for interpretation, crosstabs with percents in columns). Crosstabs themselves have a lot of numbers on them, and can thus be difficult to interpret on their own. Many social scientists address this problem by generating graphs. A “stacked bar graph” is a really nice tool for visually depicting the relationships in a crosstab. The code to generate stacked bar graphs is a little bit long and complex, so feel free to copy and paste the code below, or the code from the “Strausz ggplot2 templates” file in the “rscripts” folder that was in the zipped file that you should have downloaded in section 1.3. Let’s return to the relationship that we were looking at in section 4.5, between education (our independent variable) and income (our dependent variable).
Here is the code to generate a stacked bar graph of this relationship:
#first, generate a dataframe called "plotting.data" based on grouped variables from #anes2020
plotting.data<-anes2020 %>%
filter(!is.na(edu)&!is.na(income)) %>%
group_by(edu, income) %>%
summarise(n=n()) %>%
mutate(freq = n / sum(n))
#second, make the graph
ggplot(plotting.data, aes(x = edu, y = freq, fill=income)) +
geom_col(position = "fill") +
scale_y_continuous(labels=percent)+
scale_x_discrete(guide = guide_axis(n.dodge=2))+ #this line is in here because without it the x-axis labels overlap. You can remove it if that isn’t a problem
scale_fill_viridis_d(name="income")+ #this color palette is made to be readable for colorblind people
ggtitle("Education and income in the 2020 ANES")+
xlab("Education")+
ylab(NULL)
One thing to note in this code: it begins by creating a new dataframe called “plotting.data.” When you run it, that dataframe will show up in your RStudio environment. If you run it again with new variables, the new “plotting.data” that you make will overwrite the old one.
If having that “plotting.data” object in your environment bothers you, you can always remove it with the command:
rm(plotting.data)
Another thing to note: the “scale_fill_viridis_d()+” line of code asks R to use a color palette that is easy to read for people who are colorblind. There are many, many choices of color palette in R, and you are welcome to play around with colors, but it is also fine to just stick to this palette. If you’d rather have a black and white graph, there is code for that in the “Strausz ggplot2 templates” file, which is in the “rscripts” folder that was in the zipped file that you should have downloaded in section 1.3.
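If you would rather try a black and white version right away, one simple option (a sketch using ggplot2’s built-in grey palette, not the template code itself) is to swap out the scale_fill_viridis_d() line:
#replace the scale_fill_viridis_d() line with this for a greyscale fill
scale_fill_grey(name="income")+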
Looking at the graph, we can see that our X-axis is our independent variable, education, and our Y-axis is the column percentages of our dependent variable, income. Focusing on the yellow region, we can clearly see that each increase in education seems to correspond to an increase in the percent of people in the highest income earning category (with the possible exception of the movement between some college and associate degree). Focusing on the dark blue region, we can clearly see that, as level of education increases, the percent of people in the lowest income category decreases. In short, this graph has all of the same information that we got when looking at the crosstab that we generated with:
colPerc(xtabs(~income+edu, data=anes2020))
from our previous lab, but the graph makes it much easier to take in a lot of that information at once.
5.3 Representing mean comparison tables with boxplots
In addition to the histogram that we introduced in chapter 3, another useful way of representing the distribution of a single interval variable is with a boxplot. Here is the code for a boxplot of the evangelical_pop variable from the states2010 dataframe (you can also find this code in the “Strausz ggplot2 templates” file, which is in the “rscripts” folder that was in the zipped file that you should have downloaded in section 1.3):
#make a boxplot of evangelicals
ggplot(states2010, aes(y = evangelical_pop)) +
geom_boxplot(color="black", fill="purple", alpha=0.2)+
scale_x_continuous(breaks = NULL) +
ggtitle("Religion in US states")+
xlab(NULL)+
ylab("Percent of state residents that are evangelical")
A boxplot gives us some interesting information about an interval variable. The bottom of the box is the 1st quartile (25% of cases have that value or lower) and the top of the box represents the 3rd quartile (75% of cases have that value or lower). The horizontal line inside the box represents the median value, and the vertical lines (the “whiskers”) coming out of the box extend to the most extreme cases that fall within 1.5 times the interquartile range (the distance between the 1st and 3rd quartiles) of the box. Cases beyond the whiskers are outliers, and are identified by points on the graph. The one outlier in this graph is Utah, which I figured out using this code:
states2010 %>%
select(state, evangelical_pop) %>%
arrange(desc(evangelical_pop))
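If you would like to see the numbers behind the boxplot rather than eyeballing them, one option is to compute the quartiles and interquartile range directly (a small sketch using the base R quantile() and IQR() functions):
#quartiles and interquartile range of evangelical_pop
quantile(states2010$evangelical_pop, probs=c(0.25, 0.5, 0.75), na.rm=TRUE)
IQR(states2010$evangelical_pop, na.rm=TRUE)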
Here is a labeled version of a boxplot for your reference:
So, in other words, we can learn a lot about a variable from a single graph!
Boxplots are also a useful way to look at the properties of an interval variable across several values of a nominal or ordinal variable, which is what we were doing with our mean comparison table in chapter 4. Let’s take a graphical look at the relationship between the region that a state is in and the percent of evangelicals that live in that state. We can set up that kind of graph with the following code:
#make boxplots of regions and evangelicals
ggplot(states2010, aes(x = region, y = evangelical_pop)) +
geom_boxplot(color="black", fill="purple", alpha=0.2)+
ggtitle("Region and religion in US states")+
xlab("Region")+
ylab("Percent of state residents that are evangelical")
This graph shows us a lot of cool things. We can see that the Southern states with the smallest percentages of evangelicals appear to have roughly the same percentage as the highest outlier in the Northeast. We can also see that, despite the existence of an extreme outlier (which we learned above is Utah), the states in the West still seem to have a smaller percentage of evangelicals than states in the South.
5.4 Adding a control to a crosstab
So far, we have learned to look at the relationship between pairs of variables. However, in real social scientific analysis we will often be interested in more than that. We will be interested in other factors that could be the true cause of variation in our dependent variable. If we look at the relationship between our independent and dependent variable at different values of a control variable and there no longer seems to be a relationship, then we can say that the correlation that we observe between our independent and dependent variable is spurious. If the relationship between the independent and dependent variable is roughly the same at different values of a control, we call the relationship between the independent and dependent variables, controlling for the third, additive. If the relationship between the independent and dependent variables is different at different values of a control, we call that relationship an interaction.
Let’s return to the relationship that we looked at earlier, between education and income. Here is the crosstab that we generated in chapter 4:
## edu
## income less than high school high school some college
## under $24,999 46.51 33.09 24.76
## $25,000-64,999 35.47 35.76 36.13
## $65,000-$109,999 11.63 20.63 21.97
## $110,000 or more 6.40 10.52 17.14
## Total 100.00 100.00 100.00
## edu
## income associate degree ba ma md, phd, or jd
## under $24,999 20.21 11.55 8.77 8.16
## $25,000-64,999 35.37 25.61 17.35 13.16
## $65,000-$109,999 26.92 28.75 30.05 25.26
## $110,000 or more 17.49 34.08 43.84 53.42
## Total 100.00 100.00 100.00 100.00
Let’s look at the row representing the people who make $110,000 or more per year. As we move from the lowest level of education to the highest, we can see that the percent of people in the highest income category moves from 6.40% to 53.42%. We can actually subtract 6.40 from 53.42 to say that the effect size is roughly 47 percentage points. In other words, the share of people in the highest income category is about 47 percentage points larger among people with the highest level of education than among people with the lowest level of education.
Do we think that this result would be roughly the same for men and women? To answer this question, we can use a neat tool that we have actually played with a bit in previous chapters: the filter() command that is a part of the dplyr package (and thus works with pipes). The ANES dataset has a variable called sex, where 0 means “men” and 1 means “women.” Let’s have R make that crosstab again, but only for men, and then again, but only for women. To do that, we can combine the colPerc() and xtabs() code from chapter 4 with filter(), along these lines:
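#column percentages for income by education, men only (sex==0)
colPerc(xtabs(~income+edu, data=anes2020 %>% filter(sex==0)))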
## edu
## income less than high school high school some college
## under $24,999 38.69 26.56 20.31
## $25,000-64,999 36.31 37.54 34.27
## $65,000-$109,999 16.67 22.95 23.13
## $110,000 or more 8.33 12.95 22.28
## Total 100.00 100.00 100.00
## edu
## income associate degree ba ma md, phd, or jd
## under $24,999 13.65 11.30 8.00 6.93
## $25,000-64,999 34.99 24.76 15.16 12.38
## $65,000-$109,999 32.01 27.45 28.42 23.76
## $110,000 or more 19.35 36.49 48.42 56.93
## Total 100.00 100.00 100.00 100.00
A quick note: if we wanted to use filter with a variable whose values are defined by text, such as our income variable, we must put the text in quotation marks. Here, we do not have to use quotation marks because sex is defined numerically.
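For example, a filter that keeps only respondents in the lowest income category would look something like this (just an illustration; the text has to match the value exactly as it appears in the data):
#text values need quotation marks inside filter()
anes2020 %>%
filter(income=="under $24,999")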
If we do the same comparison that we did above, we can see that as men move from the lowest to the highest education category, the percent in the highest income category increases from 8.33% to 56.93%. The total effect size is about 48.6 percentage points, which is a bit higher than the roughly 47-point overall effect size that we witnessed above.
To save this crosstab as a Word file that we can easily pull into various word processors, we can use this code, based on what we did in section 4.5.4:
income.edu.men<-colPerc(xtabs(~income+edu, data=anes2020 %>% filter(sex==0)))
income.edu.men<-as.data.frame(income.edu.men)%>%rownames_to_column("income")
flextable(income.edu.men) %>% save_as_docx(path="incomeEduMen.docx")
Now let’s look at women only, filtering for sex==1 instead:
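#column percentages for income by education, women only (sex==1)
colPerc(xtabs(~income+edu, data=anes2020 %>% filter(sex==1)))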
## edu
## income less than high school high school some college
## under $24,999 53.98 39.42 28.32
## $25,000-64,999 34.66 33.97 37.69
## $65,000-$109,999 6.82 18.43 21.04
## $110,000 or more 4.55 8.17 12.95
## Total 100.00 100.00 100.00
## edu
## income associate degree ba ma md, phd, or jd
## under $24,999 24.52 11.71 9.35 9.60
## $25,000-64,999 35.58 26.37 19.03 14.12
## $65,000-$109,999 23.56 30.04 31.29 27.12
## $110,000 or more 16.35 31.87 40.32 49.15
## Total 100.00 100.00 100.00 100.00
Applying the same analysis that we did above, we can see that as women’s education level increases from the lowest to the highest level, the percent of people in the highest income bracket moves from 4.55% to 49.15%: a total effect size of about 44.6 percentage points (compared with about 48.6 points for men). In other words, this data suggests that both men and women are economically rewarded for more education, but that men seem to be rewarded a bit more. If we were to write up these results, at this point we would have to use our own judgment to decide whether the effect size differs enough to call the relationship between education and income, controlling for gender, an interaction, or whether it makes more sense to call it additive. We can use more advanced statistical tools to make that determination with more rigor, but for now, we will rely on our judgment.
We can also save this table as a Word document, using the following code:
income.edu.women<-colPerc(xtabs(~income+edu, data=anes2020 %>% filter(sex==1)))
income.edu.women<-as.data.frame(income.edu.women)%>%rownames_to_column("income")
flextable(income.edu.women) %>% save_as_docx(path="incomeEduWomen.docx")
There is not a single tool to graph relationships among three ordinal or nominal variables in an easy-to-interpret way. Instead, we can use the filter() command from above to make two graphs, and then use a table in our word processor to neatly display them side by side.
Here is the code for the graph for only men (with the line “filter(sex==0) %>% #filtering so that we only look at men” added):
plotting.data<-anes2020 %>%
filter(!is.na(income)&!is.na(edu)) %>%
filter(sex==0) %>% #filtering so that we only look at men
group_by(edu, income) %>%
summarise(n=n()) %>%
mutate(freq = n / sum(n))
ggplot(plotting.data, aes(x = edu, y = freq, fill=income)) +
geom_col(position = "fill") +
scale_y_continuous(labels=percent)+
scale_x_discrete(guide = guide_axis(n.dodge=2))+
scale_fill_viridis_d()+
ggtitle("Education and income for men in the 2020 ANES")+
xlab("Education")+
ylab(NULL)
And here is the code for the graph for only women:
plotting.data<-anes2020 %>%
filter(!is.na(income)&!is.na(edu)) %>%
filter(sex==1) %>% #filtering so that we only look at women
group_by(edu, income) %>%
summarise(n=n()) %>%
mutate(freq = n / sum(n))
ggplot(plotting.data, aes(x = edu, y = freq, fill=income)) +
geom_col(position = "fill") +
scale_y_continuous(labels=percent)+
scale_x_discrete(guide = guide_axis(n.dodge=2))+
scale_fill_viridis_d()+
ggtitle("Education and income for women in the 2020 ANES")+
xlab("Education")+
ylab(NULL)
5.5 Adding a control to a mean comparison table
The ANES survey has a number of “feeling thermometer” questions, where respondents were asked how warmly or coldly they felt toward institutions, individuals, and groups of people. These questions all range from 0 (meaning that the respondent feels very negatively toward the subject) to 100 (meaning that the respondent feels very positively about the subject). For this section, I am going to treat the feeling thermometer about Donald Trump as my dependent variable. For my independent variable, I am going to look at a question that respondents were asked about how outraged they feel about how things are going in the country these days, which is coded as “outrage” in the anes2020 dataframe. My hypothesis will be this: when comparing individuals, those who were outraged about how things are going will be more likely to feel negatively about Donald Trump than those who were not outraged about how things are going.
If you run table(anes2020$outrage), you will see that the variable is coded ordinally with five values ranging from “not at all” to “extremely.” Before running the analysis, I am going to simplify this variable so that we are comparing people who are “very” or “extremely” outraged with everyone else. I am going to call my new variable “outrage2.” Here is the code that I will use to simplify this variable:
anes2020<-anes2020 %>%
mutate(outrage2=recode(outrage, 'very'="outraged", 'extremely'="outraged",
.default="not outraged"))
Notice how this code uses the .default qualifier rather than typing out the three other possible values that I wanted to recode into “not outraged.” That is a nice shortcut.
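If you want to confirm that the recode worked the way that you intended, one quick check (a sketch using the table() command from earlier chapters) is to cross-tabulate the old and new variables:
#every value of outrage should map onto the intended value of outrage2
table(anes2020$outrage, anes2020$outrage2)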
Now, let’s make a mean comparison table to see how the mean Donald Trump feeling thermometer score differs between outraged and non-outraged people:
anes2020 %>%
filter(!is.na(ft_trump)&!is.na(outrage2)) %>%
group_by(outrage2) %>%
summarise(mean=mean(ft_trump), sd=sd(ft_trump), n=n())
## # A tibble: 2 × 4
## outrage2 mean sd n
## <ord> <dbl> <dbl> <int>
## 1 not outraged 56.6 35.9 3420
## 2 outraged 28.4 39.2 4617
This suggests that our hypothesis is true. People who were outraged in 2020 gave President Trump an average score of 28.4 on a feeling thermometer, while people who were not outraged gave President Trump an average score of 56.6: a difference of 28.2 points. That seems like a pretty substantial difference!
To graph this relationship, we can use this code:
ggplot(anes2020 %>% filter(!is.na(outrage2)), aes(x = outrage2, y = ft_trump)) +
geom_boxplot(color="black", fill="purple", alpha=0.2)+
ggtitle("ANES 2020, Feelings about President Trump by level of outrage")+
xlab("Is R outraged?")+
ylab("Feelings about President Trump\n(100 is warm and 0 is cold)")
This graph shows us that the feeling thermometer scores for President Trump seem to have ranged from 0 to 100 for both outraged and non-outraged respondents. However, the median score for outraged respondents was a 0, while the median score for not outraged respondents was a 65, which is just below the 3rd quartile of outraged respondents (which was a 70).
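If you would rather compute those numbers than read them off the graph, one option (a sketch that reuses the group_by() and summarise() pattern from above) is:
#median and quartiles of the Trump feeling thermometer, by level of outrage
anes2020 %>%
filter(!is.na(ft_trump)&!is.na(outrage2)) %>%
group_by(outrage2) %>%
summarise(q1=quantile(ft_trump, 0.25), median=median(ft_trump), q3=quantile(ft_trump, 0.75))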
When doing this analysis, I suspected that what was really going on had to do with partisanship. Maybe Democrats were more likely to be outraged than Republicans, since the president was a Republican. To test this hypothesis, I took a look at my partyid variable, and I saw that it had seven values ranging from “strong democrat” to “strong republican.” I would like to simplify this variable to make these results a bit easier to interpret, so I do so with the following code (notice how I save myself some typing with the .default qualifier):
anes2020<-anes2020 %>%
mutate(partyid3=recode(partyid, 'strong democrat'="democrat",
'not very strong democrat'="democrat",
'independent democrat'="democrat",
'independent'="independent",
.default="republican"))
Now let’s add our control variable to our mean comparison table. It is easier to add a control variable to a mean comparison table than it was to add one to our crosstab. To do so, we can use the following code:
anes2020 %>%
#first I filter out the NAs for all 3 variables
filter(!is.na(ft_trump)&!is.na(partyid3)&!is.na(outrage2)) %>%
#then I group by the control first, and then the IV
group_by(partyid3, outrage2) %>%
#then I summarise in the same way that I do with a two-variable table
summarise(mean=mean(ft_trump), sd=sd(ft_trump), n=n())
## # A tibble: 6 × 5
## # Groups: partyid3 [3]
## partyid3 outrage2 mean sd n
## <ord> <ord> <dbl> <dbl> <int>
## 1 democrat not outraged 20.8 25.9 955
## 2 democrat outraged 4.56 13.9 2782
## 3 independent not outraged 45.7 30.1 487
## 4 independent outraged 24.5 30.8 436
## 5 republican not outraged 76.8 25.1 1968
## 6 republican outraged 77.3 29.5 1391
The comments explain what is happening in the code. It is almost the same as the code for a two variable mean comparison table, but I group by the control (first) and then the independent variable, rather than just grouping by the independent variable, as I did in the two-variable code.
Let’s look at this output. First, focus on the first two rows. They tell us that Democrats who were outraged in 2020 felt about 16.24 points more coldly toward President Trump than Democrats who were not outraged (because 20.8-4.56=16.24). For independents, the effect of outrage was even larger. Independents who were outraged felt 21.2 points more coldly toward President Trump than independents who were not outraged (because 45.7-24.5=21.2). For Republicans, there seemed to be very little relationship at all between outrage and feelings about President Trump. In fact, Republicans who were outraged felt -0.5 points more coldly toward President Trump (because 76.8-77.3=-0.5). Or, put another way, outraged Republicans felt 0.5 points more warmly toward President Trump than did non-outraged Republicans.
In other words, the relationship between outrage in 2020 and feelings about President Trump is very different at different levels of partisanship. We thus call this an interactive relationship.
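If you would like R to compute those differences for you, one option (a sketch using tidyr’s pivot_wider() function, which loads as part of the tidyverse) is to spread the outraged and not-outraged means into separate columns and then subtract:
#difference in mean feelings about President Trump between outraged and
#not-outraged respondents, within each party
anes2020 %>%
filter(!is.na(ft_trump)&!is.na(partyid3)&!is.na(outrage2)) %>%
group_by(partyid3, outrage2) %>%
summarise(mean=mean(ft_trump)) %>%
pivot_wider(names_from=outrage2, values_from=mean) %>%
mutate(difference=`not outraged`-outraged)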
We can save this three variable mean comparison table like this:
ft_trump.3var<-anes2020 %>%
filter(!is.na(ft_trump)&!is.na(partyid3)&!is.na(outrage2)) %>%
group_by(partyid3, outrage2) %>%
summarise(mean=mean(ft_trump), sd=sd(ft_trump), n=n())
flextable(ft_trump.3var)%>%
save_as_docx(path="ft_trump.3var.docx")
To graphically represent this relationship, we can use this code (which you can also find in the “Strausz ggplot2 templates” which is in the “rscripts” folder that was in the zipped filed that you should have downloaded in section 1.3):
ggplot(anes2020 %>% filter(!is.na(partyid3) & (!is.na(outrage2))),
aes(x = partyid3, y = ft_trump, fill=outrage2)) +
geom_boxplot()+
ggtitle("ANES 2020, Feelings about President Trump by PartyID and Outrage")+
xlab("Party ID")+
ylab("Feelings about President Trump\n(100 is warm and 0 is cold)")+
scale_fill_discrete(name = "Level of outrage")
This boxplot clearly shows how the relationship between outrage and feelings about President Trump was different for Democrats and independents than it was for Republicans. This is a really nice example of an interaction, and it suggests that “outrage” may have had different meanings for Democrats, independents, and Republicans.
5.6 Review of this chapter’s commands
Command | Purpose | Library |
---|---|---|
filter(VARIABLE==VALUE) %>% | A way to filter out cases that you are not interested in. If the value you are interested in is coded as text (as is the case with our nominal and ordinal variables), be sure to put it in quotation marks. | dplyr (tidyverse) |
5.7 Review exercises
Let’s practice some of the things that we learned in this chapter.
- Create a new R script called “Chapter 5 Exercises,” and save it to your R scripts folder. Make a header for the script based on the description from section 2.3. Use the library() command to load the tidyverse, Hmisc, tigerstats, and scales libraries.
- Choose a variable from either anes2020, states2010, or world that you are interested in thinking about as a dependent variable (don’t choose one of the variables that we used as an example in this chapter). Generate an appropriate graph and calculate the relevant statistics to make observations about this variable’s central tendency and dispersion. Create a new file in your word processor, and paste your graph in. Also list all relevant statistics to help you make observations about this variable’s central tendency and dispersion.
- Choose a second variable from the same dataframe that you used for question 2 to use as an independent variable (i.e. as a variable that you think might cause variation in your dependent variable). In your word processor file, write a few sentences about what kind of relationship you expect to see between your independent and dependent variables and why.
- If the variable that you selected in #3 is interval, use the cut2() and/or mutate(recode()) commands to make it into an ordinal variable. If the variable that you selected in #3 is ordinal or nominal, make observations about the variable’s central tendency and dispersion in your word processor file (supported by an appropriate graph and appropriate statistics).
- Generate a crosstab or a mean comparison table (as appropriate) to examine the relationship between your independent variable from #3 and your dependent variable from #2, and also generate the appropriate graph to examine this relationship. Paste your graphics and your crosstab or mean comparison table (in a readable format) into your word processor file. Be sure to format your crosstab or mean comparison table in a way that is easy for the reader to understand (this might involve creating a table).
- In your word processor document, briefly (in 1-4 sentences) discuss what you conclude about your hypothesis based on the output from #5.
- Now choose another variable to use as a control. Using cut2() and/or mutate(recode()), recode this variable so that it has only a few values (two, or three at most).
- Generate a crosstab or a mean comparison table (as appropriate) to examine the relationship between your control from #7, your independent variable from #3 and your dependent variable from #2, and also generate the appropriate graph to examine this relationship. Paste your graphics and your crosstab or mean comparison table (in a readable format) into your word processor file. Be sure to format your crosstab or mean comparison table in a way that is easy for the reader to understand (this might involve creating a table).
- In your word processor document, briefly (in 1-4 sentences) discuss what you conclude about your hypothesis based on the output from #8. Would you say that the relationship between your independent variable and your dependent variable, controlling for your control variable, is additive, interactional, or spurious? Why?