Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
1
In this handout, we will continue to discuss ways to investigate the relationship between two
categorical variables. Recall that a categorical variable takes on a value that is the name of a
category or a label. When examining the relationships between variables, it may also help to
consider the following definitions.
Definitions
Dependent Variable: The variable whose changes we wish to study. This is often called the
response variable.
Independent Variable: A variable under consideration that is used to explain changes in the
dependent variable. This is often called the predictor variable. Note that studies often
involve more than one independent variable.
Example: Investigating Survival on the Titanic
Consider the following screenshot of data from the Titanic, the famous ship that sunk in 1912.
This dataset can be found in the Titanic.xlsx file on the course web site.
Suppose you were asked to investigate the following questions. Identify the independent and
dependent variable for each scenario.
1. Are 1
st
class passengers more likely to have survived than 3
rd
class passengers?
Dependent Variable (i.e., the response):
Independent Variable (i.e., the predictor):
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
2
2. Are females more likely to have survived than males?
Dependent Variable (i.e., the response):
Independent Variable (i.e., the predictor):
An image of the R.M.S. Titanic floorplan.
Our task is to investigate the relationship between the variable Survival and other variables such
as Passenger Class and Gender. We are starting with raw data, and ultimately, we want to get a
table of counts like we did in Handout 6. Then, we will calculate relevant percentages from this
table to help us investigate the questions of interest. The Excel tool that allows us to obtain
appropriate summaries is called a PivotTable.
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
3
PivotTables are one of the most under-utilized tools in Excel when it comes to analyzing data.
PivotTables are powerful and permit us to slicethe data in various ways, which is important
when investigating relationships between variables.
Getting a PivotTable in Excel
To obtain a PivotTable in Excel, first highlight the data that you want to summarize. Then,
select Insert > PivotTable as is shown here. Note: Mac users can find PivotTable options on the
Data tab.
The Create PivotTable window will appear. If you have previously highlighted the data that
you want to summarize, it will appear in the Select a table or rangebox. If not, you should
enter the range of the data to be summarized in this box before proceeding.
The radio button near the bottom allows you to specify where the PivotTable will be placed. I
typically place the PivotTable on a New Worksheet. Once you have selected this, click OK.
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
4
The following is returned by Excel. This template is the basic template for a PivotTable.
To create a PivotTable, simply select the variables you’d like to summarize and place them in
either a row or column. The independent variable (i.e., the predictor) is typically placed in the
rows of the table and the dependent variable (i.e., the response) in the columns of the table.
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
5
Notice that this created Row and Column Labels for a table in the Excel sheet; however, nothing
has been summarized in the table so far. To obtain the appropriate counts in the table, you
must drop one of the variables of interest into the Values box in the lower right-hand corner.
To summarize, this is how we would set up the PivotTable in Excel to investigate differences in
Survival Status across Passenger Class.
The dependent variable (i.e., Survived) was placed in the Column Labels box
The independent variable (i.e., Passenger Class) was placed in the Row Labels box
Survived was placed in the Values box to obtain the counts
Do this in Excel and use your PivotTable to obtain the appropriate counts. Write them into the
following table. In the previous handout, we called this a bivariate frequency table.
Statisticians, however, often refer to this type of summary as a contingency table or cross-tab
table (cross-tab is short for cross-tabular).
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
6
Passenger
Class
Survived
Total
No
Yes
1
st
Class 323
2
nd
Class 277
3
rd
Class 709
Total
809
500
1309
Questions
1. How many first class passengers were on the Titanic? How many 3
rd
class passengers?
2. How many passengers survived? Did a majority of the passengers survive?
3. Your friend makes the following statement: “More 3
rd
class passengers survived than 2
nd
class passengers (181 vs. 119).” This statement is technically true, but it is misleading.
Explain why this is misleading.
Recall that when making comparisions between groups, we must take into consideration that
the groups may be of different sizes. To alleviate this concern, we compute row percentages.
Questions
4. Consider only the 3
rd
class passengers. How many 3
rd
class passengers were there?
How many survived?
5. What percentage of the 3
rd
class passengers survived?
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
7
6. Consider next the 2
nd
class passengers. What percentage of the 2
nd
class passengers
survived?
7. Compare the percentage of 3
rd
class passengers that survived to the percentage of 2
nd
class passengers that survived. How different are these percentages?
8. Suppose your friend disagrees with how you computed the survival rate for each group
above and instead uses the total number of passengers in the denominator.
Survival Rate for 2
nd
class passengers: 119/1309 ≈ 9%
Survival Rate for 3
rd
class passengers: 181/1309 ≈ 14%
Compare these percentages.
9. Do the comparisons made in Question 7 agree with those in Question 8? If they differ,
which comparisons are more meaningful? Discuss.
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
8
Getting Row Percentages in Excel
Excel can be used to directly calculate the necessary row percentages so that fair comparisons
can be made across Passenger Class. To obtain these row percentages, right-click on the Count
of Survived cell in your PivotTable. Select Value Field Settings
In the Value Field Settings window, select the Show Values As tab, and select % of Row Total from
the Show values as drop-down menu. Click OK.
Use your PivotTable output to obtain the missing row percentages in the following table.
Passenger
Class
Survived
Total
No
Yes
1
st
Class 38.1% 61.9% 100%
2
nd
Class 57.0% 100%
3
rd
Class 25.5% 100%
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
9
Questions
10. Explain how the value 61.9% was computed in the above table. Show the formula with
the actual numbers.
11. In this example, the percentages in the first row add up to 100%. Will this always be the
case when row percentages are computed? Discuss.
12. Compare the survival rate of 1
st
class passengers to that of 3
rd
class passengers. How
different are these survival rates?
The following graph shows the survivial rate (i.e., the percentage that survived) for each
passenger class.
Questions
13. What is the exact height of the 1
st
Class bar? That is, what values were used to construct
this plot?
14. What pattern(s) do you see in the plot? Discuss.
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
10
When Excel is used to create a chart based on the PivotTable results, it includes all of the row
percentages in the chartnot just the row percentages for those that survived.
Questions:
15. On the above plot, sketch the appropriate bar for the proportion that died for the 2
nd
class and 3
rd
class passengers. What pattern(s) is/are present when the proportion that
died is considered across the passengers classes? Discuss.
The graph shown above is typically called a side-by-side bar chart (Excel calls it a clustered
column chart). Such charts are used very often to summarize relationships between two
categorical variables, but we can do better than this. One suggestion to improve this graphical
summary of the relationship between Passenger Class and Survival is to stack the bars for each
passenger class instead of using the side-by-side bars.
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
11
Creating 100% Stacked Column Charts in Excel
To create this graph in Excel, first highlight the cells containing your PivotTable. Then, select
Insert > Column Chart > 100% Stacked Column.
Excel returns the following:
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
12
To remove some of the white space between the columns, select one of the bars, right-click, and
select Format Data Series… In the Format Data Series window, you can reduce the Gap width.
You can also right-click and “Add Data Labels” to display the appropriate row percentages on
your chart. Finally, you can add a title if desired using the Chart Layouts menu.
The resulting chart is shown below:
Investigating Survival Status across Gender
Next, consider the relationship between Survival and Gender.
Tasks:
1. Use the PivotTable feature in Excel to obtain the following table.
Gender
Survived
Total
No
Female
127
466
Male
682
843
Total
809
1309
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
13
2. Obtain the appropriate row percentages using Excel. Write these in the table below
Gender
Survived
Total
No
Yes
Female 100%
Male 100%
Total 100%
3. Create a 100% Stacked Column Chart using the row percentages computed above.
Sketch in your result below.
Questions:
1. What was the survival rate of Females on the Titanic?
2. What was the survival rate of Males on the Titanic?
3. How different were the survival rates across Gender? Discuss.
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
14
Considering a Third Variable
The previous investigation considered the relationship between Survival and Gender. What
happens if we also consider whether the passenger was a child or an adult? To investigate this,
simply drag the Child/Adult variable to the Report Filter box.
Using the Filter in either the Pivot table or chart, you can select to see information for the
children and adults separately. Do this to complete the following table:
Survival Rates
Adult
Child
Female
Male
Female
Male
You can also examine the relationships between variables in the following plots.
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
15
Questions:
4. Does the relationship between Gender and Survival change based on whether the
passenger was a child or an adult? Explain.
Example: Political Affiliation and Opinions on Immigration Reform
Recall the example from Handout 7. A recent CNN/ORC opinion poll surveyed a random
sample of 1,010 adults nationwide. Respondents were asked the following question: “What
should be the main focus of the U.S. government in dealing with the issue of illegal
immigration: developing a plan that would allow illegal immigrants who have jobs to become
legal U.S. residents, or developing a plan for stopping the flow of illegal immigrants into the
U.S. and for deporting those already here?”
The data from this opinion poll are reported in the following contingency table:
Legal residency
Stop flow/deport
Unsure
Totals
Democrat
202
79
12
293
Independent
267
189
29
485
Republican
79
144
9
232
Totals
548
412
50
1,010
Suppose that the question of interest is as follows: Do opinions on this issue tend to differ across
political affiliation?
Questions:
1. What are the two variables of interest in this study?
2. Which is the dependent variable (i.e., the response)?
3. Which is the independent variable (i.e., the predictor)?
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
16
Analyzing Data from an Existing Contingency Table in Excel
In Handout 7, we computed percentages that would be relevant for investigating this question
of interest. Here, our objective is to compute these percentages and to obtain a graphical
summary of the data using Excel.
Note that in terms of how the data were presented to us, there is one big difference between this
example and the Titanic example. In the Titanic example, our Excel file contained the raw data.
In other words, each row in the spreadsheet represented a single subject in the study.
A snippet of the Titanic Data
We were given the raw data for the Titanic example, and we had to create a contingency table
using PivotTables in Excel. We were not provided with raw data for the Immigration Reform
study; instead, the data were given to us in the form of a contingency table. If we were to use
the exact same steps in Excel for the Immigration Reform example, we would need to create an
Excel sheet that was set up so that each row represents the outcomes for a single subject in the
study. For example, we could start with something like this:
This is just a start. There were 1,010 respondents in this study, so we would have to create a
sheet with 1,010 rows. Of these rows, 202 would identify Democrats who said “Legal
residency,” 79 would identify Democrats who said “Stop flow/deport,” etc. Doesn’t sound like
much fun, does it? Instead of going through all of this trouble, let’s take a short-cut. In Excel,
we can enter the data as follows. Essentially, we will give each cell in the contingency table its
own row in Excel. Then, we’ll add a column to our data sheet that tells Excel how many
respondents appeared in each cell.
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
17
Enter the data from the contingency table in Excel as shown below.
Now, we can create a PivotTable from these data.
Just like we did with the Titanic example, highlight the cells containing the data, and
insert a PivotTable. Be sure to place the independent variable in the rows and the
dependent variable in the columns.
Here is the difference between this example and the Titanic example: when we are
working with data that have already been summarized with counts in Excel (i.e., when
each row in the Excel file represents a cell in a contingency table instead of a single
subject in the study), then place the Count in the Values box.
Handout 7: Understanding Relationships Between Two Categorical Variables
STAT 100 – Spring 2016
18
You should see the following PivotTable.
Just like we did with the Titanic Data, we can look at relevant Row Percentages instead of
counts:
Finally, we can create the 100% stacked column chart. Do this in Excel, and sketch the results
below.
Questions:
4. Use the information in this plot to write a sentence or two addressing the question of
interest in this study: Do opinions on this issue tend to differ across political affiliation?