John A. Bernau

PhD Candidate, Sociology

github twitter linkedin email
Data Exploration: Where Do American Baptists Live?

In this post, I’ll be using R to download, clean, and plot some basic information about the regional concentration of American denominations using the Baylor Religion Survey of 2010.

To begin, make sure to load a few basic packages.

require("tidyverse")
require("foreign")
require("psych")
require("car")
require("RColorBrewer")

The dataset I will be using comes from the 2010 Baylor Religion Survey (2010), a nationally representative survey of more than 1,500 American adults. It’s hosted on The Association of Religion Data Archives (ARDA). ARDA is a fantastic resource for locating datasets on a number of religious topics. From this link, click on Wave II > Download > Continue. We’ll be using the labeled Stata file. Save this .DTA file somewhere you remember. You’ll also want to have the codebook handy: this shows you the exact wording and responses of the survey questions.

From the foreign package we can use read.dta() to read the Stata .DTA file into our R environment. Make sure to replace the path with the location of your saved file. I use the pipe operator %>% to convert this to a tibble - an popular format for working with data in R, and save it as an object “bay.”"

bay <- read.dta("/Users/johnbernau/Box Sync/1. Desktop/4. Methods/1. R/Baylor Religion 2010/bay3_2010.dta") %>% as.tibble()

Upon first glance (View(bay)), the dataset is quite messy. Let’s remove all the variables we don’t need and save this as “bay2.” A primer on dplyr’s functions can be found here, but the basic format here is select(dataset, variable_to_keep, -variables_to_remove).

bay2 <- select(bay, -(motherl:baylor), -entity, -AUTO_INC, -batch,-LANG1)

Looking at the codebook we can rename some of our key questions with more intuitive names. Q1 asks: “With what religious family, if any, do you most closely identify?” Q3 asks: “How religious do you consider yourself to be?” Q4 asks: “How often do you attend religious services at a place of worship?” The “region” variable groups respondents by East / Midwest / South / West and is already intuitively named.

bay2 <- rename(bay2, rel_family=Q1, how_religious=Q3, attend=Q4)

Now that we have loaded and the data and can recognize our variables, let’s examine the relationship between religious family and region. Format: table(row_variable, column_variable).

table(bay2$rel_family,bay2$region)
##                                            
##                                             East Midwest South West
##   Other                                        8      13    11   14
##   Adventist                                    0       0     0    2
##   African Methodist                            0       2     3    1
##   Assemblies of God                            1       5     4    5
##   Baha'i                                       0       0     1    0
##   Baptist                                     20      48   166   24
##   Bible Church                                 2       0     3    4
##   Brethren                                     1       1     1    0
##   Buddhist                                     3       1     4    5
##   Catholic/Roman Catholic                    111      93   100   86
##   Christian & Missionary Alliance              7       5     2    0
##   Christian Reformed                           1       2     0    0
##   Christian Science                            0       1     0    0
##   Church of Christ                             2       9    18    2
##   Church of God                                3       4     6    1
##   Church of the Nazarene                       1       1     2    1
##   Congregational                               4       1     2    2
##   Disciples of Christ                          0       2     1    1
##   Episcopal/Anglican                          11       5    20    5
##   Hindu                                        1       2     0    0
##   Holiness                                     0       1     5    0
##   Jehovah's Witnesses                          1       2     4    1
##   Jewish                                      11       6     7   10
##   Latter-day Saints                            0       3     3   22
##   Lutheran                                    12      53    24   21
##   Mennonite                                    0       1     0    1
##   Methodist                                   17      37    58   15
##   Muslim                                       0       0     1    1
##   Orthodox (Eastern, Russian, Greek)           2       2     2    3
##   Pentecostal                                  5       9    11    5
##   Presbyterian                                 7      16    22   16
##   Quaker/Friends                               0       0     1    1
##   Reformed Church of America/Dutch Reformed    0       3     1    0
##   Seventh-day Adventist                        0       0     3    2
##   Sikh                                         0       0     0    1
##   Unitarian Universalist                       6       2     2    2
##   United Church of Christ                      4       9     0    2
##   Non-denominational Christian                14      30    63   44
##   No religion [Skip to Q3]                    44      36    41   70
##   Don't know                                   5      11     4    2

We could try to visualize it, but there are too many categories to make sense of right now, so let’s collapse some categories. Saving the result as tab, the following code does three consecutive things. 1) groups by religious family, 2) creates a total variable for the counts of each family, and 3) arranges them in descending order.

tab <- bay2 %>% 
  group_by(rel_family) %>% 
  summarise(total=n()) %>% 
  arrange(desc(total)) 

Let’s plot the top 5 categories and code the rest as “other”. A quick way to do that is to make a list of names

top_5 <- tab$rel_family[1:5]
rest <- tab$rel_family[6:41]

Leaving our tabulation and returning to our full dataset, let’s create a duplicate dataset bay3 to do our recoding.

bay3 <- bay2

Then we’ll say every religious family that is part of our “rest” list, recode to “Other”. The last line cleans up one of our categories to read “None” instead of “No religion [Skip to Q3].”

bay3$rel_family <- recode(bay3$rel_family, ("rest = 'Other'"))
bay3$rel_family <- recode(bay3$rel_family, ("'No religion [Skip to Q3]' = 'None'"))

Doing a quick table, we can double-check out new coding with the original values. The old categories from bay2 will be on the rows, and our new six-category variable will be on the columns.[Output omitted]

table(bay2$rel_family, bay3$rel_family)

Now that our variables are correctly coded / collapsed, we can plot some descriptives. The data we want to plot can be summarized using table(bay3$region, bay3$rel_family) but this is still a bit hard to decipher. The following ggplot specifies the dataset bay3, the variable to plot aes(region), the geometric object to use geom_bar(). This is saved as an object named “final”.

A few other specifications below:

  • aes(fill=rel_family) = color (fill) the bars according to the variable rel_family.

  • stat = "count" = the y-axis for this bar graph will be the raw counts of each category.

  • position = "dodge" = display the bars side by side.

  • scale_fill_brewer() = use the RColorBrewer package to use predefined color palettes. In this case, the “RdYlBu” palette. name = "" removes the legend title.

  • labs() = set main title and axes titles.

final <- ggplot(bay3, aes(region)) + 
  geom_bar(aes(fill=rel_family), stat = "count", position="dodge") +
  scale_fill_brewer(palette= "RdYlBu", name="") + 
  labs(x="", y="", title="US Religion Distribution", subtitle= "Source: Baylor Religion Survey 2010")
final

Maybe we want to order our regional categories in a more geographical order, so West is on the left, and East is on the right. This can be accomplished by reordering the factor levels.

bay3$region <- factor(bay3$region, levels = c("West","Midwest","South", "East"))

Lets do one more round of editing. When collapsing categories into “other” we made a category that dominates most regions of the US, somewhat obscuring our story. Rather than take this data out, let’s use color to de-emphasize it. To do so, let’s add a dark grey to our color scheme.

First, the brewer.pal() command allows you to save a vector of colors. Below we’ve taked 5 colors from the “RdYlBu” palette and saved it as a vector named “colors.” Second, let’s add “grey40” to this vector. Now our colors vector is a list of six colors with the last one being a dark grey color.

colors <- brewer.pal(5, "RdYlBu")
colors <- c(colors, "grey40")

Using this new vector, we’ll copy our first ggplot but specify scale_fill_manual() and pass our colors vector to the values parameter.

final2 <- ggplot(bay3, aes(region)) + 
  geom_bar(aes(fill=rel_family), stat = "count", position="dodge") +
  scale_fill_manual(values=colors, name="") + 
  labs(x="", y="", title="US Religion Distribution", subtitle= "Source: Baylor Religion Survey 2010")
final2

If you want to super-customize your graph, you can save various themes or templates to use with ggplot. Below I’ve created a dark template named johntheme1.

johntheme1 <- theme(text=element_text(family="Times New Roman"), # New font
        plot.title = element_text(hjust = 0.5), # Centered title
        plot.subtitle = element_text(hjust = 0.5), # Centered subtitle
        plot.background = element_rect(fill="black"), # Black background
        panel.background = element_rect(fill="gray20"), # Dark grey panel background
        panel.grid.minor = element_line(color="black"), # Hide grid lines
        panel.grid.major = element_line(color="black"), # Hide grid lines
        axis.text = element_text(color="white"), # Make axis text white
        title = element_text(color="white", face="bold"), # Make title white and bold.
        legend.background = element_rect(fill="black"), # Make legend background black
        legend.text = element_text(color="white"), # Make legend text white
        legend.key = element_rect(fill="black", color="black")) #Squares/borders of legend black

Using our saved graph, we can add our new custom theme.

final2 + johntheme1

A nice graph! Export as a JPG using the ggsave() command.

ggsave(filename="/users/johnbernau/Desktop/Baylor2010.jpg", final2, width=11.25, height=7, units="in")

POSTSCRIPT: There are a few different philosophies when it comes to manipulating your data. Some like to keep a copy of the original in case they have to go back. Others say this clutters the working environment, leading to more mistakes. Neither is right or wrong. A condensed version of the code above would look like this.

# Load packages
require("tidyverse")
require("foreign")
require("psych")
require("car")
require("RColorBrewer")

# Load data
bay <- read.dta("/Users/johnbernau/Box Sync/1. Desktop/4. Methods/1. R/Baylor Religion 2010/bay3_2010.dta") %>% 
  as.tibble()

# Delete unwanted variables and rename. 
bay <- select(bay, -(motherl:baylor), -entity, -AUTO_INC, -batch,-LANG1) %>% 
          rename(rel_family=Q1, how_religious=Q3, attend=Q4)

# Create tabulation of religious families.
tab <- bay %>% 
  group_by(rel_family) %>% 
  summarise(total=n()) %>% 
  arrange(desc(total)) 

# Save vectors of religious families: top 5 and rest.
top_5 <- tab$rel_family[1:5]
rest <- tab$rel_family[6:41]

# Recode variables
bay$rel_family <- recode(bay$rel_family, ("rest = 'Other'"))
bay$rel_family <- recode(bay$rel_family, ("'No religion [Skip to Q3]' = 'None'"))

# Create custom color palette
colors <- brewer.pal(5, "RdYlBu")
colors <- c(colors, "grey40")

# Plot
final2 <- ggplot(bay, aes(region)) + 
  geom_bar(aes(fill=rel_family), stat = "count", position="dodge") +
  scale_fill_manual(values=colors, name="") + 
  labs(x="", y="", title="US Religion Distribution\nBaylor Religion Survey 2010")
final2

Code Home