Data Exploration: Global Trends in Extreme Poverty 1977-2016

In this post I perform some data exploration in R, covering four steps: importing, preparing, merging, and plotting data.

More specifically, I’ll be using World Bank data on extreme poverty to plot trends for 162 countries between 1977 and 2016. While sociologists talk about inequality at length, it’s useful to remember the state of poverty from a global perspective. Hans Rosling’s recent book (2018) brought this kind of global development data to public attention, and Jeffrey Sachs’ (2006) book has been on my reading list for some time. Stanford’s Center on Poverty and Inequality publishes a great magazine on the state of the field, with all issues available free as PDF or hard copy. For the purpose of my exploration here, the World Bank data measures poverty as the percentage of a country’s population living under the poverty line, defined here as $1.90 / day or less in 2011 purchasing power parity (PPP) dollars.

RQ: How is extreme poverty distributed globally?

# Loading required packages:
library(tidyverse)

Importing Data
From the World Bank’s site, download the zipped CSV archive and unzip it. It contains three CSV files; we’ll start with the “API…csv” file. The file begins with four unnecessary rows, so we’ll set skip = 4, and we’ll keep strings as characters rather than factors.

d <- read.csv("/Users/johnbernau/Box Sync/1. Desktop/6. Methods/1. R/worldbankpoverty/API.csv", skip = 4, stringsAsFactors = FALSE)

Preparing Data
Look at the structure using names(d) and summary(d) and delete unwanted or redundant variables. A quick glimpse shows a lot of missing values.
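To put a number on that impression, one option is to count missing values per column. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the World Bank file) and relies only on base R:

```r
# Toy stand-in for the imported data; the values are invented for illustration.
toy <- data.frame(
  Country.Name = c("A", "B", "C"),
  X1960 = c(NA, NA, NA),
  X1961 = c(NA, 2.5, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_per_col <- colSums(is.na(toy))

# Share of missing values in each column (0 = complete, 1 = entirely missing).
na_share <- colMeans(is.na(toy))
```

A column whose count equals nrow(toy) carries no information at all, which is the condition the cleaning step further down checks for.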
d <- select(d, -Indicator.Name, -Indicator.Code, -X)
head(d[,1:8], 10)
##            Country.Name Country.Code X1960 X1961 X1962 X1963 X1964 X1965
## 1                 Aruba          ABW    NA    NA    NA    NA    NA    NA
## 2           Afghanistan          AFG    NA    NA    NA    NA    NA    NA
## 3                Angola          AGO    NA    NA    NA    NA    NA    NA
## 4               Albania          ALB    NA    NA    NA    NA    NA    NA
## 5               Andorra          AND    NA    NA    NA    NA    NA    NA
## 6            Arab World          ARB    NA    NA    NA    NA    NA    NA
## 7  United Arab Emirates          ARE    NA    NA    NA    NA    NA    NA
## 8             Argentina          ARG    NA    NA    NA    NA    NA    NA
## 9               Armenia          ARM    NA    NA    NA    NA    NA    NA
## 10       American Samoa          ASM    NA    NA    NA    NA    NA    NA

The first line of code below (which I borrowed) removes all columns that have as many NA values as rows. In other words, columns that contain no information. The second line removes any rows that are missing all useful data. Because there are 41 variables, two of which are country info and 39 of which are reporting years, a row that is missing all 39 year columns retains nothing useful for our analysis.

# Remove missing columns / rows
d <- d[,colSums(is.na(d)) < nrow(d)]
d <- d[rowSums(is.na(d)) < 39,]

# Rename some variables for future use.
d <- rename(d, country_long = Country.Name, country_short = Country.Code)

Then, gather years into one column and remove missing rows. Because it essentially turns columns into rows, the gather() function usually trips people up. I’ve included the head of both dataframes pre- and post-gather for clarity.
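Before the real output, a toy version of the reshape may help. This is a minimal sketch on an invented two-country data frame (`wide` and its values are made up for illustration). It also shows pivot_longer(), which tidyr now recommends in place of gather(); the two return the same rows, though in a different order.

```r
library(tidyr)

# Invented wide data: one row per country, one column per year.
wide <- data.frame(
  country = c("A", "B"),
  X2000 = c(10, 20),
  X2001 = c(11, NA),
  stringsAsFactors = FALSE
)

# gather(): the year column names go into `year`, the cell contents into `value`.
long_gather <- gather(wide, "year", "value", X2000:X2001)

# pivot_longer(): the newer equivalent of the same reshape.
long_pivot <- pivot_longer(wide, cols = X2000:X2001,
                           names_to = "year", values_to = "value")
```

Each country-year pair becomes its own row, so two countries with two year columns yield four rows, including the one whose value is NA.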
dt <- gather(d, "year", "value", X1977:X2016)
head(d[,1:8], 10)
##    country_long country_short X1977 X1979 X1980 X1981 X1982 X1983
## 3        Angola           AGO    NA    NA    NA    NA    NA    NA
## 4       Albania           ALB    NA    NA    NA    NA    NA    NA
## 8     Argentina           ARG    NA    NA   0.2    NA    NA    NA
## 9       Armenia           ARM    NA    NA    NA    NA    NA    NA
## 12    Australia           AUS    NA    NA    NA     1    NA    NA
## 13      Austria           AUT    NA    NA    NA    NA    NA    NA
## 14   Azerbaijan           AZE    NA    NA    NA    NA    NA    NA
## 15      Burundi           BDI    NA    NA    NA    NA    NA    NA
## 16      Belgium           BEL    NA    NA    NA    NA    NA    NA
## 17        Benin           BEN    NA    NA    NA    NA    NA    NA
head(dt, 10)
##    country_long country_short  year value
## 1        Angola           AGO X1977    NA
## 2       Albania           ALB X1977    NA
## 3     Argentina           ARG X1977    NA
## 4       Armenia           ARM X1977    NA
## 5     Australia           AUS X1977    NA
## 6       Austria           AUT X1977    NA
## 7    Azerbaijan           AZE X1977    NA
## 8       Burundi           BDI X1977    NA
## 9       Belgium           BEL X1977    NA
## 10        Benin           BEN X1977    NA

After this, I remove any row with a missing value of our poverty variable. The last preparation step is removing the “X” character that precedes each year and converting the result to a date class. R adds this character on import because column names that begin with a digit are not syntactically valid names.

dt <- filter(dt, !is.na(value))
dt$year <- as.Date(gsub("X", "", dt$year), format = "%Y")

A quick plot shows some initial information, but we’re not quite done yet. There are too many countries to interpret, and aside from hand-coding there is no easy way to sort them out.

ggplot(dt, aes(year, value)) + geom_line(aes(group = country_short))

Merging Data
Luckily, we can use the accompanying metadata file to append region codes to our data. By adding a category of geographic regions (e.g. “North Africa”, “East Asia”) we can start to see larger patterns of geographically grouped countries.

m <- read.csv("/Users/johnbernau/Box Sync/1. Desktop/6. Methods/1. R/worldbankpoverty/Metadata.Country.csv", stringsAsFactors = FALSE)

By default, the join matches on columns that share a name, so the key must be named identically in both data frames. After renaming the country code variable in the metadata, I join it with the poverty data.
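Renaming is one route. As a side note, dplyr joins can also match differently named keys through the `by` argument. A small sketch with invented data frames (`pov` and `meta` are hypothetical stand-ins for dt and m, with made-up values):

```r
library(dplyr)

# Hypothetical stand-ins for the poverty data and the country metadata.
pov <- data.frame(country_short = c("AAA", "BBB", "CCC"),
                  value = c(1.5, 20, 3),
                  stringsAsFactors = FALSE)
meta <- data.frame(Country.Code = c("AAA", "BBB"),
                   Region = c("Region 1", "Region 2"),
                   stringsAsFactors = FALSE)

# Match pov$country_short against meta$Country.Code without renaming either.
joined <- inner_join(pov, meta, by = c("country_short" = "Country.Code"))

# inner_join() drops rows of pov that have no match in meta;
# anti_join() lists them so nothing disappears silently.
unmatched <- anti_join(pov, meta, by = c("country_short" = "Country.Code"))
```

Checking the anti_join() result before moving on is a cheap way to see which country codes, if any, are absent from the metadata.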
Lastly, we’ll remove any rows with a missing Region value and take out North America.

m <- rename(m, country_short = Country.Code)
j <- inner_join(dt, m)
j$Region[j$Region == ""] <- NA
j <- filter(j, !is.na(Region) & Region != "North America")

Plotting Data
Our data is finally ready to plot. Each region gets its own facet, and lines are colored from black (low values) to red (high values) for dramatic effect. Make sure to specify the group aesthetic when drawing multiple lines with geom_line(). This tells ggplot which observations to connect with a line.

ggplot(j, aes(year, value)) +
  geom_line(aes(group = country_short, color = value)) +
  scale_color_continuous(low = "black", high = "red", guide = "none") +
  facet_wrap(~Region) +
  labs(title = "Percent of Population Living Under $1.90 / Day",
       x = NULL, y = "% Under Poverty Line",
       caption = "\nData: World Bank Extreme Poverty. Adjusted to 2011 PPP.\nJohn A. Bernau 2018 // www.johnabernau.com") +
  theme(text = element_text(size = 14, family = "Times New Roman"),
        axis.text = element_text(size = 14),
        panel.background = element_rect(fill = "white"),
        panel.grid.minor = element_line(color = "grey90"),
        panel.grid.major = element_line(color = "grey90"),
        plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
        plot.caption = element_text(size = 9))

ggsave("~/Desktop/wb_poverty.jpg", width = 10, height = 7, units = "in")

We can see that the percentage of people living in extreme poverty is declining across most countries, a hopeful conclusion shared by both the World Bank’s reporting and Hans Rosling’s book. However, a substantial portion of Sub-Saharan Africa remains in a state of extreme poverty. It’s easy to disagree about effective solutions, but there is no denying the existence of a pressing moral problem.

Appendix
A less verbose coder could accomplish the preparation in a mere 10 lines of code:

d <- read.csv(file.choose(), skip = 4, stringsAsFactors = FALSE)
d <- select(d, -Indicator.Name, -Indicator.Code, -X, -(X1960:X1976))
d <- rename(d, country_long = Country.Name, country_short = Country.Code)
dt <- gather(d, "year", "value", X1977:X2017) %>% filter(!is.na(value))
dt$year <- as.Date(gsub("X", "", dt$year), format = "%Y")

m <- read.csv(file.choose(), stringsAsFactors = FALSE)
m <- rename(m, country_short = Country.Code)
j <- inner_join(dt, m)
j$Region[j$Region == ""] <- NA
j <- filter(j, !is.na(Region) & Region != "North America")
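
One detail of the year conversion is worth noting: when as.Date() is given only a "%Y" format, R fills in the missing month and day from the current date, so the resulting dates depend on when the script is run. A minimal alternative pins the month and day explicitly. It assumes January 1 is an acceptable placeholder for annual data, which is my assumption and not part of the original code:

```r
# Year labels as they appear after gather(), e.g. "X1977".
year_raw <- c("X1977", "X1980", "X2016")

# Strip the leading "X" and append a fixed month and day (assumed: January 1).
year_date <- as.Date(paste0(sub("^X", "", year_raw), "-01-01"))

# Show the result in ISO format.
format(year_date, "%Y-%m-%d")
```

Because the plot only uses the year axis at a coarse scale, either approach draws the same picture; the fixed placeholder simply makes the data reproducible from one run to the next.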
