Data Exploration: Global Trends in Extreme Poverty 1977-2016

In this post I perform some data exploration in R. I’ll cover the following topics: importing, preparing, merging, and plotting data.

More specifically, I’ll be using the World Bank data on extreme poverty to plot global trends for 162 countries from 1977 to 2016. While sociologists talk about inequality at length, it’s useful to remember the state of poverty from a global perspective. Hans Rosling’s recent book (2018) brought this kind of global development data to public attention, and Jeffrey Sachs’ (2006) book has been on my reading list for some time. Stanford’s Center on Poverty and Inequality publishes a great magazine on the state of the field, with all issues available free as PDF or hard copy. For the purpose of my exploration here, the World Bank data measures poverty as the percentage of a country’s population living under the poverty line, defined here as $1.90 / day or less in 2011 purchasing power parity (PPP) dollars.

RQ: How is extreme poverty distributed globally?


# Loading required packages:
library(tidyverse)

Importing Data

From the World Bank’s site, download the CSV file and unzip. There will be three CSV files and we’ll start with the “API…csv” first. There are four unnecessary rows, so we’ll set the skip option appropriately, and preserve strings as characters.

d <- read.csv("/Users/johnbernau/Box Sync/1. Desktop/6. Methods/1. R/worldbankpoverty/API.csv", skip = 4, stringsAsFactors = FALSE)

Preparing Data

Look at the structure using names(d) and summary(d) and delete unwanted or redundant variables. A quick glimpse shows a lot of missing values.

d <- select(d, -Indicator.Name, -Indicator.Code, -X)
head(d[,1:8], 10)
##            Country.Name Country.Code X1960 X1961 X1962 X1963 X1964 X1965
## 1                 Aruba          ABW    NA    NA    NA    NA    NA    NA
## 2           Afghanistan          AFG    NA    NA    NA    NA    NA    NA
## 3                Angola          AGO    NA    NA    NA    NA    NA    NA
## 4               Albania          ALB    NA    NA    NA    NA    NA    NA
## 5               Andorra          AND    NA    NA    NA    NA    NA    NA
## 6            Arab World          ARB    NA    NA    NA    NA    NA    NA
## 7  United Arab Emirates          ARE    NA    NA    NA    NA    NA    NA
## 8             Argentina          ARG    NA    NA    NA    NA    NA    NA
## 9               Armenia          ARM    NA    NA    NA    NA    NA    NA
## 10       American Samoa          ASM    NA    NA    NA    NA    NA    NA

The first line below (borrowed here) removes every column that has as many NA values as there are rows, in other words, columns that contain no information. The second line removes any row that is missing all useful data: of the 41 variables, two are country identifiers and 39 are reporting years, so a row with 39 missing values retains nothing useful for our analysis.

# Remove missing columns / rows
d <- d[,colSums(is.na(d)) < nrow(d)]
d <- d[rowSums(is.na(d)) < 39,]

# Rename some variables for future use.
d <- rename(d, country_long = Country.Name, country_short = Country.Code)
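
To see what the two NA filters above do, here is a minimal sketch on a made-up data frame (the `toy` object and its values are invented for illustration, not World Bank figures). A column survives only if it has fewer NAs than there are rows, and a row survives only if it has fewer NAs than there are year columns.

```r
# Toy data: made-up values, not World Bank figures
toy <- data.frame(
  country = c("A", "B", "C"),
  X2000   = c(NA, NA, NA),    # entirely missing column
  X2001   = c(1.5, NA, 3.0),
  X2002   = c(NA, NA, 2.0),
  stringsAsFactors = FALSE
)

# Drop columns in which every value is NA
toy <- toy[, colSums(is.na(toy)) < nrow(toy)]

# Count the remaining year columns, then drop rows missing all of them
n_years <- ncol(toy) - 1      # one identifier column in this toy example
toy <- toy[rowSums(is.na(toy)) < n_years, ]

toy
```

Computing the threshold from ncol() rather than hard-coding it is one way to keep the filter working if a later download contains a different number of years. With the real data, which has two identifier columns, the equivalent would presumably be ncol(d) - 2.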

Next, gather the year columns into a single column. Because it essentially turns columns into rows, the gather() function usually trips people up, so I’ve included the head of both data frames, pre- and post-gather, for clarity.

dt <- gather(d, "year", "value", X1977:X2016)
head(d[,1:8], 10)
##    country_long country_short X1977 X1979 X1980 X1981 X1982 X1983
## 3        Angola           AGO    NA    NA    NA    NA    NA    NA
## 4       Albania           ALB    NA    NA    NA    NA    NA    NA
## 8     Argentina           ARG    NA    NA   0.2    NA    NA    NA
## 9       Armenia           ARM    NA    NA    NA    NA    NA    NA
## 12    Australia           AUS    NA    NA    NA     1    NA    NA
## 13      Austria           AUT    NA    NA    NA    NA    NA    NA
## 14   Azerbaijan           AZE    NA    NA    NA    NA    NA    NA
## 15      Burundi           BDI    NA    NA    NA    NA    NA    NA
## 16      Belgium           BEL    NA    NA    NA    NA    NA    NA
## 17        Benin           BEN    NA    NA    NA    NA    NA    NA
head(dt, 10)
##    country_long country_short  year value
## 1        Angola           AGO X1977    NA
## 2       Albania           ALB X1977    NA
## 3     Argentina           ARG X1977    NA
## 4       Armenia           ARM X1977    NA
## 5     Australia           AUS X1977    NA
## 6       Austria           AUT X1977    NA
## 7    Azerbaijan           AZE X1977    NA
## 8       Burundi           BDI X1977    NA
## 9       Belgium           BEL X1977    NA
## 10        Benin           BEN X1977    NA
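
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(), which does the same reshaping with more explicit argument names. A minimal sketch on a made-up data frame (the `wide` object and its values are invented for illustration):

```r
library(tidyr)

# Toy wide data: made-up values, one column per year
wide <- data.frame(
  country_short = c("AAA", "BBB"),
  X1977 = c(NA, 2.5),
  X1979 = c(10.1, NA),
  stringsAsFactors = FALSE
)

# Each year column becomes a row; names go to "year", cells go to "value"
long <- pivot_longer(wide, cols = X1977:X1979,
                     names_to = "year", values_to = "value")

long
```

Note that pivot_longer() keeps each country’s rows together, whereas gather() stacks one year column after another, so the head of the result will look different from the output above. Adding values_drop_na = TRUE would also drop the missing values in the same step.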

After this, I remove any row with a missing value for our poverty variable. The last preparation step is removing the “X” character that precedes each year and converting the result to a Date class. R adds this character when reading the file because column names that begin with a digit are not syntactically valid.

dt <- filter(dt, !is.na(value))
dt$year <- as.Date(gsub("X", "", dt$year), format = "%Y")
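
One caveat: with format = "%Y", as.Date() has only a year to work with, and R fills in the missing month and day from the current date, so the exact dates depend on when the script is run. If that matters, here is a sketch that pins each year to January 1 (the `yrs` vector is a made-up example, not the real column):

```r
# Made-up year labels in the same "X1977" form as the column names
yrs <- c("X1977", "X1990", "X2016")

# Strip the leading "X", then build a full ISO date so month and day are fixed
dates <- as.Date(paste0(sub("^X", "", yrs), "-01-01"))

format(dates, "%Y-%m-%d")
```

For a plot at yearly resolution the difference is cosmetic, but fixed dates make the output reproducible from one run to the next.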

A quick plot shows some initial information, but we’re not quite done yet. There are too many countries to interpret and aside from hand-coding, there are no easy ways to sort this out.

ggplot(dt, aes(year, value)) + 
  geom_line(aes(group = country_short))


Merging Data

Luckily, we can use the accompanying metadata file to append region codes to our data. By adding a category of geographic regions (i.e. “North Africa”, “East Asia”) we can start to see larger patterns of geographically grouped countries.

m <- read.csv("/Users/johnbernau/Box Sync/1. Desktop/6. Methods/1. R/worldbankpoverty/Metadata.Country.csv", stringsAsFactors = FALSE)

To join, the key variable must have the same name in both data frames. After renaming the country code variable in the metadata, I join it with the poverty data. Lastly, we’ll remove any missing Region values and take out North America.

m <- rename(m, country_short = Country.Code)
j <- inner_join(dt, m, by = "country_short")
j$Region[j$Region == ""] <- NA
j <- filter(j, !is.na(Region) & Region != "North America")
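
Because inner_join() silently drops rows that have no match, it can be worth checking what would be lost before joining. A minimal sketch with made-up tables (`obs` and `meta` are invented stand-ins for dt and m, and the country codes are fictional):

```r
library(dplyr)

# Made-up stand-ins for the poverty data and the country metadata
obs  <- data.frame(country_short = c("AAA", "BBB", "ZZZ"),
                   value = c(1.2, 3.4, 5.6),
                   stringsAsFactors = FALSE)
meta <- data.frame(country_short = c("AAA", "BBB"),
                   Region = c("East Asia & Pacific", ""),
                   stringsAsFactors = FALSE)

# Rows of obs whose code has no partner in meta
unmatched <- anti_join(obs, meta, by = "country_short")

# Join on the shared key, then treat blank regions as missing and drop them
joined <- inner_join(obs, meta, by = "country_short") %>%
  mutate(Region = na_if(Region, "")) %>%
  filter(!is.na(Region))
```

Here `unmatched` holds the one row that the join would discard, and `joined` keeps only the country with a non-blank region.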

Plotting Data

Our data is finally ready to plot. Each region gets its own facet, and values are colored from black (low) to red (high) for dramatic effect. Make sure to specify the group aesthetic when working with multiple geom_lines. This tells R which observations to connect with a line.

ggplot(j, aes(year, value)) + 
  geom_line(aes(group = country_short, color = value)) +
  scale_color_continuous(low = "black", high = "red", guide = "none") +
  facet_wrap(~Region) +
  labs(title = "Percent of Population Living Under $1.90 / Day",
       x = NULL, y = "% Under Poverty Line",
       caption = "\nData: World Bank Extreme Poverty. Adjusted to 2011 PPP.\nJohn A. Bernau 2018 // www.johnabernau.com") +
  theme(text=element_text(size = 14, family="Times New Roman"),
        axis.text = element_text(size = 14),
        panel.background = element_rect(fill="white"),
        panel.grid.minor = element_line(color="grey90"),
        panel.grid.major = element_line(color="grey90"),
        plot.margin = unit(c(0.5,0.5,0.5,0.5), "cm"),
        plot.caption = element_text(size = 9))

ggsave("~/Desktop/wb_poverty.jpg", width = 10, height = 7, units = "in")
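
To make the role of the group aesthetic concrete, here is a stripped-down sketch on made-up data (the `toy` data frame and its values are invented for illustration). With group = country_short, ggplot2 draws one line per country rather than one line through every point.

```r
library(ggplot2)

# Made-up data: two fictional countries, three years each
toy <- data.frame(
  year          = rep(c(2000, 2005, 2010), times = 2),
  value         = c(40, 30, 20, 10, 8, 5),
  country_short = rep(c("AAA", "BBB"), each = 3),
  stringsAsFactors = FALSE
)

# One line per country; dropping the group aesthetic would connect all six points
p <- ggplot(toy, aes(year, value)) +
  geom_line(aes(group = country_short))

# Number of separate lines ggplot2 will draw
n_lines <- length(unique(ggplot_build(p)$data[[1]]$group))
```

Printing `p` draws the plot. `n_lines` simply confirms how many separate lines the grouping produces.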

We can see that the percentage of people living in extreme poverty is generally declining across most countries, a hopeful conclusion shared by the World Bank’s reporting and Hans Rosling’s book. However, a substantial portion of Sub-Saharan Africa remains in a state of extreme poverty. It’s easy to disagree about effective solutions, but there is no denying the existence of a pressing moral problem.

Link to World Bank blog:


Appendix
A less verbose coder could accomplish the preparation in a mere 10 lines of code:

d <- read.csv(file.choose(), skip = 4, stringsAsFactors = FALSE)
d <- select(d, -Indicator.Name, -Indicator.Code, -X, -(X1960:X1976))
d <- rename(d, country_long = Country.Name, country_short = Country.Code)
dt <- gather(d, "year", "value", X1977:X2017) %>% filter(!is.na(value))
dt$year <- as.Date(gsub("X", "", dt$year), format = "%Y")

m <- read.csv(file.choose(), stringsAsFactors = FALSE)
m <- rename(m, country_short = Country.Code)
j <- inner_join(dt, m)
j$Region[j$Region == ""] <- NA
j <- filter(j, !is.na(Region) & Region != "North America")
