DataViz in R: Part 1

These materials come from an interactive workshop I lead at the Emory Center for Digital Scholarship.

Part 1 includes:
1. Orientation to RStudio
2. Basics of ggplot
3. Changing size, shape, and alpha (opacity)
4. Three ways of specifying color
5. Putting it all together
6. Exercises 1 and 2

### 1. Orientation to RStudio

• If you are new to R or RStudio, there are a lot of introductory courses and materials online. DataCamp.com and Lynda.com offer professionally curated content, while YouTube and other sites will cover just about everything if you look hard enough.

• In short, RStudio provides a clean and intuitive interface for working in R. The opening dashboard contains four panes. Open a new R script using the top left-most button. The script opens in the top left pane and is a saveable version of your R code. When you run commands, they appear in the R console, the bottom left pane. In the top right pane you’ll find an overview of your workspace environment and history. In the bottom right you’ll find many helpful things: your plots, your packages, and the help menu. More information available here.

• For now, here are some of my favorite shortcuts to get you started. (The keyboard shortcuts shown here are for macOS.)

Preferences > Code > “Soft Wrap R source files”
…This will make your R script easier to read by wrapping the text to fit your window.

Preferences > Appearance
…This allows you to customize the font / size / color of your R scripts and console.

Place a # before any line to make notes to yourself
…R will ignore anything that comes after the # sign. Take extensive notes in your code to remember what you did!

Command + Enter = Run
…Place your cursor on the line you want to run and press “command” and “enter / return” to run this line of code.

Option + “-” = <-
…To create the assignment operator (<-), use the ‘option’ + “-” keys.

Control + L = clear R console
…This is helpful if you want a fresh console to work in.

Clear workspace: rm(list=ls())
…This removes every object (variables, datasets, functions, etc.) from your workspace. It doesn’t clear the console, unload packages, or change your R script.

See Help > Keyboard Shortcuts for more

In R you can save values to named objects (variables). Below, we’ve saved 5 to the variable x using the <- assignment operator. We’ve also saved the integers from 1 to 1,000 as a vector in the variable y; the c() function combines or ‘concatenates’ values into a vector, although 1:1000 already creates one on its own. Lastly, we’ve drawn 1,000 random values from a normal distribution and saved them to the variable z. This is done with the rnorm(n, mean, sd) command.

x <- 5
y <- c(1:1000)
z <- rnorm(1000, 500, 100)
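To confirm what each of these objects holds, a quick check like the one below can help. This is a minimal sketch: set.seed(42) is my own addition so the random draws are repeatable, and the seed value is arbitrary.

```r
set.seed(42)                 # arbitrary seed (an assumption, not part of the workshop code)

x <- 5
y <- c(1:1000)
z <- rnorm(1000, 500, 100)

length(y)   # number of elements in y
head(y)     # first six values of y
class(y)    # storage type of y
mean(z)     # should be close to 500
sd(z)       # should be close to 100
```

Because z is random, the mean and standard deviation will land near, but not exactly on, 500 and 100.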

We can then use these variables to run analyses. Below we’ve produced a simple histogram of our z variable using the hist() command.

hist(z)

Nothing fancy, but we’ll get into more features later. We can also make a scatter plot of our two variables y and z. Given two numeric vectors, the plot() command draws a scatterplot by default, but we can request points explicitly with the type="p" option: plot(y, z, type="p"). Placing a ? before a command pulls up the R help files: ?plot

plot(y,z)

Similarly bare-bones, but a quick way to visualize on the fly!
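Both base commands accept a few optional arguments for titles, axis labels, and colors. The sketch below shows one way to dress them up; the titles, labels, color names, and the seed are placeholders of my own, so swap in whatever suits your data.

```r
set.seed(42)                 # arbitrary seed so the example is repeatable
y <- 1:1000
z <- rnorm(1000, 500, 100)

# Histogram with more bins, a title, an axis label, and a fill color
h <- hist(z, breaks = 30, main = "Histogram of z", xlab = "z", col = "steelblue")

# Scatterplot drawn explicitly as points, using solid circles (pch = 16)
plot(y, z, type = "p", pch = 16, col = "gray40",
     main = "y vs. z", xlab = "y", ylab = "z")
```

Saving the result of hist() to h is optional; it keeps the bin counts around in case you want to inspect them.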

One of the biggest advantages of R is the extensive package library. These free installations dramatically widen R’s functionality. Check out some useful packages here.

Let’s install the psych and ggplot2 packages. Once they are installed on your computer, you will need to “require” them at the beginning of each new R session. For this reason, I recommend putting a block of code at the beginning of your script with all the packages required. Subsequent lessons will include a package install header at the beginning. (You may see other tutorials using library() in place of require(). For our purposes they are equivalent; the main difference is that library() stops with an error if a package is missing, while require() only gives a warning and returns FALSE.)

# install.packages("psych")
require("psych")

# install.packages("ggplot2")
require("ggplot2")
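If you share a script with others, it can be convenient to install a package only when it is missing. The following is one possible pattern, not the only one; the pkgs vector name and the choice of CRAN mirror are my own assumptions.

```r
pkgs <- c("psych", "ggplot2")   # packages this lesson relies on

for (p in pkgs) {
  # requireNamespace() returns FALSE (instead of stopping) if the package is not installed
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p, repos = "https://cloud.r-project.org")
  }
  # character.only = TRUE tells library() that p holds the package name as a string
  library(p, character.only = TRUE)
}
```

This does the same job as the commented-out install lines plus require(), but it skips the download when the package is already on your computer.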

Load the diamonds dataset that comes with ggplot2: data(diamonds).
The data() command is specific to built-in datasets. View the dataset by typing diamonds. This prints the first ten rows. (Also see the Environment pane in the upper right.)
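Besides typing the dataset’s name, a few base R commands give a quick overview of its size and structure. A minimal sketch, assuming ggplot2 is already installed:

```r
library(ggplot2)
data(diamonds)

dim(diamonds)       # 53940 rows and 10 columns
head(diamonds, 5)   # first five rows
str(diamonds)       # type of each column
```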

From the psych package, the describe() command gives a summary of your dataset:

describe(diamonds) 
##          vars     n    mean      sd  median trimmed     mad   min      max
## carat       1 53940    0.80    0.47    0.70    0.73    0.47   0.2     5.01
## cut*        2 53940    3.90    1.12    4.00    4.04    1.48   1.0     5.00
## color*      3 53940    3.59    1.70    4.00    3.55    1.48   1.0     7.00
## clarity*    4 53940    4.05    1.65    4.00    3.91    1.48   1.0     8.00
## depth       5 53940   61.75    1.43   61.80   61.78    1.04  43.0    79.00
## table       6 53940   57.46    2.23   57.00   57.32    1.48  43.0    95.00
## price       7 53940 3932.80 3989.44 2401.00 3158.99 2475.94 326.0 18823.00
## x           8 53940    5.73    1.12    5.70    5.66    1.38   0.0    10.74
## y           9 53940    5.73    1.14    5.71    5.66    1.36   0.0    58.90
## z          10 53940    3.54    0.71    3.53    3.49    0.85   0.0    31.80
##             range  skew kurtosis    se
## carat        4.81  1.12     1.26  0.00
## cut*         4.00 -0.72    -0.40  0.00
## color*       6.00  0.19    -0.87  0.01
## clarity*     7.00  0.55    -0.39  0.01
## depth       36.00 -0.08     5.74  0.01
## table       52.00  0.80     2.80  0.01
## price    18497.00  1.62     2.18 17.18
## x           10.74  0.38    -0.62  0.00
## y           58.90  2.43    91.20  0.00
## z           31.80  1.52    47.08  0.00

The variables marked with an asterisk (*) are categorical (non-numeric) variables that describe() has converted to numeric codes, so be careful when interpreting their summary statistics.
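To see why those variables are flagged, it helps to look at how one of them is stored. The sketch below inspects cut; the same checks should work for color and clarity.

```r
library(ggplot2)

class(diamonds$cut)    # "ordered" "factor"
levels(diamonds$cut)   # "Fair" "Good" "Very Good" "Premium" "Ideal"
table(diamonds$cut)    # number of diamonds in each category
```

For categorical variables like these, a table of counts is usually more informative than a mean or standard deviation.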

### 2. Grammar of Graphics: ggplot2

Under Hadley Wickham’s philosophy, all visualizations can be broken down into three components: data, a coordinate system, and geoms (for ‘geometric objects’). If you want to read more, check out this excellent article where he explains this ‘grammar of graphics’ and the development of the ggplot2 package. More importantly, save this excellent cheat sheet as a resource for making plots with ggplot2.

As a layered approach, ggplot2 works by building components step by step. To understand this logic, let’s look at each step incrementally before combining them all. This first command calls the ‘diamonds’ dataset. We can run it, but it won’t look like much yet. Importantly, though, it doesn’t produce an error message: the result is simply an empty plot.

ggplot(diamonds)

Once we specify the variables we want to plot, R can fill out our coordinate system (or axes). Use the aes() function to specify your graph’s aesthetics, in this case the variables on the x and y axes.

ggplot(diamonds, aes(x=carat, y=price))

Finally, we add the geometric object: the visualization itself! In this case, points. Notice how this command is separate from our initial command and joined with a + sign.

ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
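The same three steps can also be written one at a time by saving the plot to an object and adding to it. This is just a sketch to make the layering visible; the object name p is my own choice.

```r
library(ggplot2)

p <- ggplot(diamonds)                # step 1: data only, an empty plot
p <- p + aes(x = carat, y = price)   # step 2: map variables to the axes
p <- p + geom_point()                # step 3: add the points layer
p                                    # display the finished plot
```

You can type p after any of the steps to see what the plot looks like at that stage.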

We can add labels with the labs() command:

ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + labs(x="Carat", y="Price", title="Figure 1: Diamond Prices")

Our code is already spilling onto multiple lines. To keep things neat and readable, break your code across multiple lines. You can break after any layer, as long as each unfinished line ends with “+”. This signals that there is more code to read on the next line.

Additionally, we can save this code to an object, graph1 for example, so we don’t have to keep retyping everything. Once saved, we simply type graph1 and R will display our graph.

graph1 <- ggplot(diamonds, aes(x=carat, y=price)) +
  geom_point() +
  labs(x="Carat", y="Price", title="Figure 1: Diamond Prices")
graph1

Once saved as an object, you can add layers and other features to this object. This makes editing less cumbersome. Try adding a few of these graph themes to our plot.

graph1 + theme_bw()
graph1 + theme_gray() # This is default
graph1 + theme_dark()
graph1 + theme_classic()
graph1 + theme_light()
graph1 + theme_linedraw()
graph1 + theme_minimal()
graph1 + theme_void()
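Themes can be adjusted as well as swapped. The sketch below rebuilds graph1 so it runs on its own, then enlarges the base font and bolds the title; graph2 and the specific values are just examples I picked.

```r
library(ggplot2)

graph1 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  labs(x = "Carat", y = "Price", title = "Figure 1: Diamond Prices")

graph2 <- graph1 +
  theme_minimal(base_size = 14) +                   # larger base font size
  theme(plot.title = element_text(face = "bold"))   # bold title
graph2
```

Put theme() after the complete theme (here theme_minimal()), since a complete theme added later would overwrite your tweaks.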

### 3. Changing size, shape, alpha

Within the geom_point() command there are a lot of options for specifying your graph. Here are a couple.

In this case, the size of the points is set to 7, much larger than the default of 1.5. Try plugging in values like 0.01, 0.5, 3, 10, etc.

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(size=7)

You can also specify the shape of your points. Try searching online for “R plot symbols” to find these codes. In this case I’ve chosen 8, an asterisk symbol.

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(shape=8)
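If you would rather not search online, you can draw your own reference chart of the shape codes. The sketch below is one way to do it; shape_codes and shape_chart are hypothetical names, and the fill color is arbitrary.

```r
library(ggplot2)

shape_codes <- data.frame(code = 0:25)   # hypothetical helper data frame, one row per code

shape_chart <- ggplot(shape_codes, aes(x = 0, y = 0, shape = code)) +
  geom_point(size = 5, fill = "red4") +  # fill only affects shapes 21 to 25
  scale_shape_identity() +               # use the numbers themselves as shape codes
  facet_wrap(~code) +                    # one small panel per code
  theme_void()
shape_chart
```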

With a large scatterplot like ours, it’s helpful to change the alpha, or opacity, of our points. Semi-transparent points let you see where observations pile up, which helps with overplotting in large datasets. Alpha can range from 0 (fully transparent) to 1 (fully opaque).

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(alpha=0.2)

### 4. Changing Color: Three ways

There are at least three ways to specify color in R. First, you can use simple color names. R has an extensive color library and most of them are intuitively named. Try searching for “colors in R” online.

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(color="red4")
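You can also browse the color names without leaving R. A small sketch using the built-in colors() function:

```r
length(colors())                      # how many color names R knows
head(colors(), 10)                    # first ten names
grep("red", colors(), value = TRUE)   # every name containing "red"
```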

Second, you can use RGB codes. These specify how much red, green, and blue light is mixed to create your color. By default rgb() expects proportions between 0 and 1; if your values are on another scale, set the maximum with maxColorValue (usually 255). These codes can be found online for certain brands or sports teams. The second example below uses RGB codes found online for “Facebook blue.”

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(color=rgb(.5,.7,.6))

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(color=rgb(67,96,156, maxColorValue=255))

Lastly, you can use hexadecimal codes. These are RGB codes written in base 16: two digits each for red, green, and blue, following a # sign. They are also easy to look up online if you want a specific color. This example uses the code found online for “Minnesota Vikings purple”.

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(color="#582C81")
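The three ways of naming a color are interchangeable, and base R can convert between them. The sketch below assumes RGB values on the common 0 to 255 scale.

```r
rgb(67, 96, 156, maxColorValue = 255)   # "#43609C"
col2rgb("#582C81")                      # red 88, green 44, blue 129
col2rgb("red4")                         # red 139, green 0, blue 0
```

rgb() turns RGB values into a hex code, while col2rgb() goes the other way and accepts either a hex code or a color name.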

### 5. Putting it all together

Using all the commands we’ve learned so far, the following graph has specifications for color, size, shape, and alpha.

ggplot(diamonds, aes(x=carat, y=price)) +
geom_point(color="#582C81", size=4, shape=9, alpha=0.2)

### EXERCISE 1

Use the “mpg” dataset (built into ggplot2) to make a scatterplot of “city mpg” (cty) and “highway mpg” (hwy), two quantitative variables in the dataset. Change axes labels and give your plot a title. Change the points to dark blue triangles and change size to 6 and opacity to 0.3. Save this plot as an object “ex1”. Finally, export as a jpg using the ggsave() command. (Google this for more info).

For help:

?mpg
describe(mpg)
?geom_point
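As a starting point for the export step, here is a hedged sketch of ggsave() applied to the diamonds plot rather than the exercise plot. The file name, dimensions, and resolution are placeholders I chose; saving into tempdir() just keeps the example from cluttering your working directory.

```r
library(ggplot2)

graph1 <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

out_file <- file.path(tempdir(), "diamonds.jpg")   # hypothetical file name
ggsave(out_file, plot = graph1, width = 6, height = 4, units = "in", dpi = 150)
file.exists(out_file)                              # check that the file was written
```

ggsave() picks the file type from the extension, so changing ".jpg" to ".png" or ".pdf" changes the format.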

When you are done…
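Why are some triangles darker than others? Many cars share the same values, so their semi-transparent points stack on top of each other. We can add a small amount of random noise to the plotted positions using the ‘jitter’ option to reveal overlapping data points: geom_point(…, position="jitter", …)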
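To see jitter at work without giving away the exercise, here is a small sketch on made-up data. The toy data frame, the jitter widths, and the seed are all my own assumptions.

```r
library(ggplot2)

# Hypothetical toy data: each position appears three times
toy <- data.frame(a = rep(1:3, each = 3), b = rep(1:3, each = 3))

# Without jitter, the nine points stack into three darker ones
stacked <- ggplot(toy, aes(x = a, y = b)) +
  geom_point(alpha = 0.3, size = 6)
stacked

# With jitter, each point is nudged slightly so the stack spreads out
jittered <- ggplot(toy, aes(x = a, y = b)) +
  geom_point(alpha = 0.3, size = 6,
             position = position_jitter(width = 0.1, height = 0.1, seed = 1))
jittered
```

position_jitter() lets you control how far the points move, whereas position="jitter" uses default amounts.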

### EXERCISE 2

Make a new scatterplot using the same variables, axes labels, and title. This time, use the jitter option described above. Additionally, make these filled triangles with size of 6 and opacity of 0.5. Lastly, make these triangles the same color red as the “Atlanta Falcons”. (Search online for the hex code or RGB value). Save this plot as an object “ex2”. Export as a jpg.