There are many great resources for learning text analysis in R, including quanteda’s extensive documentation and Silge & Robinson’s Text Mining with R. In this post, I try to keep everything in a familiar format, favoring data frames over “corpus” and “data-feature-matrix” objects. These are undoubtedly necessary for more advanced operations, but tend to confuse those looking to do basic tasks. While I do not cover importing text data, I suggest looking at the readtext package if your data is not in R yet. By the end of this post, you will be able to quickly clean and summarize text in multiple formats.
Raw Text Objects
To start, a block of text…
moby <- "Call me Ishmael. Some years ago —nevermind how long precisely— having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."
The first thing to do is convert everything to lowercase and remove punctuation, numbers, and problematic whitespaces. A few regular expressions make this quite simple.
gsub() is the “find and replace” of R: the first argument is what to look for, the second argument is what to replace it with, and the third argument is where to look. Take a look at the regular expression / stringr cheat sheets here.
moby <- tolower(moby) # Lowercase moby <- gsub("[[:punct:]]", "", moby) # Remove punctuation moby <- gsub("[[:digit:]]", "", moby) # Remove numbers moby <- gsub("\\s+", " ", str_trim(moby)) # Remove extra whitespaces
Next we’ll split our text block into individual words and calculate a few summary measures: total words, unique words, and “lexical diversity.”
vocab <- unlist(str_split(moby, " ")) # Split into vocab list total_words <- length(vocab) # Total words unique_words <- length(unique(vocab)) # Unique Words lex_div <- unique_words / total_words # Lexical Diversity
With our vocab list handy, we can filter out uninformative stopwords (“the”, “and”, “of”). In this example I’m using the full
stop_words list from the
tidytext package. Using the full list might be a bit liberal as it contains three separate stopword dictionaries. Be careful about these decisions!
vocab_nsw <- vocab[!(vocab) %in% stop_words$word] # Only keep non-stopwords moby_clean <- paste(vocab_nsw, collapse = " ") # Gather back into one string print(moby_clean) # Examining our cleaned text block
##  "call ishmael ago nevermind precisely money purse shore sail watery world driving spleen regulating circulation growing grim mouth damp drizzly november soul involuntarily pausing coffin warehouses bringing rear funeral meet hypos upper hand requires strong moral principle prevent deliberately stepping street methodically knocking peoples hats account time sea substitute pistol ball philosophical flourish cato throws sword quietly ship surprising degree time cherish feelings ocean"
Basic Sentiment Scores
Using pre-defined dictionaries of “sentiment-specific” words, we can quickly analyze a text to get a rough idea of sentiment.
# Load negative words from "bing" sentiment dictionary negative <- get_sentiments("bing") %>% filter(sentiment == "negative") # Only keep negative words vocab_neg <- vocab[(vocab) %in% negative$word] print(vocab_neg)
##  "grim" "involuntarily"
# Load positive words from "bing" sentiment dictionary positive <- get_sentiments("bing") %>% filter(sentiment == "positive") # Only keep positive words vocab_pos <- vocab[(vocab) %in% positive$word] print(vocab_pos)
##  "precisely" "strong" "flourish" "cherish"
Lastly, we can use these results to calculate how many “emotion” words were found in the text and the proportion of these words that were “positive.”
emotion_words <- length(vocab_neg) + length(vocab_pos) proportion_pos <- length(vocab_pos) / emotion_words
Based on this sentiment scoring, the opening paragraph of Moby Dick is 2/3 positive and 1/3 negative. Does this sound accurate as a human reader of the text? Ishmael tells us of his misanthropic fantasies, of the “damp drizzly November in my soul”, and how going to sea is his only option to stave off suicide: “my substitute for pistol and ball.” This type of dictionary-based text analysis is ultimately limited to an atomistic model of language that fails to pick up on fairly basic literary techniques. Recent developments like this one at Stanford have used neural networks to account for more complex sentence structures.
To sum up, after starting with a raw string of text, I removed unwanted characters, calculated a few vocabulary metrics, ran a dictionary-based sentiment analysis, and summarized the results. The following code saves this information as a dataframe.
moby_df <- data.frame(moby, total_words, unique_words, lex_div, moby_clean, vocab_neg = paste(vocab_neg, collapse = " "), vocab_pos = paste(vocab_pos, collapse = " "), emotion_words, proportion_pos, stringsAsFactors = F) glimpse(moby_df)
## Observations: 1 ## Variables: 9 ## $ moby <chr> "call me ishmael some years ago nevermind how l... ## $ total_words <int> 200 ## $ unique_words <int> 133 ## $ lex_div <dbl> 0.665 ## $ moby_clean <chr> "call ishmael ago nevermind precisely money pur... ## $ vocab_neg <chr> "grim involuntarily" ## $ vocab_pos <chr> "precisely strong flourish cherish" ## $ emotion_words <int> 6 ## $ proportion_pos <dbl> 0.6666667
Text within Data Frames
To start, a dataframe with a column for text data…
# Generate some respondent ids and three (identical) text responses id <- c("person1", "person2", "person3") text <- rep("Example text with UPPER CASE, punctuation (!), and numbers (123456789). This should serve us well in our text-cleaning journey. Also, here are some sad negative words and some happy positive words.", 3) # Save as data frame df_original <- data.frame(id, text, stringsAsFactors = FALSE) df <- df_original # Save original dataset and make workhorse "df" glimpse(df)
## Observations: 3 ## Variables: 2 ## $ id <chr> "person1", "person2", "person3" ## $ text <chr> "Example text with UPPER CASE, punctuation (!), and numbe...
df$text <- tolower(df$text) # Lowercase df$text <- gsub("[[:punct:]]", "", df$text) # Remove punctuation df$text <- gsub("[[:digit:]]", "", df$text) # Remove numbers df$text <- gsub("\\s+", " ", str_trim(df$text)) # Remove extra whitespaces df$text # Print
##  "example text with upper case punctuation and numbers this should serve us well in our textcleaning journey also here are some sad negative words and some happy positive words"
Note that by removing punctuation, “textcleaning” was collapsed into one word. This is usually acceptable for most hyphenated words, but to preserve the spacing, just replace all punctuation with a space (" “) and the subsequent line will strip / collapse any extra spaces that result.
Within a dataframe, it is generally easiest to expand the text to one-word-per-row using
vocab <- df %>% unnest_tokens(word, text) head(vocab)
## id word ## 1 person1 example ## 1.1 person1 text ## 1.2 person1 with ## 1.3 person1 upper ## 1.4 person1 case ## 1.5 person1 punctuation
Creating a separate dataframe that captures a few summary measures (total, unique, and lexical diversity). This will come in handy later.
total_unique_ld <- vocab %>% group_by(id) %>% count(word) %>% summarize(total_words = sum(n), unique_words = n(), lex_div = unique_words / total_words) %>% ungroup()
vocab_nsw <- vocab %>% anti_join(stop_words)
Saving this tidy dataframe as new strings by collapsing words (thanks to stackoverflow). This will come in handy later.
cleantext <- nest(vocab_nsw, word) %>% mutate(text = map(data, unlist), text_clean = map(text, paste, collapse = " ")) %>% select(-data, -text)
Basic Sentiment Scores
To get the dictionary-based sentiment results, I first run an
inner-join which retains only the “emotion words” in the
bing dictionary. Then, I generate two things: (1) a string of the positive / negative words that were extracted, and (2) a count of the total emotion words extracted and the proportion positive.
vocab_sent <- vocab_nsw %>% inner_join(get_sentiments("bing")) # 1. Collapse words into positive / negative strings sent_strings <- nest(vocab_sent, word) %>% mutate(text = map(data, unlist), text_clean = map(text, paste, collapse = " ")) %>% select(-data, -text) %>% spread(sentiment, text_clean) %>% rename(vocab_neg = negative, vocab_pos = positive) # 2. Save sentiment summary measures vocab_summary <- vocab_sent %>% group_by(id) %>% count(sentiment) %>% spread(sentiment, n) %>% mutate(emotion_words = negative + positive, positive_prop = positive / emotion_words)
Joining all these ingredients together, our work is complete! Starting with two variables, we have added ten new variables: total words, unique words, lexical diversity, cleaned text, list of negative / positive words, number of negative / positive words, total emotion words extracted, and a proportion of positive words.
df_final <- df_original %>% inner_join(total_unique_ld) %>% inner_join(cleantext) %>% inner_join(sent_strings) %>% inner_join(vocab_summary) glimpse(df_final)
## Observations: 3 ## Variables: 12 ## $ id <chr> "person1", "person2", "person3" ## $ text <chr> "Example text with UPPER CASE, punctuation (!), ... ## $ total_words <int> 29, 29, 29 ## $ unique_words <int> 26, 26, 26 ## $ lex_div <dbl> 0.8965517, 0.8965517, 0.8965517 ## $ text_clean <list> ["text upper punctuation serve textcleaning jou... ## $ vocab_neg <list> ["sad negative", "sad negative", "sad negative"] ## $ vocab_pos <list> ["happy positive", "happy positive", "happy pos... ## $ negative <int> 2, 2, 2 ## $ positive <int> 2, 2, 2 ## $ emotion_words <int> 4, 4, 4 ## $ positive_prop <dbl> 0.5, 0.5, 0.5
Copyright © 2018 John A. Bernau