The Life-Changing Magic of Tidying Text
Apr 29, 2016

When I went to the rOpenSci unconference about a month ago, I started work with Dave Robinson on a package for text mining using tidy data principles. What is this tidy data you keep hearing so much about? As described by Hadley Wickham, tidy data has a specific structure:

  • each variable is a column
  • each observation is a row
  • each type of observational unit is a table

This means we end up with a data set that is in a long, skinny format instead of a wide format. Tidy data sets are easier to work with, and this is no less true when one starts to work with text. Most of the tooling and infrastructure needed for text mining with tidy data frames already exists in packages like dplyr, broom, tidyr, and ggplot2. Our goal in writing the tidytext package is to provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. We got a great start on our work when we were at the unconference and recently finished getting the first release ready; this week, tidytext was released on CRAN!
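
To make the long-versus-wide distinction concrete, here is a minimal sketch using a made-up table of word counts (the documents, words, and numbers are invented for illustration). It assumes a version of tidyr recent enough to provide pivot_longer.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per document, one column per word
wide_counts <- tibble(
  document = c("doc1", "doc2"),
  sense = c(3, 0),
  pride = c(0, 5)
)

# Long, tidy form: one row per document-word observation
tidy_counts <- pivot_longer(wide_counts, cols = -document,
                            names_to = "word", values_to = "n")
tidy_counts
```

In the long form, the word itself becomes a variable in its own column, which is what lets joins and counts work on it directly.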

Jane Austen’s Novels Can Be So Tidy

One of the important functions we knew we needed was something to unnest text by some chosen token. In my previous blog posts, I did this with a for loop first and then with a function that involved several dplyr bits. In the tidytext package, the function unnest_tokens has this functionality; it converts a dataframe with a text column into a one-token-per-row format. Let’s look at an example using Jane Austen’s novels.

(What?! Surely you’re not tired of them yet?)

The janeaustenr package has a function austen_books that returns a tidy dataframe of all of the novels. Let’s use that, annotate a linenumber quantity to keep track of lines in the original format, use a regex to find where all the chapters are, and then unnest_tokens.
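
The chapter-finding trick may be easier to see on a tiny example first. This is a sketch on a handful of invented lines, not the novels themselves: str_detect flags the lines that look like chapter headings, and cumsum turns those flags into a running chapter number.

```r
library(stringr)

# Invented lines standing in for the start of a book
toy_lines <- c("A NOVEL", "", "CHAPTER 1", "First line of text.",
               "Chapter II", "Second line of text.")

# TRUE wherever a line starts with "chapter" followed by a digit or roman numeral
is_heading <- str_detect(toy_lines,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Each TRUE bumps the running count, so every line inherits its chapter number
chapter <- cumsum(is_heading)
chapter
```

Lines before the first heading get chapter 0, which is why the front matter shows up as chapter 0 in the real data.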

library(janeaustenr)
library(tidytext)
library(dplyr)
library(stringr)

original_books <- austen_books() %>%
        group_by(book) %>%
        mutate(linenumber = row_number(),
               chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
        ungroup()

original_books
## Source: local data frame [70,942 x 4]
## 
##                     text                book linenumber chapter
##                    (chr)              (fctr)      (int)   (int)
## 1  SENSE AND SENSIBILITY Sense & Sensibility          1       0
## 2                        Sense & Sensibility          2       0
## 3         by Jane Austen Sense & Sensibility          3       0
## 4                        Sense & Sensibility          4       0
## 5                 (1811) Sense & Sensibility          5       0
## 6                        Sense & Sensibility          6       0
## 7                        Sense & Sensibility          7       0
## 8                        Sense & Sensibility          8       0
## 9                        Sense & Sensibility          9       0
## 10             CHAPTER 1 Sense & Sensibility         10       1
## ..                   ...                 ...        ...     ...

tidy_books <- original_books %>%
        unnest_tokens(word, text)

tidy_books
## Source: local data frame [724,971 x 4]
## 
##                   book linenumber chapter        word
##                 (fctr)      (int)   (int)       (chr)
## 1  Sense & Sensibility          1       0       sense
## 2  Sense & Sensibility          1       0         and
## 3  Sense & Sensibility          1       0 sensibility
## 4  Sense & Sensibility          3       0          by
## 5  Sense & Sensibility          3       0        jane
## 6  Sense & Sensibility          3       0      austen
## 7  Sense & Sensibility          5       0        1811
## 8  Sense & Sensibility         10       1     chapter
## 9  Sense & Sensibility         10       1           1
## 10 Sense & Sensibility         13       1         the
## ..                 ...        ...     ...         ...

This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, sentences, lines, paragraphs, or separation around a regex pattern.
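
As a small sketch of how the token argument changes the result, here are two tokenizations of the same made-up two-line data frame (toy_text and its column names are invented for illustration).

```r
library(dplyr)
library(tidytext)

# Invented two-line data frame
toy_text <- tibble(line = 1:2,
                   text = c("It is a truth", "universally acknowledged."))

# Default: one lowercased word per row, punctuation stripped
toy_words <- toy_text %>%
  unnest_tokens(word, text)

# Same input, tokenized by character instead
toy_letters <- toy_text %>%
  unnest_tokens(letter, text, token = "characters")

toy_words
```

The other columns (here, line) are carried along for each token, so it stays possible to trace every token back to where it came from.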

Now that the data is in one-word-per-row format, the TIDY DATA MAGIC can happen and we can manipulate it with tidy tools like dplyr. For example, we can remove stop words (kept in the tidytext dataset stop_words) with an anti_join.

data("stop_words")
tidy_books <- tidy_books %>%
        anti_join(stop_words)
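
If anti_join is unfamiliar, this sketch shows what it does on a tiny invented word list: it keeps only the rows of the left table that have no match in the right table. Passing by = "word" explicitly is my addition here; it names the join column rather than relying on dplyr to guess it.

```r
library(dplyr)

# Invented tokens and an invented stop word list
toy_words <- tibble(word = c("the", "house", "of", "emma"))
toy_stop_words <- tibble(word = c("the", "of", "and"))

# Keep only words that do NOT appear in the stop word list
kept <- toy_words %>%
  anti_join(toy_stop_words, by = "word")
kept
```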

Then we can use count to find the most common words in all of Jane Austen’s novels as a whole.

tidy_books %>%
        count(word, sort = TRUE)
## Source: local data frame [13,896 x 2]
## 
##      word     n
##     (chr) (int)
## 1    miss  1854
## 2    time  1337
## 3   fanny   862
## 4    dear   822
## 5    lady   817
## 6     sir   806
## 7     day   797
## 8    emma   787
## 9  sister   727
## 10  house   699
## ..    ...   ...

Sentiment analysis can be done as an inner join. Three sentiment lexicons are in the tidytext package in the sentiments dataset. Let’s examine how sentiment changes during each novel. Let’s find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections of each novel (here, sections of 80 lines).

library(tidyr)
bing <- sentiments %>%
        filter(lexicon == "bing") %>%
        select(-score)

bing
## Source: local data frame [6,788 x 3]
## 
##           word sentiment lexicon
##          (chr)     (chr)   (chr)
## 1      2-faced  negative    bing
## 2      2-faces  negative    bing
## 3           a+  positive    bing
## 4     abnormal  negative    bing
## 5      abolish  negative    bing
## 6   abominable  negative    bing
## 7   abominably  negative    bing
## 8    abominate  negative    bing
## 9  abomination  negative    bing
## 10       abort  negative    bing
## ..         ...       ...     ...

janeaustensentiment <- tidy_books %>%
        inner_join(bing) %>% 
        count(book, index = linenumber %/% 80, sentiment) %>% 
        spread(sentiment, n, fill = 0) %>% 
        mutate(sentiment = positive - negative)

janeaustensentiment
## Source: local data frame [891 x 5]
## Groups: book, index [891]
## 
##                   book index negative positive sentiment
##                 (fctr) (dbl)    (dbl)    (dbl)     (dbl)
## 1  Sense & Sensibility     0       16       26        10
## 2  Sense & Sensibility     1       19       44        25
## 3  Sense & Sensibility     2       12       23        11
## 4  Sense & Sensibility     3       15       22         7
## 5  Sense & Sensibility     4       16       29        13
## 6  Sense & Sensibility     5       16       39        23
## 7  Sense & Sensibility     6       24       37        13
## 8  Sense & Sensibility     7       22       39        17
## 9  Sense & Sensibility     8       30       35         5
## 10 Sense & Sensibility     9       14       18         4
## ..                 ...   ...      ...      ...       ...
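
The two moving parts in that pipeline are the integer division that assigns each line to a section and the reshaping that puts positive and negative counts side by side. Here is a sketch of both on invented data; the words and the tiny lexicon are made up, and I use pivot_wider from a recent tidyr where the code above uses spread.

```r
library(dplyr)
library(tidyr)

# Invented tokens with the line each came from
toy_words <- tibble(
  linenumber = c(3, 40, 79, 80, 120, 200),
  word = c("happy", "miserable", "delight", "awful", "dread", "joy")
)

# Invented stand-in for a sentiment lexicon
toy_lexicon <- tibble(
  word = c("happy", "delight", "joy", "miserable", "awful", "dread"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative")
)

toy_sentiment <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  # lines 0-79 fall in section 0, lines 80-159 in section 1, and so on
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)
toy_sentiment
```

Filling missing combinations with 0 matters: a section with no negative words would otherwise end up with a missing net score instead of a positive one.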

Now we can plot these sentiment scores across the plot trajectory of each novel.

library(ggplot2)
library(viridis)
ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
        geom_bar(stat = "identity", show.legend = FALSE) +
        facet_wrap(~book, ncol = 2, scales = "free_x") +
        theme_minimal(base_size = 13) +
        labs(title = "Sentiment in Jane Austen's Novels",
             y = "Sentiment") +
        scale_fill_viridis(end = 0.75, discrete=TRUE, direction = -1) +
        scale_x_discrete(expand=c(0.02,0)) +
        theme(strip.text=element_text(hjust=0)) +
        theme(strip.text = element_text(face = "italic")) +
        theme(axis.title.x=element_blank()) +
        theme(axis.ticks.x=element_blank()) +
        theme(axis.text.x=element_blank())

This is similar to some of the plots I have made in previous posts, but the effort and time required to make it is drastically less. More importantly, the thinking required to make it comes much more easily because it all falls so naturally out of joins and other dplyr verbs.

Looking at Units Beyond Words

Lots of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond only unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that

I am not having a good day.

is a negative sentence, not a positive one, because of negation. The Stanford CoreNLP tools and the sentimentr R package (currently available on GitHub but not CRAN) are examples of such sentiment analysis algorithms. For these, we may want to tokenize text into sentences.

austen_sentences <- austen_books() %>% 
        group_by(book) %>% 
        unnest_tokens(sentence, text, token = "sentences") %>% 
        ungroup()

Let’s look at just one.

austen_sentences$sentence[39]
## [1] "it would be enough to make them completely easy."

The sentence tokenizing does seem to have a bit of trouble with UTF-8 encoded text, especially with sections of dialogue; it does much better with punctuation in ASCII.
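
To make the earlier point about negation concrete, here is a sketch of how a word-by-word lexicon lookup gets the example sentence wrong, followed by a deliberately naive fix. The two-word lexicon, the list of negators, and the three-word window are all assumptions invented for this illustration; this is not how CoreNLP or sentimentr work.

```r
library(dplyr)

sentence <- "I am not having a good day."

# Lowercase, drop punctuation, split on spaces, and remember word order
words <- tibble(
  word = strsplit(tolower(gsub("[[:punct:]]", "", sentence)), " ")[[1]]
) %>%
  mutate(position = row_number())

# Invented mini-lexicon and negator list
toy_lexicon <- tibble(word = c("good", "bad"), score = c(1, -1))
negators <- c("not", "never", "no")

# Unigram scoring: only "good" matches, so the sentence looks positive
scored <- words %>%
  inner_join(toy_lexicon, by = "word")
unigram_score <- sum(scored$score)

# Naive heuristic: flip a word's score if a negator is among the 3 words before it
negator_positions <- words$position[words$word %in% negators]
is_negated <- vapply(scored$position, function(p) {
  any(negator_positions < p & negator_positions >= p - 3)
}, logical(1))
negated_score <- sum(ifelse(is_negated, -scored$score, scored$score))

c(unigram = unigram_score, negated = negated_score)
```

Even this crude window needs to know the order of words within a sentence, which is the kind of context that sentence-level tokens preserve.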

Near the beginning of this post, we used a regex to find where all the chapters were in Austen’s novels. We can use tidy text analysis to ask questions such as what are the most negative chapters in each of Jane Austen’s novels? First, let’s get the list of negative words from the Bing lexicon. Second, let’s make a dataframe of how many words are in each chapter so we can normalize for the length of chapters. Then, let’s find the number of negative words in each chapter and divide by the total words in each chapter. Which chapter has the highest proportion of negative words?

bingnegative <- sentiments %>%
        filter(lexicon == "bing", sentiment == "negative")

wordcounts <- tidy_books %>%
        group_by(book, chapter) %>%
        summarize(words = n())

tidy_books %>%
        semi_join(bingnegative) %>%
        group_by(book, chapter) %>%
        summarize(negativewords = n()) %>%
        left_join(wordcounts, by = c("book", "chapter")) %>%
        mutate(ratio = negativewords/words) %>%
        filter(chapter != 0) %>%
        top_n(1)
## Source: local data frame [6 x 5]
## Groups: book [6]
## 
##                  book chapter negativewords words     ratio
##                (fctr)   (int)         (int) (int)     (dbl)
## 1 Sense & Sensibility      29           172  1135 0.1515419
## 2   Pride & Prejudice      34           108   646 0.1671827
## 3      Mansfield Park      45           132   884 0.1493213
## 4                Emma      15           147  1012 0.1452569
## 5    Northanger Abbey      27            55   337 0.1632047
## 6          Persuasion      21           215  1948 0.1103696
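
Two details of that pipeline are easy to miss: the ratio, not the raw count, decides the winner, and top_n(1) with no weight column ranks by the last column in the table (here, ratio). The sketch below shows the same idea on invented chapter counts, using slice_max from a recent dplyr to name the ranking column explicitly.

```r
library(dplyr)

# Invented per-chapter counts for two made-up books
toy_chapters <- tibble(
  book = c("A", "A", "A", "B", "B"),
  chapter = c(1, 2, 3, 1, 2),
  negativewords = c(10, 30, 20, 5, 8),
  words = c(100, 400, 100, 50, 100)
)

most_negative <- toy_chapters %>%
  mutate(ratio = negativewords / words) %>%
  group_by(book) %>%
  # keep the chapter with the highest proportion of negative words per book
  slice_max(ratio, n = 1) %>%
  ungroup()
most_negative
```

In book A, chapter 2 has the most negative words in absolute terms, but chapter 3 wins once chapter length is taken into account.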

These are the chapters with the most negative words in each book, normalized for number of words in the chapter. What is happening in these chapters? In Chapter 29 of Sense and Sensibility Marianne finds out what an awful jerk Willoughby is by letter, and in Chapter 34 of Pride and Prejudice Mr. Darcy proposes for the first time (so badly!). Chapter 45 of Mansfield Park is almost the end, when Tom is sick with consumption and Mary is revealed as all greedy and a gold-digger, Chapter 15 of Emma is when horrifying Mr. Elton proposes, and Chapter 27 of Northanger Abbey is a short chapter where Catherine gets a terrible letter from her inconstant friend Isabella. Chapter 21 of Persuasion is when Anne’s friend tells her all about Mr. Elliot’s immoral past.

Interestingly, many of those chapters are very close to the ends of the novels; things tend to get really bad for Jane Austen’s characters before their happy endings, it seems. Also, these chapters largely involve terrible revelations about characters through letters or conversations about past events, rather than some action happening directly in the plot. All that, just with dplyr verbs, because the data is tidy.

Networks of Words

Another function in tidytext is pair_count, which counts pairs of items that occur together within a group. Let’s count the words that occur together in the lines of Pride and Prejudice.

pride_prejudice_words <- tidy_books %>%
        filter(book == "Pride & Prejudice")
word_cooccurences <- pride_prejudice_words %>%
        pair_count(linenumber, word, sort = TRUE)
word_cooccurences
## Source: local data frame [50,550 x 3]
## 
##       value1  value2     n
##        (chr)   (chr) (dbl)
## 1  catherine    lady    87
## 2    bingley    miss    68
## 3     bennet    miss    65
## 4      darcy    miss    46
## 5    william     sir    35
## 6     bourgh      de    32
## 7  elizabeth    miss    29
## 8  elizabeth    jane    27
## 9  elizabeth   cried    24
## 10   forster colonel    24
## ..       ...     ...   ...
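
For intuition about what counting pairs within a group means, here is a sketch that does it by hand on a few invented lines with base R's combn plus dplyr. This is only an illustration of the idea under my own assumptions (each unordered pair counted once per line); it is not how pair_count is implemented.

```r
library(dplyr)

# Invented tokens with the line each came from
toy_lines <- tibble(
  linenumber = c(1, 1, 2, 2, 2, 3, 3),
  word = c("lady", "catherine", "lady", "catherine", "bourgh",
           "miss", "bingley")
)

# For each line, list every unordered pair of distinct words on that line
pairs_per_line <- lapply(split(toy_lines$word, toy_lines$linenumber), function(w) {
  w <- sort(unique(w))
  if (length(w) < 2) {
    # a one-word line contributes no pairs
    return(tibble(value1 = character(), value2 = character()))
  }
  combos <- combn(w, 2)  # 2-row matrix, one column per pair
  tibble(value1 = combos[1, ], value2 = combos[2, ])
})

# Stack the per-line pairs and count how many lines each pair shares
toy_pair_counts <- bind_rows(pairs_per_line) %>%
  count(value1, value2, sort = TRUE)
toy_pair_counts
```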

This can be useful, for example, to plot a network of co-occurring words with the igraph and ggraph packages.

library(igraph)
library(ggraph)

set.seed(1813)
word_cooccurences %>%
        filter(n >= 10) %>%
        graph_from_data_frame() %>%
        ggraph(layout = "fr") +
        geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
        geom_node_point(color = "darkslategray4", size = 5) +
        geom_node_text(aes(label = name), vjust = 1.8) +
        ggtitle(expression(paste("Word Network in Jane Austen's ", 
                                 italic("Pride and Prejudice")))) +
        theme_void()

Ten/five/whatever thousand pounds a year!

Let’s do another one!

emma_words <- tidy_books %>%
        filter(book == "Emma")
word_cooccurences <- emma_words %>%
        pair_count(linenumber, word, sort = TRUE)
set.seed(2016)
word_cooccurences %>%
        filter(n >= 10) %>%
        graph_from_data_frame() %>%
        ggraph(layout = "fr") +
        geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
        geom_node_point(color = "plum4", size = 5) +
        geom_node_text(aes(label = name), vjust = 1.8) +
        ggtitle(expression(paste("Word Network in Jane Austen's ", 
                                 italic("Emma")))) +
        theme_void()

Lots of proper nouns are showing up in these network plots (Box Hill, Frank Churchill, Lady Catherine de Bourgh, etc.), and it is easy to pick out the main characters (Elizabeth, Emma). This type of network analysis is mainly showing us the important people and places in a text, and how they are related.

Word Frequencies

A common task in text mining is to look at word frequencies and to compare frequencies across different texts. We can do this using tidy data principles pretty smoothly. We already have Jane Austen’s works; let’s get two more sets of texts to compare to. Dave has just put together a new package to search and download books from Project Gutenberg through R; we’re going to use that because this is a better way to follow Project Gutenberg’s rules for robot access. And it is SO nice to use! First, let’s look at some science fiction and fantasy novels by H.G. Wells, who lived in the late 19th and early 20th centuries. Let’s get The Time Machine, The War of the Worlds, The Invisible Man, and The Island of Doctor Moreau.

library(gutenbergr)
hgwells <- gutenberg_download(c(35, 36, 5230, 159))
tidy_hgwells <- hgwells %>%
        unnest_tokens(word, text) %>%
        anti_join(stop_words)

Just for kicks, what are the most common words in these novels of H.G. Wells?

tidy_hgwells %>%
        count(word, sort = TRUE)
## Source: local data frame [11,769 x 2]
## 
##      word     n
##     (chr) (int)
## 1    time   454
## 2  people   302
## 3    door   260
## 4   heard   249
## 5   black   232
## 6   stood   229
## 7   white   222
## 8    hand   218
## 9    kemp   213
## 10   eyes   210
## ..    ...   ...

Now let’s get some well-known works of the Brontë sisters, whose lives overlapped with Jane Austen’s somewhat but who wrote in a bit of a different style. Let’s get Jane Eyre, Wuthering Heights, The Tenant of Wildfell Hall, Villette, and Agnes Grey.

bronte <- gutenberg_download(c(1260, 768, 969, 9182, 766))
tidy_bronte <- bronte %>%
        unnest_tokens(word, text) %>%
        anti_join(stop_words)

What are the most common words in these novels of the Brontë sisters?

tidy_bronte %>%
        count(word, sort = TRUE)
## Source: local data frame [25,714 x 2]
## 
##      word     n
##     (chr) (int)
## 1    time  1586
## 2    miss  1388
## 3    hand  1239
## 4     day  1136
## 5    eyes  1023
## 6   night  1011
## 7   house   960
## 8    head   957
## 9  looked   949
## 10   aunt   896
## ..    ...   ...

Well, Jane Austen is not going around talking about people’s HEARTS this much; I can tell you that right now. Those Brontë sisters, SO DRAMATIC. Interesting that “time”, “hand”, and “eyes” are in the top 10 for both H.G. Wells and the Brontë sisters.

Anyway, let’s calculate the frequency for each word for the works of Jane Austen, the Brontë sisters, and H.G. Wells.

tidy_both <- bind_rows(
        mutate(tidy_bronte, author = "Brontë Sisters"),
        mutate(tidy_hgwells, author = "H.G. Wells"))
frequency <- tidy_both %>%
        mutate(word = str_extract(word, "[a-z]+")) %>%
        count(author, word) %>%
        rename(other = n) %>%
        inner_join(count(tidy_books, word)) %>%
        rename(Austen = n) %>%
        mutate(other = other / sum(other),
               Austen = Austen / sum(Austen)) %>%
        ungroup()

I’m using str_extract here because the UTF-8 encoded texts from Project Gutenberg have some examples of words with underscores around them to indicate emphasis (you know, like italics). The tokenizer treated these as words but I don’t want to count “_any_” separately from “any”. Now let’s plot.
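
Before moving on to the plot, here is a quick sketch of what that pattern does to a few invented tokens. str_extract returns the first run of lowercase letters it finds, which removes the underscores but has a side effect worth knowing about.

```r
library(stringr)

# Invented tokens: emphasized, plain, and one containing an apostrophe
toy_tokens <- c("_any_", "any", "don't")

cleaned <- str_extract(toy_tokens, "[a-z]+")
cleaned
```

The emphasized and plain forms collapse to the same word, as intended. Anything after the first non-letter is dropped too, so a token like "don't" is cut down to "don"; for a frequency comparison like this one that seems an acceptable trade.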

library(scales)
ggplot(frequency, aes(x = other, y = Austen, color = abs(Austen - other))) +
        geom_abline(color = "gray40") +
        geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) +
        geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
        scale_x_log10(labels = percent_format()) +
        scale_y_log10(labels = percent_format()) +
        scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
        facet_wrap(~author, ncol = 2) +
        theme_minimal(base_size = 14) +
        theme(legend.position="none") +
        labs(title = "Comparing Word Frequencies",
             subtitle = "Word frequencies in Jane Austen's texts are closer to the Brontë sisters' than to H.G. Wells'",
             y = "Jane Austen", x = NULL)

Words that are close to the line in these plots have similar frequencies in both sets of texts, for example, in both Austen and Brontë texts (“miss”, “time”, “lady”, “day” at the upper frequency end) or in both Austen and Wells texts (“time”, “day”, “mind”, “brother” at the high frequency end). Words that are far from the line are words that are found more in one set of texts than another. For example, in the Austen-Brontë plot, words like “elizabeth”, “emma”, “captain”, and “bath” (all proper nouns) are found in Austen’s texts but not much in the Brontë texts, while words like “arthur”, “dark”, “dog”, and “doctor” are found in the Brontë texts but not the Austen texts. In comparing H.G. Wells with Jane Austen, Wells uses words like “beast”, “guns”, “brute”, and “animal” that Austen does not, while Austen uses words like “family”, “friend”, “letter”, and “agreeable” that Wells does not.

Overall, notice that the words in the Austen-Brontë plot are closer to the diagonal line (slope one, zero intercept) than in the Austen-Wells plot and also extend to lower frequencies; Austen and the Brontë sisters use more similar words than Austen and H.G. Wells. Also, you might notice the percent frequencies for individual words are different in one plot when compared to another because of the inner join; not all the words are found in all three sets of texts so the percent frequency is a different quantity.
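
The effect of the inner join on the percentages can be seen in a sketch with invented counts. The author labels and numbers are made up, and I group by author explicitly so that each proportion is taken within one author's set of shared words.

```r
library(dplyr)

# Invented Austen counts
austen_counts <- tibble(word = c("time", "miss", "letter"),
                        n = c(50, 30, 20))

# Invented counts for two other authors; Author B shares only "time"
other_counts <- tibble(
  author = c("Author A", "Author A", "Author B"),
  word = c("time", "miss", "time"),
  other = c(40, 60, 10)
)

toy_frequency <- other_counts %>%
  # only words found in both sets of texts survive the join
  inner_join(rename(austen_counts, Austen = n), by = "word") %>%
  group_by(author) %>%
  mutate(other = other / sum(other),
         Austen = Austen / sum(Austen)) %>%
  ungroup()
toy_frequency
```

The Austen frequency of "time" comes out as 62.5% in the comparison with Author A and 100% in the comparison with Author B, because the denominator only includes words shared with that author.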

Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Austen and the Brontë sisters, and between Austen and Wells?

cor.test(data = frequency[frequency$author == "Brontë Sisters",], ~ other + Austen)
## 
## 	Pearson's product-moment correlation
## 
## data:  other and Austen
## t = 122.45, df = 10611, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7572399 0.7730119
## sample estimates:
##       cor 
## 0.7652408

cor.test(data = frequency[frequency$author == "H.G. Wells",], ~ other + Austen)
## 
## 	Pearson's product-moment correlation
## 
## data:  other and Austen
## t = 36.043, df = 5958, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4020291 0.4437216
## sample estimates:
##       cor 
## 0.4230993

The word frequencies are much more strongly correlated between Austen and the Brontë sisters (about 0.77) than between Austen and H.G. Wells (about 0.42), consistent with what the plots show.
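
If you want the numbers rather than the printed report, the object returned by cor.test can be pulled apart directly. This sketch uses a tiny invented data frame with the same column names as above.

```r
# Invented frequencies, just to have something to correlate
toy <- data.frame(other = c(1, 2, 3, 4, 5),
                  Austen = c(2, 4, 5, 4, 5))

result <- cor.test(~ other + Austen, data = toy)

# The estimate and its confidence interval are stored as list elements
result$estimate
result$conf.int
```

This makes it straightforward to collect the correlations for several authors into one small table instead of reading them off separate printouts.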

The End

There is another whole set of functions in tidytext for converting to and from document-term matrices. Many existing text mining data sets are in document-term matrices, or you might want such a matrix for a specific machine learning application. The tidytext package has tidy functions for objects from the tm and quanteda packages so you can convert back and forth. (For more on the tidy verb, see the broom package.) This allows, for example, a workflow with easy reading, filtering, and processing to be done using dplyr and other tidy tools, after which the data can be converted into a document-term matrix for machine learning applications. For examples of working with objects from other text mining packages using tidy data principles, see the tidytext vignette on converting to and from document-term matrices.
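
As a rough sketch of that round trip, here is a tiny invented table of counts cast to a document-term matrix and tidied back again. It assumes the tm package is installed, since cast_dtm creates a tm DocumentTermMatrix; the documents and counts are made up.

```r
library(dplyr)
library(tidytext)

# Invented one-row-per-document-per-word counts
toy_counts <- tibble(
  document = c("doc1", "doc1", "doc2"),
  word = c("sense", "pride", "pride"),
  n = c(2, 1, 4)
)

# Tidy data frame -> document-term matrix (needs the tm package)
toy_dtm <- toy_counts %>%
  cast_dtm(document, word, n)

# Document-term matrix -> tidy data frame, one row per non-zero entry
round_trip <- tidy(toy_dtm)
round_trip
```

Only the non-zero entries come back, so the tidied version has the same three rows we started with rather than one row for every cell of the matrix.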

Many thanks to rOpenSci for hosting the unconference where we started work on the tidytext package, and to Gabriela de Queiroz, who contributed to the package while we were at the unconference. I am super happy to have collaborated with Dave; it has been a delightful experience. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!


