tidytext 0.1.3
Jun 18, 2017

I am pleased to announce that tidytext 0.1.3 is now on CRAN!

In this release, my collaborator David Robinson and I have fixed a handful of bugs, added tidiers for LDA models from the mallet package, and updated functions for changes to quanteda’s API. You can check out the NEWS for more details on changes.

One enhancement in this release is the addition of the Loughran and McDonald sentiment lexicon of words specific to financial reporting. Sentiment lexicons are lists of words that are used to assess the emotion or opinion content of text by adding up the sentiment scores of individual words within that text; the tidytext package contains three general purpose English sentiment lexicons. The positive or negative meaning of a word can depend on its context, though. A word like “risk” has a negative meaning in most general contexts but may be more neutral for financial reporting. Context-specific sentiment lexicons like the Loughran-McDonald dictionary provide a way to deal with this.
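To make the "adding up" concrete, here is a minimal sketch on made-up data. The two sentences and the scores in `toy_lexicon` are invented for illustration; they do not come from any lexicon that ships with tidytext.

```r
library(dplyr)
library(tidytext)

# Invented example text, one row per document
docs <- tibble(doc = c("a", "b"),
               text = c("Gains were better than anticipated",
                        "Losses and risk increased"))

# Hand-made lexicon for illustration only (not a tidytext lexicon)
toy_lexicon <- tibble(word = c("gains", "better", "losses", "risk"),
                      score = c(1, 1, -1, -1))

# One word per row, keep only words found in the lexicon,
# then sum the scores within each document
doc_scores <- docs %>%
    unnest_tokens(word, text) %>%
    inner_join(toy_lexicon, by = "word") %>%
    group_by(doc) %>%
    summarise(sentiment = sum(score))

doc_scores
```

If the toy lexicon scored "risk" as 0 instead of -1, the total for document b would change, which is the kind of context dependence a domain-specific lexicon is meant to handle.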

This financial lexicon labels words with six possible sentiments.

library(tidytext)
library(tidyverse)

get_sentiments("loughran") %>%
    count(sentiment, sort = TRUE)
## # A tibble: 6 x 2
##      sentiment     n
##          <chr> <int>
## 1     negative  2355
## 2    litigious   903
## 3     positive   354
## 4  uncertainty   297
## 5 constraining   184
## 6  superfluous    56

An example

I recently saw a sentiment analysis by Michael Toth of Warren Buffett’s letters to shareholders. It’s a super interesting analysis, done well, but we can see from some of the plots in that analysis that the specifically financial nature of these documents would make a financial sentiment lexicon a great choice. Let’s scrape the letters from Berkshire Hathaway, Warren Buffett’s company, and then implement a sentiment analysis using this new lexicon.

library(rvest)
library(pdftools)

urls_oldest <- paste0("http://www.berkshirehathaway.com/letters/", 
                     seq(1977, 1997), ".html")
html_urls <- c(urls_oldest,
               "http://www.berkshirehathaway.com/letters/1998htm.html",
               "http://www.berkshirehathaway.com/letters/1999htm.html",
               "http://www.berkshirehathaway.com/2000ar/2000letter.html",
               "http://www.berkshirehathaway.com/2001ar/2001letter.html")

letters_html <- html_urls %>%
    map_chr(~ read_html(.) %>% 
                html_text())

urls_newest <- paste0("http://www.berkshirehathaway.com/letters/", 
                      seq(2003, 2016), "ltr.pdf")

pdf_urls <- c("http://www.berkshirehathaway.com/letters/2002pdf.pdf",
              urls_newest)

letters_pdf <- pdf_urls %>%
    map_chr(~ pdf_text(.) %>% paste(collapse = " "))

letters <- tibble(year = seq(1977, 2016),
                  text = c(letters_html, letters_pdf))

Now we have the letters and can convert them to a tidy text format.

tidy_letters <- letters %>%
    unnest_tokens(word, text) %>%
    add_count(year) %>%
    rename(year_total = n)

tidy_letters
## # A tibble: 486,560 x 3
##     year         word year_total
##    <int>        <chr>      <int>
##  1  1977   chairman's       3063
##  2  1977       letter       3063
##  3  1977         1977       3063
##  4  1977    berkshire       3063
##  5  1977     hathaway       3063
##  6  1977          inc       3063
##  7  1977           to       3063
##  8  1977          the       3063
##  9  1977 stockholders       3063
## 10  1977           of       3063
## # ... with 486,550 more rows
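The `year_total` column is what lets us turn raw counts into relative frequencies later on. A small sketch on invented text shows what these two steps produce. Here `toy_letters` is made up, and the `name` argument to `add_count()` assumes a reasonably recent dplyr (0.8.0 or later); it stands in for the separate `rename()` call above.

```r
library(dplyr)
library(tidytext)

# Invented two-row stand-in for the real letters
toy_letters <- tibble(year = c(1977L, 1978L),
                      text = c("gains and losses", "better gains"))

# One word per row, plus the total number of words in that year
toy_tidy <- toy_letters %>%
    unnest_tokens(word, text) %>%
    add_count(year, name = "year_total")

toy_tidy
```

Every row for 1977 carries a `year_total` of 3 and every row for 1978 carries 2, so dividing a per-year count by `year_total` gives a proportion that is comparable across letters of different lengths.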

Next, let’s implement the sentiment analysis.

letter_sentiment <- tidy_letters %>%
    inner_join(get_sentiments("loughran"), by = "word")

letter_sentiment
## # A tibble: 20,921 x 4
##     year        word year_total   sentiment
##    <int>       <chr>      <int>       <chr>
##  1  1977      better       3063    positive
##  2  1977 anticipated       3063 uncertainty
##  3  1977       gains       3063    positive
##  4  1977       gains       3063    positive
##  5  1977      losses       3063    negative
##  6  1977       gains       3063    positive
##  7  1977      losses       3063    negative
##  8  1977 anticipated       3063 uncertainty
##  9  1977   indemnity       3063   litigious
## 10  1977      better       3063    positive
## # ... with 20,911 more rows

Now we have all we need to see the relative changes in these sentiments over the years.

letter_sentiment %>%
    count(year, year_total, sentiment) %>%
    filter(sentiment %in% c("positive", "negative", 
                            "uncertainty", "litigious")) %>%
    mutate(sentiment = factor(sentiment, levels = c("negative",
                                                    "positive",
                                                    "uncertainty",
                                                    "litigious"))) %>%
    ggplot(aes(year, n / year_total, fill = sentiment)) +
    geom_area(position = "identity", alpha = 0.5) +
    labs(y = "Relative frequency", x = NULL,
         title = "Sentiment analysis of Warren Buffett's shareholder letters",
         subtitle = "Using the Loughran-McDonald lexicon")

We see negative sentiment spiking, higher than positive sentiment, during the financial upheaval of 2008, the collapse of the dot-com bubble in the early 2000s, and the recession of the 1990s. Overall, though, notice that the balance of positive to negative sentiment is not as skewed to positive as when you use one of the general purpose sentiment lexicons.
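One way to check that balance yourself is to compute, for each year, the share of positive words among all positive and negative matches, once per lexicon. The helper below is a hypothetical sketch, not part of tidytext; `positive_share()`, `toy_text`, and `toy_lexicon` are names and data invented for this example.

```r
library(dplyr)

# Hypothetical helper: share of positive words among positive + negative
# matches, by year. Expects `tidy_text` with columns year and word, and
# `lexicon` with columns word and sentiment.
positive_share <- function(tidy_text, lexicon) {
    tidy_text %>%
        inner_join(lexicon, by = "word") %>%
        group_by(year) %>%
        summarise(positive = sum(sentiment == "positive"),
                  negative = sum(sentiment == "negative")) %>%
        mutate(positive_share = positive / (positive + negative))
}

# Invented stand-ins so the sketch runs on its own
toy_text <- tibble(year = c(1977, 1977, 1977, 1978, 1978),
                   word = c("gains", "losses", "better", "losses", "risk"))
toy_lexicon <- tibble(word = c("gains", "better", "losses"),
                      sentiment = c("positive", "positive", "negative"))

toy_share <- positive_share(toy_text, toy_lexicon)
toy_share

# With the real data, the comparison would look something like:
# positive_share(tidy_letters, get_sentiments("bing"))
# positive_share(tidy_letters, get_sentiments("loughran"))
```

Years with no positive or negative matches drop out of the result, because the inner join removes them before the counts are taken.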

The balance shifts between lexicons because different words drive the sentiment score in each case. The words in the financial sentiment lexicon were chosen specifically for a financial context. What words are driving these sentiment scores?

letter_sentiment %>%
    count(sentiment, word) %>%
    filter(sentiment %in% c("positive", "negative", 
                            "uncertainty", "litigious")) %>%
    group_by(sentiment) %>%
    top_n(15, n) %>%
    ungroup() %>%
    mutate(word = reorder(word, n)) %>%
    mutate(sentiment = factor(sentiment, levels = c("negative",
                                                    "positive",
                                                    "uncertainty",
                                                    "litigious"))) %>%
    ggplot(aes(word, n, fill = sentiment)) +
    geom_col(alpha = 0.8, show.legend = FALSE) +
    coord_flip() +
    scale_y_continuous(expand = c(0,0)) +
    facet_wrap(~sentiment, scales = "free") +
    labs(x = NULL, y = "Total number of occurrences",
         title = "Words driving sentiment scores in Warren Buffett's shareholder letters",
         subtitle = "From the Loughran-McDonald lexicon")

The End

Checking which words are driving a sentiment score is important not only for financial text but for all text, and tidy data principles make it straightforward to dig into such an analysis. Our upcoming book Text Mining with R explores how applying tidy data principles to this and lots of other text mining tasks makes your time and energy well spent. In the meantime, get the new version of tidytext and let us know on GitHub if you run into any issues!


