Where are #TidyTuesday haunted cemeteries compared to haunted schools?

By Julia Silge in rstats

October 11, 2023

This is the latest in my series of screencasts! It’s spooky season, and this screencast focuses on how to compute weighted log odds using this week’s #TidyTuesday dataset on haunted places in the United States. 👻

Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Explore data

Our analysis goal here is to understand how different types of haunted places in the United States vary by state. Let’s start by reading in the data:

library(tidyverse)
haunted_places <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-10-10/haunted_places.csv')

The description variable is longer text explaining details about the haunting, while the location variable is shorter text describing the haunted location. Which states are these hauntings in?

haunted_places |> 
  count(state, sort = TRUE)

# A tibble: 51 × 2
   state             n
   <chr>         <int>
 1 California     1070
 2 Texas           696
 3 Pennsylvania    649
 4 Michigan        529
 5 Ohio            477
 6 New York        459
 7 Illinois        395
 8 Kentucky        370
 9 Indiana         351
10 Massachusetts   342
# ℹ 41 more rows

And what kinds of places are haunted?

haunted_places |> 
  slice_sample(n = 10) |> 
  select(location)

# A tibble: 10 × 1
   location                  
   <chr>                     
 1 Single T Canal            
 2 Hwy521                    
 3 Santaquin Canyon          
 4 Syracuse Cemetery         
 5 Del Frisco's Steakhouse   
 6 Walt Disney World         
 7 the Tanning Yards         
 8 Moonshadows Restaurant    
 9 Refugio County Court House
10 The Legend of Big Liz

What are the most common words used for the haunted locations?

library(tidytext)

haunted_places |> 
  unnest_tokens(word, location) |> 
  count(word, sort = TRUE)

# A tibble: 7,765 × 2
   word           n
   <chr>      <int>
 1 school      1217
 2 the          989
 3 cemetery     751
 4 high         700
 5 old          599
 6 house        502
 7 university   500
 8 road         437
 9 of           406
10 college      373
# ℹ 7,755 more rows

Looks like there are lots of haunted cemeteries, which seems reasonable to me, but also lots of haunted schools! 😱 Haunted schools??!? Won’t someone think of the children? I don’t approve at all.

Weighted log odds

I’d like to know what states are more likely to have haunted cemeteries and what states are more likely to have haunted schools, so I can know which states keep their hauntings where they belong. Let’s use tidylo, a package for weighted log odds using tidy data principles. A log odds ratio is a way of expressing probabilities, and we can weight a log odds ratio so that our implementation does a better job dealing with different features having different counts. Think about how the different states have different numbers of haunted places; we’d like to compute a log odds ratio that gives us a more robust estimate across the states with many haunted places and those with few. You can read more about tidylo in a previous blog post.

To start, let’s create a dataset of counts:

haunted_counts <-
  haunted_places |> 
  mutate(location = str_to_lower(location)) |> 
  mutate(location = case_when(
    str_detect(location, "cemetery|graveyard") ~ "cemetery",
    str_detect(location, "school") ~ "school",
    .default = "other"
  )) |> 
  count(state_abbrev, location)

haunted_counts

# A tibble: 149 × 3
   state_abbrev location     n
   <chr>        <chr>    <int>
 1 AK           cemetery     4
 2 AK           other       23
 3 AK           school       5
 4 AL           cemetery    19
 5 AL           other      187
 6 AL           school      18
 7 AR           cemetery    16
 8 AR           other       91
 9 AR           school      12
10 AZ           cemetery     3
# ℹ 139 more rows

Now we can compute the log odds, weighted (the tidylo default) via empirical Bayes:

library(tidylo)

haunted_counts |> 
  bind_log_odds(location, state_abbrev, n) |> 
  group_by(location) |> 
  slice_max(log_odds_weighted, n = 5) |> 
  mutate(state_abbrev = reorder(state_abbrev, log_odds_weighted)) |> 
  ggplot(aes(log_odds_weighted, state_abbrev, fill = location)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(location), scales = "free") +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Weighted log odds (empirical Bayes)", y = NULL)

This plot shows us that:

Illinois, Idaho, and Connecticut are reasonable places where haunted locations are more likely to be cemeteries compared to other states.
The haunting situation in California, South Dakota, and North Dakota is not good as their haunted locations are more likely to be schools. Beware! 👻
Delaware, Washington DC, and Maryland have haunted locations that are more likely to be something other than cemeteries and schools, compared to the other states. What are some of these, specifically in DC?

haunted_places |> 
  filter(state_abbrev == "DC") |> 
  slice_sample(n = 10) |> 
  select(location)

# A tibble: 10 × 1
   location        
   <chr>           
 1 White House     
 2 White House     
 3 Trinity College 
 4 Capitol Building
 5 Fort McNair     
 6 White House     
 7 White House     
 8 The Hay         
 9 Capitol building
10 Georgetown

Seems like the White House is pretty haunted!

Posted on:: October 11, 2023

Length:: 4 minute read, 755 words

Categories:: rstats

Tags:: rstats

See Also:: Release an R package with Positron; Positron in action with #TidyTuesday orca encounters; Educational attainment in #TidyTuesday UK towns