Empirical Bayes for #TidyTuesday Doctor Who episodes

By Julia Silge in rstats

November 29, 2023

This is the latest in my series of screencasts! I love Doctor Who and cannot pass up the opportunity to dig in to this week’s #TidyTuesday dataset on Doctor Who episodes. I wrote about using tidymodels for Doctor Who viewership in 2021 and I’m excited to revisit this topic, focusing this time on the writers of Doctor Who episodes. 💙


Here is the code I used in the video, for those who prefer reading instead of or in addition to video. FYI, I recently started using GitHub Copilot in RStudio and you can see it in action in the video.

Explore data

Our analysis goal is to see which Doctor Who writers have episodes with higher ratings. Let’s start by reading in the data, both for episodes and writers, and joining these two datasets together. Let’s compute the mean and variance of the ratings for each writer:

library(tidyverse)
library(datardis)
data("drwho_episodes")
data("drwho_writers")

dr_who_joined <-
  drwho_episodes |> 
  left_join(drwho_writers) |> 
  mutate(rating = as.numeric(rating)) |>
  group_by(writer) |>
  summarize(n = n(), 
            std_var = var(rating) / n,
            rating = mean(rating)) 

dr_who_joined
## # A tibble: 40 × 4
##    writer                               n std_var rating
##    <chr>                            <int>   <dbl>  <dbl>
##  1 Catherine Tregenna                   1  NA       81  
##  2 Charlene James                       1  NA       78  
##  3 Chris Chibnall                      29   0.419   80.7
##  4 Ed Hime                              2   2.25    78.5
##  5 Ella Road                            1  NA       76  
##  6 Frank Cottrell-Boyce                 2   0       83  
##  7 Gareth Roberts                       5   0.14    86.8
##  8 Gareth Roberts and Steven Moffat     1  NA       83  
##  9 Helen Raynor                         4   0.417   86.5
## 10 James Moran                          1  NA       87  
## # ℹ 30 more rows

How are the ratings distributed, especially as related to the number of episodes per writer?

ggplot(dr_who_joined, aes(n, rating)) +
  geom_point(size = 2) +
  labs(x = "Number of episodes", y = "Mean rating")

I feel much more certain about the mean rating for the writers who have written a lot of episodes than the ones who have written only a few. How sure are we about those ratings for writer with only a couple of episodes each? This is exactly the kind of question we can answer with empirical Bayes. I wrote about getting started myself with empirical Bayes back in 2016, and I recommend the blog posts and book by my friend David Robinson to learn more.

Learning a prior from the data

What makes empirical Bayes “empirical” is that we learn a prior probability distribution from the data itself. I am going dead simple here, and using a normal distribution for the prior.

prior_mu <- mean(dr_who_joined$rating)
prior_var <- var(dr_who_joined$rating)

Can’t get much simpler than that! Let’s plot this prior distribution together with our distribution of ratings.

ggplot(dr_who_joined) +
    geom_histogram(aes(rating, after_stat(density)), alpha = 0.6) +
    stat_function(fun = dnorm, 
                  args = list(mean = prior_mu, sd = sqrt(prior_var)), 
                  color = "gray30", linewidth = 1.5, lty = 2)

Looks reasonable! If you had more data than we do here, you could use a more complex prior distribution or do a more careful job of fitting the distribution. Given that we only have a couple dozen writers, I think this is good enough for now.

Shrinkage

Now we can use Bayes Theorem to compute posterior probabilities for each writer. We can calculate:

  • the shrinkage factor from the variance of the prior and the variance for each writer
  • the “shrunk” rating from the measured rating, the mean of the prior, and that shrinkage factor
dr_who_empirical <-
    dr_who_joined |> 
    mutate(
        shrinkage = prior_var / (prior_var + std_var),
        rating_shrunk = shrinkage * rating + (1 - shrinkage) * prior_mu
    )

dr_who_empirical
## # A tibble: 40 × 6
##    writer                               n std_var rating shrinkage rating_shrunk
##    <chr>                            <int>   <dbl>  <dbl>     <dbl>         <dbl>
##  1 Catherine Tregenna                   1  NA       81      NA              NA  
##  2 Charlene James                       1  NA       78      NA              NA  
##  3 Chris Chibnall                      29   0.419   80.7     0.950          80.8
##  4 Ed Hime                              2   2.25    78.5     0.778          79.6
##  5 Ella Road                            1  NA       76      NA              NA  
##  6 Frank Cottrell-Boyce                 2   0       83       1              83  
##  7 Gareth Roberts                       5   0.14    86.8     0.983          86.7
##  8 Gareth Roberts and Steven Moffat     1  NA       83      NA              NA  
##  9 Helen Raynor                         4   0.417   86.5     0.950          86.3
## 10 James Moran                          1  NA       87      NA              NA  
## # ℹ 30 more rows

Notice that with this very simple approach, we can’t compute a rating with empirical Bayes for a writer with only one episode, because we can’t compute a variance. For the rest of the writers, how do the measured ratings compare to the “shrunk” ratings?

library(ggrepel)

dr_who_empirical |> 
    ggplot(aes(rating, rating_shrunk)) +
    geom_abline(slope = 1, color = "gray60", linewidth = 1.5, lty = 2) +
    geom_point(aes(size = n)) +
    geom_text_repel(aes(label = writer), family = "IBMPlexSans") +
    coord_fixed()

Notice that for writers with only a few episodes, lower ratings are pulled up while higher ratings are pulled down; they are “shrunk” toward the mean. For writers with many episodes, we don’t see much change. I notice that the rating for Neil Gaiman is pulled down quite a bit; my own personal prior here would say that’s not right at all and I would expect any additional episodes by Gaiman to still be very highly rated. 😆

Posted on:
November 29, 2023
Length:
5 minute read, 868 words
Categories:
rstats
Tags:
rstats
See Also:
Changes in #TidyTuesday US polling places
Logistic regression modeling for #TidyTuesday US House Elections
Topic modeling for #TidyTuesday Taylor Swift lyrics