How often does Roy Kent say "F*CK"?

By Julia Silge in rstats tidymodels

September 27, 2023

This is the latest in my series of screencasts! This screencast focuses on how to compute bootstrap confidence intervals, using this week’s #TidyTuesday dataset on Roy Kent’s colorful language. 🤬


Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Explore data

It’s hard not to love the show Ted Lasso or one of its most compelling characters, Roy Kent, and our modeling goal here is to estimate how his use of what appears to be his favorite word depends on his coaching status and/or his dating status. This dataset was created by Deepsha Menghani for her excellent talk at posit::conf(2023), and you should definitely keep an eye out for the video when it becomes available later this year. Let’s start by reading in the data:

library(tidyverse)
library(richmondway)
data(richmondway)
glimpse(richmondway)
## Rows: 34
## Columns: 16
## $ Character         <chr> "Roy Kent", "Roy Kent", "Roy Kent", "Roy Kent", "Roy…
## $ Episode_order     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Season            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Episode           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, …
## $ Season_Episode    <chr> "S1_e1", "S1_e2", "S1_e3", "S1_e4", "S1_e5", "S1_e6"…
## $ F_count_RK        <dbl> 2, 2, 7, 8, 4, 2, 5, 7, 14, 5, 11, 10, 2, 2, 23, 12,…
## $ F_count_total     <dbl> 13, 8, 13, 17, 13, 9, 15, 18, 22, 22, 16, 22, 8, 6, …
## $ cum_rk_season     <dbl> 2, 4, 11, 19, 23, 25, 30, 37, 51, 56, 11, 21, 23, 25…
## $ cum_total_season  <dbl> 13, 21, 34, 51, 64, 73, 88, 106, 128, 150, 16, 38, 4…
## $ cum_rk_overall    <dbl> 2, 4, 11, 19, 23, 25, 30, 37, 51, 56, 67, 77, 79, 81…
## $ cum_total_overall <dbl> 13, 21, 34, 51, 64, 73, 88, 106, 128, 150, 166, 188,…
## $ F_score           <dbl> 0.1538462, 0.2500000, 0.5384615, 0.4705882, 0.307692…
## $ F_perc            <dbl> 15.4, 25.0, 53.8, 47.1, 30.8, 22.2, 33.3, 38.9, 63.6…
## $ Dating_flag       <chr> "No", "No", "No", "No", "No", "No", "No", "Yes", "Ye…
## $ Coaching_flag     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No"…
## $ Imdb_rating       <dbl> 7.8, 8.1, 8.5, 8.2, 8.9, 8.5, 9.0, 8.7, 8.6, 9.1, 7.…

This is not what you would call a large dataset (only 34 episodes), but we can still check out the distribution of how often Roy Kent says “f*ck” per episode. How does it compare when he is dating Keeley vs. not?

richmondway |> 
    ggplot(aes(F_count_RK, fill = Dating_flag)) +
    geom_histogram(position = "identity", bins = 7, alpha = 0.7) +
    scale_fill_brewer(palette = "Dark2")

Or what about when he is coaching vs. not?

richmondway |> 
    ggplot(aes(F_count_RK, fill = Coaching_flag)) +
    geom_histogram(position = "identity", bins = 7, alpha = 0.7) +
    scale_fill_brewer(palette = "Dark2")

It looks like there may be differences here, but it’s a small dataset, so let’s use statistical modeling to be more confident about what we are seeing.
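
Before modeling, a quick numeric summary can back up what the histograms suggest. This is a minimal sketch that was not part of the video; the object name `f_summary` and the column names `episodes` and `mean_f_count` are my own choices for illustration:

```r
library(dplyr)
library(richmondway)
data(richmondway)

# Count episodes and average Roy Kent's F-count per episode
# for each combination of coaching and dating status
f_summary <- richmondway |>
    group_by(Coaching_flag, Dating_flag) |>
    summarize(
        episodes = n(),
        mean_f_count = mean(F_count_RK),
        .groups = "drop"
    )

f_summary
```

With so few episodes in each group, these averages are noisy, which is exactly why intervals are more useful than point estimates here.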

Bootstrap confidence intervals for Poisson regression

There isn’t much code in what we’re about to do, but let’s outline two important pieces of what is going on:

  • These are counts of Roy Kent’s F-bombs per episode, so we want to use a model that is a good fit for count data, i.e. Poisson regression.
  • We could fit a Poisson regression model one time to this dataset, but it’s such a tiny dataset that we might not have much confidence in the results. Instead, we want to use bootstrap resamples to fit our model a whole bunch of times to get confidence intervals, and then use these replicate results to estimate the impact of coaching and dating.
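
For reference, the "fit it one time" option from the second bullet might look like the sketch below. It assumes the broom package is installed for `tidy()`, and `single_fit` is a name I made up for illustration:

```r
library(broom)
library(richmondway)
data(richmondway)

# One Poisson regression on the full dataset (no resampling)
single_fit <- glm(
    F_count_RK ~ Dating_flag + Coaching_flag,
    data = richmondway,
    family = "poisson"
)

# Coefficients are on the log scale; standard errors here rely on
# the usual model-based assumptions rather than on resampling
tidy(single_fit)
```

The bootstrap approach uses this same model, but refits it on many resampled versions of the data.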

We can use the reg_intervals() function from rsample to do this all at once. If we use keep_reps = TRUE, we will get each individual model result in our results:

library(rsample)

set.seed(123)
poisson_intervals <- 
    reg_intervals(
        F_count_RK ~ Dating_flag + Coaching_flag, 
        data = richmondway, 
        model_fn = "glm", 
        family = "poisson",
        keep_reps = TRUE
    )

poisson_intervals
## # A tibble: 2 × 7
##   term             .lower .estimate .upper .alpha .method          .replicates
##   <chr>             <dbl>     <dbl>  <dbl>  <dbl> <chr>     <list<tibble[,2]>>
## 1 Coaching_flagYes  0.236    0.654   1.04    0.05 student-t        [1,001 × 2]
## 2 Dating_flagYes   -0.351    0.0428  0.448   0.05 student-t        [1,001 × 2]
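
If you want to see roughly what is happening under the hood, here is a sketch of a by-hand version using `bootstraps()` and `int_pctl()` from rsample. Note that it computes percentile intervals rather than the student-t intervals shown above, so the numbers will not match exactly, and the names `boots`, `boot_models`, and `pctl_intervals` are just for illustration:

```r
library(tidyverse)
library(rsample)
library(broom)
library(richmondway)
data(richmondway)

set.seed(123)
# Resample the episodes with replacement, 1000 times
boots <- bootstraps(richmondway, times = 1000, apparent = TRUE)

# Fit the same Poisson regression to each resample and tidy the coefficients
boot_models <- boots |>
    mutate(
        model = map(
            splits,
            \(split) glm(
                F_count_RK ~ Dating_flag + Coaching_flag,
                data = analysis(split),
                family = "poisson"
            )
        ),
        coef_info = map(model, tidy)
    )

# Percentile intervals for each term (this includes the intercept)
pctl_intervals <- int_pctl(boot_models, coef_info)
pctl_intervals
```

The convenience of `reg_intervals()` is that it wraps the resampling, fitting, and interval steps into one call.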

Notice the .replicates column, where we have each of the 1,001 results from our 1,001 bootstrap resamples. We can unnest() this column and make a visualization:

poisson_intervals |>
    mutate(term = str_remove(term, "_flagYes")) |> 
    unnest(.replicates) |>
    ggplot(aes(estimate, fill = term)) +
    geom_vline(xintercept = 0, linewidth = 1.5, lty = 2, color = "gray50") +
    geom_histogram(alpha = 0.8, show.legend = FALSE) +
    facet_wrap(vars(term)) +
    scale_fill_brewer(palette = "Accent")

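Looks like we have strong evidence that Roy Kent says “F*CK” more per episode when he is coaching (the interval for the coaching coefficient sits entirely above zero), but no real evidence that he does so more or less when dating Keeley (that interval comfortably spans zero).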
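
Poisson regression coefficients are on the log scale, so exponentiating them gives a multiplicative change in the expected count. As a rough sketch, the values below are copied by hand from the `reg_intervals()` output printed earlier; `log_scale` and `rate_ratios` are names I chose for illustration:

```r
# Interval endpoints and estimates copied from the printed output
log_scale <- data.frame(
    term = c("Coaching_flagYes", "Dating_flagYes"),
    lower = c(0.236, -0.351),
    estimate = c(0.654, 0.0428),
    upper = c(1.04, 0.448)
)

# exp() turns each log-scale value into a rate ratio
rate_ratios <- transform(
    log_scale,
    lower = exp(lower),
    estimate = exp(estimate),
    upper = exp(upper)
)

rate_ratios
```

Read this way, the coaching estimate corresponds to roughly 1.9 times as many F-bombs per episode, with an interval from about 1.3 to 2.8, while the dating interval includes a ratio of 1 (no change).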
