Handling model coefficients for #TidyTuesday collegiate sports
By Julia Silge in rstats tidymodels
April 9, 2022
This is the latest in my series of
screencasts demonstrating
how to use the
tidymodels packages. This
screencast is less about predictive modeling and more about how to
handle and generate model coefficients with tidymodels. Let’s learn more
about this using the
#TidyTuesday
dataset on collegiate
sports in the US. 🏈
Here is the code I used in the video, for those who prefer reading instead of or in addition to video.
Explore data
Our modeling goal is to understand what affects expenditures on collegiate sports in the US. How many different sports are there in this dataset?
library(tidyverse)
sports_raw <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
unique(sports_raw$sports)
[1] "Baseball" "Basketball"
[3] "All Track Combined" "Football"
[5] "Golf" "Soccer"
[7] "Softball" "Tennis"
[9] "Volleyball" "Bowling"
[11] "Rifle" "Beach Volleyball"
[13] "Ice Hockey" "Lacrosse"
[15] "Gymnastics" "Rowing"
[17] "Swimming and Diving" "Track and Field, X-Country"
[19] "Equestrian" "Track and Field, Indoor"
[21] "Track and Field, Outdoor" "Wrestling"
[23] "Other Sports" "Rodeo"
[25] "Skiing" "Swimming"
[27] "Water Polo" "Archery"
[29] "Field Hockey" "Fencing"
[31] "Sailing" "Badminton"
[33] "Squash" "Diving"
[35] "Synchronized Swimming" "Table Tennis"
[37] "Weight Lifting" "Team Handball"
Let’s combine some of those sports categories:
sports_parsed <- sports_raw %>%
mutate(sports = case_when(
str_detect(sports, "Swimming") ~ "Swimming and Diving",
str_detect(sports, "Diving") ~ "Swimming and Diving",
str_detect(sports, "Track") ~ "Track",
TRUE ~ sports
))
unique(sports_parsed$sports)
[1] "Baseball" "Basketball" "Track"
[4] "Football" "Golf" "Soccer"
[7] "Softball" "Tennis" "Volleyball"
[10] "Bowling" "Rifle" "Beach Volleyball"
[13] "Ice Hockey" "Lacrosse" "Gymnastics"
[16] "Rowing" "Swimming and Diving" "Equestrian"
[19] "Wrestling" "Other Sports" "Rodeo"
[22] "Skiing" "Water Polo" "Archery"
[25] "Field Hockey" "Fencing" "Sailing"
[28] "Badminton" "Squash" "Table Tennis"
[31] "Weight Lifting" "Team Handball"
Let’s choose some variables to explore further and create a dataset with
bind_rows()
that has one row for each sport and gender.
sports <- bind_rows(
sports_parsed %>%
select(year, institution_name, sports,
participants = partic_men,
revenue = rev_men,
expenditure = exp_men) %>%
mutate(gender = "men"),
sports_parsed %>%
select(year, institution_name, sports,
participants = partic_women,
revenue = rev_women,
expenditure = exp_women) %>%
mutate(gender = "women")
) %>%
na.omit()
sports
# A tibble: 130,748 × 7
year institution_name sports participants revenue expenditure gender
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr>
1 2015 Alabama A & M University Baseb… 31 345592 397818 men
2 2015 Alabama A & M University Baske… 19 1211095 817868 men
3 2015 Alabama A & M University Track 61 183333 246949 men
4 2015 Alabama A & M University Footb… 99 2808949 3059353 men
5 2015 Alabama A & M University Golf 9 78270 83913 men
6 2015 Alabama A & M University Tennis 7 78274 99612 men
7 2015 University of Alabama a… Baseb… 32 1286361 1245150 men
8 2015 University of Alabama a… Baske… 13 4189826 4189826 men
9 2015 University of Alabama a… Golf 10 407728 407728 men
10 2015 University of Alabama a… Soccer 33 1062855 1052063 men
# … with 130,738 more rows
In the screencast I did more EDA, but here let’s just make one exploratory plot.
sports %>%
mutate(sports = fct_reorder(sports, expenditure)) %>%
ggplot(aes(expenditure, sports, fill = gender, color = gender)) +
geom_boxplot(position = position_dodge(preserve = "single"), alpha = 0.2) +
scale_x_log10(labels = scales::dollar) +
theme(legend.position = "top") +
scale_fill_brewer(palette = "Dark2") +
scale_color_brewer(palette = "Dark2") +
labs(y = NULL, color = NULL, fill = NULL, x = "Expenditures per team")
Notice the log scale and those outliers for sports like football and men’s basketball! 😳 It doesn’t look like there is much difference between men and women for any given sport.
Build linear models
Let’s take a straightforward, “native R” approach to fitting two linear models for this data:
-
explaining expenditures based on number of participants and gender
-
the same, but adding in sport as a predictor to estimate the impact of different sports on how much money is spent per team
ignore_sport <-
lm(expenditure ~ gender + participants, data = sports)
account_for_sport <-
lm(expenditure ~ gender + participants + sports, data = sports)
In tidymodels, we recommend using broom to handle the output of models like these, so we can more easily handle, manipulate, and visualize our results. Check out Chapter 3 of Tidy Modeling with R for more on this topic!
library(broom)
bind_rows(
tidy(ignore_sport) %>% mutate(sport = "ignore"),
tidy(account_for_sport) %>% mutate(sport = "account for sport")
) %>%
filter(!str_detect(term, "sports"), term != "(Intercept)") %>%
ggplot(aes(estimate, term, color = sport)) +
geom_vline(xintercept = 0, size = 1.5, lty = 2, color = "gray50") +
geom_errorbar(size = 1.4, alpha = 0.7,
aes(xmin = estimate - 1.96 * std.error, xmax = estimate + 1.96 * std.error)) +
geom_point(size = 3) +
scale_x_continuous(labels = scales::dollar) +
theme(legend.position="bottom") +
scale_color_brewer(palette = "Accent") +
labs(x = "Change in expenditures", y = NULL, color = "Include sport in model?",
title = "Expenditures on college sports",
subtitle = "Colleges spend less on women's sports overall\nFor the same sport, we don't see differences by gender")
We see here that colleges spend less per team overall on women’s sports, but this isn’t true when we control for sport. Basically, it’s just football driving the differences between men and women! Also, when we account for sport, the increase in expenditure per participant comes down a lot.
Bootstrap intervals
We used the standard intervals from lm()
in the section above, but
what if we’re worried about the assumptions of OLS and/or just want to
create more robust interval estimates? We can use
bootstrap
intervals instead.
There are
several ways to estimate bootstrap
intervals in
tidymodels, but the
simplest is using reg_intervals()
from
rsample:
library(rsample)
set.seed(123)
ignore_intervals <-
reg_intervals(expenditure ~ gender + participants, data = sports, times = 500)
set.seed(123)
account_for_sport_intervals <-
reg_intervals(expenditure ~ gender + participants + sports, data = sports, times = 500)
What are the estimates for the change in expenditures for each sport?
account_for_sport_intervals %>%
filter(str_detect(term, "sports")) %>%
arrange(desc(.estimate))
# A tibble: 30 × 6
term .lower .estimate .upper .alpha .method
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 sportsFootball 2634926. 2835381. 3067996. 0.05 student-t
2 sportsGymnastics 644231. 704069. 761619. 0.05 student-t
3 sportsIce Hockey 523371. 578347. 648220. 0.05 student-t
4 sportsBasketball 549469. 575278. 602532. 0.05 student-t
5 sportsEquestrian 109748. 204648. 305266. 0.05 student-t
6 sportsRifle 89728. 155763. 210560. 0.05 student-t
7 sportsVolleyball 130863. 146536. 161185. 0.05 student-t
8 sportsSkiing 100897. 126746. 151963. 0.05 student-t
9 sportsRowing 73766. 119037. 159192. 0.05 student-t
10 sportsField Hockey 89872. 118878. 142696. 0.05 student-t
# … with 20 more rows
The difference between football and the next sport is LARGE. Let’s make a similar plot for the model coefficients as in the last section.
bind_rows(
ignore_intervals %>% mutate(sport = "ignore"),
account_for_sport_intervals %>% mutate(sport = "account for sport")
) %>%
filter(!str_detect(term, "sports")) %>%
ggplot(aes(.estimate, term, color = sport)) +
geom_vline(xintercept = 0, size = 1.5, lty = 2, color = "gray50") +
geom_errorbar(size = 1.4, alpha = 0.7,
aes(xmin = .lower, xmax = .upper)) +
geom_point(size = 3) +
scale_x_continuous(labels = scales::dollar) +
scale_color_brewer(palette = "Accent") +
theme(legend.position="bottom") +
labs(x = "Change in expenditures", y = NULL, color = "Include sport in model?",
title = "Bootstrap confidence intervals for expenditures in college sports",
subtitle = "Colleges spend less on women's sports overall\nIn any specific sport, we don't see evidence for differences")
This plot looks very similar, although the relative size of the intervals for gender and number of participants has changed (intervals for number of participants are larger; intervals for gender are smaller). Again, we see that overall, the expenditures per team are much less for women’s sports, but that we don’t have evidence for differences within individual sports.
- Posted on:
- April 9, 2022
- Length:
- 6 minute read, 1242 words
- Categories:
- rstats tidymodels
- Tags:
- rstats tidymodels