Multiclass predictive modeling for #TidyTuesday NBER papers
By Julia Silge in rstats tidymodels
September 29, 2021
This is the latest in my series of
screencasts demonstrating how to use the
tidymodels packages, from just getting started to tuning more complex models. Todayβs screencast walks through how to build, tune, and evaluate a multiclass predictive model with text features and lasso regularization, with this weekβs
#TidyTuesday
dataset on NBER working papers. π
Here is the code I used in the video, for those who prefer reading instead of or in addition to video.
Explore data
Our modeling goal is to predict the category of National Bureau of Economic Research working papers from the titles and years of the papers.
library(tidyverse)
papers <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/papers.csv")
programs <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/programs.csv")
paper_authors <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/paper_authors.csv")
paper_programs <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-28/paper_programs.csv")
Letβs start by joining up these datasets to find the info we need.
papers_joined <-
paper_programs %>%
left_join(programs) %>%
left_join(papers) %>%
filter(!is.na(program_category)) %>%
distinct(paper, program_category, year, title)
papers_joined %>%
count(program_category)
## # A tibble: 3 Γ 2
## program_category n
## <chr> <int>
## 1 Finance 4336
## 2 Macro/International 12012
## 3 Micro 18527
The papers are in three categories (finance, microeconomics, and macroeconomics) so weβll be training a multiclass predictive model, not a binary classification model as we often see or use.
Letβs create one exploratory plot before we move on to modeling.
library(tidytext)
library(tidylo)
title_log_odds <-
papers_joined %>%
unnest_tokens(word, title) %>%
filter(!is.na(program_category)) %>%
count(program_category, word, sort = TRUE) %>%
bind_log_odds(program_category, word, n)
title_log_odds %>%
group_by(program_category) %>%
slice_max(log_odds_weighted, n = 10) %>%
ungroup() %>%
ggplot(aes(log_odds_weighted,
fct_reorder(word, log_odds_weighted),
fill = program_category
)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(program_category), scales = "free_y") +
labs(x = "Log odds (weighted)", y = NULL)
These type of relationships between category and title words are what we want to use in our predictive model.
Build and tune a model
Letβs start our modeling by setting up our βdata budget.β Weβll stratify by our outcome program_category
.
library(tidymodels)
set.seed(123)
nber_split <- initial_split(papers_joined, strata = program_category)
nber_train <- training(nber_split)
nber_test <- testing(nber_split)
set.seed(234)
nber_folds <- vfold_cv(nber_train, strata = program_category)
nber_folds
## # 10-fold cross-validation using stratification
## # A tibble: 10 Γ 2
## splits id
## <list> <chr>
## 1 <split [23539/2617]> Fold01
## 2 <split [23539/2617]> Fold02
## 3 <split [23540/2616]> Fold03
## 4 <split [23540/2616]> Fold04
## 5 <split [23540/2616]> Fold05
## 6 <split [23541/2615]> Fold06
## 7 <split [23541/2615]> Fold07
## 8 <split [23541/2615]> Fold08
## 9 <split [23541/2615]> Fold09
## 10 <split [23542/2614]> Fold10
Next, letβs set up our feature engineering. We will need to transform our text data into features useful for our model by tokenizing and computing (in this case) tf-idf. Letβs also downsample since our dataset is imbalanced, with many more of some of the categories than others.
library(themis)
library(textrecipes)
nber_rec <-
recipe(program_category ~ year + title, data = nber_train) %>%
step_tokenize(title) %>%
step_tokenfilter(title, max_tokens = 200) %>%
step_tfidf(title) %>%
step_downsample(program_category)
nber_rec
## Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 2
##
## Operations:
##
## Tokenization for title
## Text filtering for title
## Term frequency-inverse document frequency with title
## Down-sampling based on program_category
Then, letβs create our model specification for a lasso model. We need to use a model specification that can handle multiclass data, in this case multinom_reg()
.
multi_spec <-
multinom_reg(penalty = tune(), mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
multi_spec
## Multinomial Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
Now itβs time to put the preprocessing and model together in a workflow()
.
nber_wf <- workflow(nber_rec, multi_spec)
nber_wf
## ββ Workflow ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
## Preprocessor: Recipe
## Model: multinom_reg()
##
## ββ Preprocessor ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
## 4 Recipe Steps
##
## β’ step_tokenize()
## β’ step_tokenfilter()
## β’ step_tfidf()
## β’ step_downsample()
##
## ββ Model βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
## Multinomial Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
Since the lasso regularization penalty
is a hyperparameter of the model (we canβt find the best value from fitting the model a single time), letβs tune over a grid of possible penalty
parameters.
nber_grid <- grid_regular(penalty(range = c(-5, 0)), levels = 20)
doParallel::registerDoParallel()
set.seed(2021)
nber_rs <-
tune_grid(
nber_wf,
nber_folds,
grid = nber_grid
)
nber_rs
## # Tuning results
## # 10-fold cross-validation using stratification
## # A tibble: 10 Γ 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [23539/2617]> Fold01 <tibble [40 Γ 5]> <tibble [0 Γ 1]>
## 2 <split [23539/2617]> Fold02 <tibble [40 Γ 5]> <tibble [0 Γ 1]>
## 3 <split [23540/2616]> Fold03 <tibble [40 Γ 5]> <tibble [0 Γ 1]>
## 4 <split [23540/2616]> Fold04 <tibble [40 Γ 5]> <tibble [0 Γ 1]>
## 5 <split [23540/2616]> Fold05 <tibble [40 Γ 5]> <tibble [0 Γ 1]>
## 6 <split [23541/2615]> Fold06 <tibble [40 Γ 5]> <tibble [0 Γ 1]>
## 7 <split [23541/2615]> Fold07 <tibble [40 Γ 5]> <tibble [0 Γ 1]>
## 8 <split [23541/2615]> Fold08 <tibble [40 Γ 5]> <tibble [0 Γ 1]>
## 9 <split [23541/2615]> Fold09 <tibble [40 Γ 5]> <tibble [0 Γ 1]>
## 10 <split [23542/2614]> Fold10 <tibble [40 Γ 5]> <tibble [0 Γ 1]>
This is a pretty fast model to fit, since it is linear. How did it turn out?
autoplot(nber_rs)
show_best(nber_rs)
## # A tibble: 5 Γ 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00234 roc_auc hand_till 0.784 10 0.00249 Preprocessor1_Model10
## 2 0.00428 roc_auc hand_till 0.783 10 0.00244 Preprocessor1_Model11
## 3 0.00127 roc_auc hand_till 0.783 10 0.00251 Preprocessor1_Model09
## 4 0.000695 roc_auc hand_till 0.782 10 0.00253 Preprocessor1_Model08
## 5 0.000379 roc_auc hand_till 0.782 10 0.00254 Preprocessor1_Model07
Choose and evaluate a final model
We could use the numerically best model with select_best()
by often with regularized models we would rather choose a simpler model within some limits of performance. We can choose using the βone-standard error ruleβ with select_by_one_std_err()
and then use last_fit()
to fit one time to the training data and evaluate one time to the testing data.
final_penalty <-
nber_rs %>%
select_by_one_std_err(metric = "roc_auc", desc(penalty))
final_penalty
## # A tibble: 1 Γ 9
## penalty .metric .estimator mean n std_err .config .best .bound
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 0.00428 roc_auc hand_till 0.783 10 0.00244 Preprocessor1_Mod⦠0.784 0.781
final_rs <-
nber_wf %>%
finalize_workflow(final_penalty) %>%
last_fit(nber_split)
final_rs
## # Resampling results
## # Manual resampling
## # A tibble: 1 Γ 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>
## 1 <split [26156/8719]> train/test split <tibble β¦ <tibblβ¦ <tibble [8,β¦ <workfloβ¦
How did our final model perform on the testing data?
collect_metrics(final_rs)
## # A tibble: 2 Γ 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.609 Preprocessor1_Model1
## 2 roc_auc hand_till 0.779 Preprocessor1_Model1
We can visualize the difference in performance across classes with a confusion matrix.
collect_predictions(final_rs) %>%
conf_mat(program_category, .pred_class) %>%
autoplot()
We can also visualize the ROC curves for each class.
collect_predictions(final_rs) %>%
roc_curve(truth = program_category, .pred_Finance:.pred_Micro) %>%
ggplot(aes(1 - specificity, sensitivity, color = .level)) +
geom_abline(slope = 1, color = "gray50", lty = 2, alpha = 0.8) +
geom_path(size = 1.5, alpha = 0.7) +
labs(color = NULL) +
coord_fixed()
It looks like the finance and microeconomics papers were easier to identify than the macroeconomics papers.
Finally, we can extract (and save, if we like) the fitted workflow from our results to use for predicting on new data.
final_fitted <- extract_workflow(final_rs)
## can save this for prediction later with readr::write_rds()
predict(final_fitted, nber_test[111, ], type = "prob")
## # A tibble: 1 Γ 3
## .pred_Finance `.pred_Macro/International` .pred_Micro
## <dbl> <dbl> <dbl>
## 1 0.104 0.531 0.365
We can even make up new paper titles and see how our model classifies them.
predict(final_fitted, tibble(year = 2021, title = "Pricing Models for Corporate Responsibility"), type = "prob")
## # A tibble: 1 Γ 3
## .pred_Finance `.pred_Macro/International` .pred_Micro
## <dbl> <dbl> <dbl>
## 1 0.598 0.158 0.244
predict(final_fitted, tibble(year = 2021, title = "Teacher Health and Medicaid Expansion"), type = "prob")
## # A tibble: 1 Γ 3
## .pred_Finance `.pred_Macro/International` .pred_Micro
## <dbl> <dbl> <dbl>
## 1 0.288 0.141 0.571
- Posted on:
- September 29, 2021
- Length:
- 7 minute read, 1372 words
- Categories:
- rstats tidymodels
- Tags:
- rstats tidymodels