Educational attainment in #TidyTuesday UK towns
By Julia Silge
April 24, 2024
This is the latest in my series of screencasts! This screencast focuses on how to use Posit Team for the ML lifecycle, from EDA through model development and then deployment, with a recent #TidyTuesday dataset on educational attainment in UK towns. This screencast and blog post are part of the monthly Workflows with Posit Team series.
Here is the code I used in the video, for those who prefer reading instead of or in addition to video. In the video, you’ll notice I’m working on Posit Workbench.
Explore data
Our modeling goal is to predict the educational attainment in UK towns, based on various characteristics of those towns. Let’s start by reading in the data:
library(tidyverse)
education <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-01-23/english_education.csv')
glimpse(education)
## Rows: 1,104
## Columns: 31
## $ town11cd <chr> "E34…
## $ town11nm <chr> "Car…
## $ population_2011 <dbl> 5456…
## $ size_flag <chr> "Sma…
## $ rgn11nm <chr> "Eas…
## $ coastal <chr> "Non…
## $ coastal_detailed <chr> "Sma…
## $ ttwa11cd <chr> "E30…
## $ ttwa11nm <chr> "Wor…
## $ ttwa_classification <chr> "Maj…
## $ job_density_flag <chr> "Res…
## $ income_flag <chr> "Hig…
## $ university_flag <chr> "No …
## $ level4qual_residents35_64_2011 <chr> "Low…
## $ ks4_2012_2013_counts <dbl> 65, …
## $ key_stage_2_attainment_school_year_2007_to_2008 <dbl> 65.0…
## $ key_stage_4_attainment_school_year_2012_to_2013 <dbl> 70.7…
## $ level_2_at_age_18 <dbl> 72.3…
## $ level_3_at_age_18 <dbl> 50.7…
## $ activity_at_age_19_full_time_higher_education <dbl> 30.7…
## $ activity_at_age_19_sustained_further_education <dbl> 21.5…
## $ activity_at_age_19_appprenticeships <dbl> NA, …
## $ activity_at_age_19_employment_with_earnings_above_0 <dbl> 52.3…
## $ activity_at_age_19_employment_with_earnings_above_10_000 <dbl> 36.9…
## $ activity_at_age_19_out_of_work <dbl> NA, …
## $ highest_level_qualification_achieved_by_age_22_less_than_level_1 <dbl> NA, …
## $ highest_level_qualification_achieved_by_age_22_level_1_to_level_2 <dbl> 34.9…
## $ highest_level_qualification_achieved_by_age_22_level_3_to_level_5 <dbl> 39.7…
## $ highest_level_qualification_achieved_by_age_22_level_6_or_above <dbl> NA, …
## $ highest_level_qualification_achieved_b_age_22_average_score <dbl> 3.32…
## $ education_score <dbl> -0.5…
How does the education_score vary across towns and cities?
education |>
  filter(!is.na(income_flag)) |>
  mutate(income_flag = factor(income_flag,
                              levels = c("Lower deprivation towns",
                                         "Mid deprivation towns",
                                         "Higher deprivation towns",
                                         "Cities"))) |>
  ggplot(aes(education_score, income_flag, fill = size_flag)) +
  geom_boxplot(alpha = 0.2) +
  labs(y = NULL, fill = NULL)
It definitely looks like there is variation across both income and town size, and we will want to train a model that can use that variation to predict educational attainment.
Build a model
We can start by loading the tidymodels metapackage, splitting our data into training and testing sets, and creating bootstrap resamples. Think about this stage as spending your data budget.
library(tidymodels)
set.seed(123)
edu_split <-
  education |>
  filter(str_detect(income_flag, "deprivation towns")) |>
  select(education_score, income_flag, size_flag, coastal, university_flag, rgn11nm) |>
  initial_split(strata = education_score)
edu_train <- training(edu_split)
edu_test <- testing(edu_split)
set.seed(234)
edu_folds <- bootstraps(edu_train, strata = education_score)
edu_folds
## # Bootstrap sampling using stratification
## # A tibble: 25 × 2
## splits id
## <list> <chr>
## 1 <split [810/298]> Bootstrap01
## 2 <split [810/294]> Bootstrap02
## 3 <split [810/302]> Bootstrap03
## 4 <split [810/307]> Bootstrap04
## 5 <split [810/300]> Bootstrap05
## 6 <split [810/311]> Bootstrap06
## 7 <split [810/272]> Bootstrap07
## 8 <split [810/302]> Bootstrap08
## 9 <split [810/298]> Bootstrap09
## 10 <split [810/306]> Bootstrap10
## # ℹ 15 more rows
Now let’s create a modeling workflow for this data, with a straightforward formula preprocessor and a random forest model. Random forest models tend to work well with rectangular data as long as you have enough trees, even without hyperparameter tuning. How well does this do in our situation here?
library(future)
plan(multisession, workers = 4)
edu_wf <- workflow(
  education_score ~ .,
  rand_forest(mode = "regression", trees = 500)
)
set.seed(123)
edu_res <- fit_resamples(edu_wf, edu_folds)
collect_metrics(edu_res)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 2.71 25 0.0213 Preprocessor1_Model1
## 2 rsq standard 0.458 25 0.00676 Preprocessor1_Model1
The rmse metric is on the same scale as education_score, our outcome. If we decide this model is good enough for our purposes, we can then use last_fit() to fit one final time to the training data and evaluate one final time on the testing data.
edu_fit <- last_fit(edu_wf, edu_split)
collect_metrics(edu_fit)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 2.55 Preprocessor1_Model1
## 2 rsq standard 0.492 Preprocessor1_Model1
Notice that this is the first time we’ve used the testing data. Our metrics on the testing data are about the same as from our resampling folds. Let’s evaluate the predictions on the testing data visually:
collect_predictions(edu_fit) |>
  ggplot(aes(education_score, .pred)) +
  geom_abline(slope = 1, lty = 2, linewidth = 1.5, alpha = 0.7) +
  geom_point() +
  coord_fixed()
We can see from this visualization that we are doing a worse job predicting the highest educational attainment scores. This is a common problem to run into when training models, and it is something to keep in mind as we think about how to use this model in practice.
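To put a number on that pattern, here is a quick check of my own (a sketch, not from the original screencast): compute RMSE within quartiles of the observed outcome, using ntile() from dplyr and the fact that yardstick metrics respect dplyr groups.

collect_predictions(edu_fit) |>
  # split the observed scores into quartiles
  mutate(score_bin = ntile(education_score, 4)) |>
  group_by(score_bin) |>
  # yardstick metrics are computed per group, so this gives RMSE per quartile
  rmse(truth = education_score, estimate = .pred)

If the RMSE for the top quartile is noticeably larger than for the others, that confirms what the plot suggests about the highest-attainment towns.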
Deploy our model
So far, we have covered EDA and model development, including evaluating our model. Now we can deploy our model for use in production. Let's extract the final fitted workflow from our last_fit() object, and then create a deployable model bundle using vetiver.
library(vetiver)
v <- extract_workflow(edu_fit) |>
  vetiver_model(
    "uk-edu-rf",
    metadata = list(metrics = collect_metrics(edu_fit))
  )
v
##
## ── uk-edu-rf ─ <bundled_workflow> model for deployment
## A ranger regression modeling workflow using 5 features
When you use vetiver_model(), some model metadata is automatically included, but we can also store custom metadata that we want to use later, like the metrics from model development. After our deployable model bundle is created, we can store and version it using pins. We've been working on Posit Workbench so far, and now for the first time we are going to use Posit Connect, as our deployment target for both the model binary object and the model API.
library(pins)
board <- board_connect()
board |> vetiver_pin_write(v)
When we use vetiver_pin_write(), we store and publish a versioned model binary together with its metadata on Connect. Posit Connect can be used as a model repository with this approach.
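Since the model is pinned and versioned, we can come back to it later. As a sketch of my own (not from the original post), we can list the versions of the pin, and read back the custom metadata we stored; I'm assuming here that the metrics live under the pin's user metadata, which is where vetiver puts custom metadata.

# list the versions of the pinned model stored on Connect
board |> pin_versions("julia.silge/uk-edu-rf")

# read back the custom metadata we stored, such as our development metrics
pin_meta(board, "julia.silge/uk-edu-rf")$user$metrics

After the versioned model object is safely stored, we can deploy an API to generate predictions using this model: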
vetiver_deploy_rsconnect(
  board,
  "julia.silge/uk-edu-rf",
  predict_args = list(debug = TRUE)
)
Because we are using Posit Team, we automatically get exactly the right versions of the packages needed for prediction installed in the production environment using Posit Package Manager. It’s pretty fast too!
We automatically get some default visual documentation for our model, which can be shared with colleagues who need to know how the model API works and how to interact with it.
Since we deployed our model as a REST API, anyone can query it using curl or their own programming language of choice. There is also special support for getting predictions in R, treating the endpoint for predictions almost the same as a model in memory. First we create a vetiver endpoint object:
url <- "https://pub.palm.ptd.posit.it/uk-edu-rf/predict"
endpoint <- vetiver_endpoint(url)
endpoint
##
## ── A model API endpoint for prediction:
## https://pub.palm.ptd.posit.it/uk-edu-rf/predict
Then we can use the predict() function to get predictions from the model API. We can pass in a data frame of new data to get predictions for, and we can also pass in any necessary headers. Here we are using an API key for authentication, since I set the access to this API to require that:
connect_auth <- httr::add_headers(
  Authorization = paste("Key", Sys.getenv("CONNECT_API_KEY"))
)

predict(
  endpoint,
  slice_sample(edu_test, n = 10),
  connect_auth
)
## # A tibble: 10 × 1
## .pred
## <dbl>
## 1 -2.84
## 2 2.52
## 3 0.678
## 4 2.69
## 5 -2.78
## 6 3.78
## 7 -1.24
## 8 -2.54
## 9 -2.51
## 10 -2.79
Using Posit Team throughout the ML lifecycle provides a great experience, and we were able to:
- Explore our data and develop our model on Posit Workbench
- Deploy our model to Posit Connect
- Use the correct versions of packages for prediction with Posit Package Manager
At the same time, since we used open source packages like tidymodels and vetiver for model development and deployment, we aren’t locked in to any given infrastructure or set of professional tools. We can use the same code and the same models in an entirely different environment if we need to!
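For example, here is a sketch of how the same vetiver workflow could target different infrastructure; this is my own illustration with a hypothetical bucket name, not something from the original post. We pin the model to an S3 bucket instead of Connect, and generate Docker deployment files instead of deploying to Connect:

library(pins)
library(vetiver)

# hypothetical alternative target: version the model on S3 instead of Connect
board <- board_s3("my-model-bucket")
board |> vetiver_pin_write(v)

# generate a Dockerfile, plumber file, and lockfile to serve the model API
# on any container-based infrastructure
vetiver_prepare_docker(board, "uk-edu-rf")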