Getting started with k-means and #TidyTuesday employment status

By Julia Silge in rstats tidymodels

February 24, 2021

This is the latest in my series of screencasts demonstrating how to use the tidymodels packages, from starting out with first modeling steps to tuning more complex models. Today’s screencast uses the broom package to tidy output from k-means clustering, with this week’s #TidyTuesday dataset on employment and demographics.


Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Explore the data

Our modeling goal is to use k-means clustering to explore employment by race and gender. This is a good screencast for folks who are more new to k-means and want to understand how to apply it to a real-world data set.

library(tidyverse)

employed <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-23/employed.csv")

Let’s start by focusing on the industry and occupation combinations available in this data, and average over the years available. We aren’t looking at any time trends, but instead at the demographic relationships.

employed_tidy <- employed %>%
  filter(!is.na(employ_n)) %>%
  group_by(occupation = paste(industry, minor_occupation), race_gender) %>%
  summarise(n = mean(employ_n)) %>%
  ungroup()

Let’s create a dataframe read for k-means. We need to center and scale the variables we are going to use, since they are on such different scales: the proportions of each category who are Asian, Black, or women and the total number of people in each category.

employment_demo <- employed_tidy %>%
  filter(race_gender %in% c("Women", "Black or African American", "Asian")) %>%
  pivot_wider(names_from = race_gender, values_from = n, values_fill = 0) %>%
  janitor::clean_names() %>%
  left_join(employed_tidy %>%
    filter(race_gender == "TOTAL") %>%
    select(-race_gender) %>%
    rename(total = n)) %>%
  filter(total > 1e3) %>%
  mutate(across(c(asian, black_or_african_american, women), ~ . / (total)),
    total = log(total),
    across(where(is.numeric), ~ as.numeric(scale(.)))
  ) %>%
  mutate(occupation = snakecase::to_snake_case(occupation))

employment_demo
## # A tibble: 230 x 5
##    occupation                          asian black_or_african_a…    women  total
##    <chr>                               <dbl>               <dbl>    <dbl>  <dbl>
##  1 agriculture_and_related_construct… -0.553             -0.410  -1.31    -1.48
##  2 agriculture_and_related_farming_f… -0.943             -1.22   -0.509    0.706
##  3 agriculture_and_related_installat… -0.898             -1.28   -1.38    -0.992
##  4 agriculture_and_related_manage_me… -1.06              -1.66   -0.291    0.733
##  5 agriculture_and_related_managemen… -1.06              -1.65   -0.300    0.750
##  6 agriculture_and_related_office_an… -0.671             -1.54    2.23    -0.503
##  7 agriculture_and_related_productio… -0.385             -0.0372 -0.622   -0.950
##  8 agriculture_and_related_professio… -0.364             -1.17    0.00410 -0.782
##  9 agriculture_and_related_protectiv… -1.35              -0.647  -0.833   -1.39
## 10 agriculture_and_related_sales_and… -1.35              -1.44    0.425   -1.36
## # … with 220 more rows

Implement k-means clustering

Now we can implement k-means clustering, starting out with three centers. What does the output look like?

employment_clust <- kmeans(select(employment_demo, -occupation), centers = 3)
summary(employment_clust)
##              Length Class  Mode
## cluster      230    -none- numeric
## centers       12    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

The original format of the output isn’t as practical to deal with in many circumstances, so we can load the broom package (part of tidymodels) and use verbs like tidy(). This will give us the centers of the clusters we found:

library(broom)
tidy(employment_clust)
## # A tibble: 3 x 7
##      asian black_or_african_american  women  total  size withinss cluster
##      <dbl>                     <dbl>  <dbl>  <dbl> <int>    <dbl> <fct>
## 1  1.46                       -0.551  0.385  0.503    45     125. 1
## 2 -0.732                      -0.454 -0.820 -0.655    91     189. 2
## 3  0.00978                     0.704  0.610  0.393    94     211. 3

If we augment() the clustering results with our original data, we can plot any of the dimensions of our space, such as total employed vs. proportion who are Black. We can see here that there are really separable clusters but instead a smooth, continuous distribution from low to high along both dimensions. Switch out another dimension like asian to see that projection of the space.

augment(employment_clust, employment_demo) %>%
  ggplot(aes(total, black_or_african_american, color = .cluster)) +
  geom_point()

Choosing k

We used k = 3 but how do we know that’s right? There are lots of complicated or “more art than science” ways of choosing k. One way is to look at the total within-cluster sum of squares and see if it stops dropping off so quickly at some value for k. We can get that from another verb from broom, glance(); let’s try lots of values for k and see what happens to the total sum of squares.

kclusts <-
  tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~ kmeans(select(employment_demo, -occupation), .x)),
    glanced = map(kclust, glance)
  )

kclusts %>%
  unnest(cols = c(glanced)) %>%
  ggplot(aes(k, tot.withinss)) +
  geom_line(alpha = 0.5, size = 1.2, color = "midnightblue") +
  geom_point(size = 2, color = "midnightblue")

I don’t see a major “elbow” 💪 but I’d say that k = 5 looks pretty reasonable. Let’s fit k-means again.

final_clust <- kmeans(select(employment_demo, -occupation), centers = 5)

To visualize this final result, let’s use plotly and add the occupation name to the hover so we can mouse around and see which occupations are more similar.

library(plotly)

p <- augment(final_clust, employment_demo) %>%
  ggplot(aes(total, women, color = .cluster, name = occupation)) +
  geom_point()

ggplotly(p, height = 500)

Remember that you can switch out the axes for asian or black_or_african_american to explore dimensions.

Posted on:
February 24, 2021
Length:
4 minute read, 833 words
Categories:
rstats tidymodels
Tags:
rstats tidymodels
See Also:
Educational attainment in #TidyTuesday UK towns
Changes in #TidyTuesday US polling places
Empirical Bayes for #TidyTuesday Doctor Who episodes