Stack Overflow questions around the world

By Julia Silge

April 11, 2018

I am so lucky to work with so many generous, knowledgeable, and amazing people at Stack Overflow, including Ian Allen and Kirti Thorat. Both Ian and Kirti are part of biweekly sessions we have at Stack Overflow where several software developers join me in practicing R, data science, and modeling skills. This morning, the two of them went to a high school outreach event in NYC for students who have been studying computer science, equipped with Stack Overflow ✨ SWAG ✨, some coding activities based on Stack Overflow internal tools and packages, and a Shiny app that I developed to share a bit about who we are and what we do.

The Shiny app was fun to build and the results are interesting, so I thought I would share it here!

Data for the app

The data itself for this app tabulates questions asked on Stack Overflow over the past year, for the top several hundred tags in the top several hundred cities. The way I access this data involves our internal databases, but most of the data for questions and answers on Stack Overflow is public. In my opinion, the easiest way to get it in R is to use Google BigQuery and the bigrquery package.

After I query and aggregate the data I want, it looks like this.

## # A tibble: 131,801 x 8
##    Tag       Questions CityName    CountryName  Percent Latitude Longitude
##    <chr>         <int> <chr>       <chr>          <dbl>    <dbl>     <dbl>
##  1 javascri…    271426 London      United King…  0.0420     51.5    -0.126
##  2 python       220745 London      United King…  0.0470     51.5    -0.126
##  3 python       220745 New York    United Stat…  0.0399     40.7   -74.0  
##  4 javascri…    271426 Bengaluru   India         0.0314     13.0    77.6  
##  5 python       220745 San Franci… United Stat…  0.0375     37.8  -122.   
##  6 javascri…    271426 New York    United Stat…  0.0305     40.7   -74.0  
##  7 java         192107 Bengaluru   India         0.0429     13.0    77.6  
##  8 c#           139099 London      United King…  0.0471     51.5    -0.126
##  9 python       220745 Bengaluru   India         0.0286     13.0    77.6  
## 10 javascri…    271426 San Franci… United Stat…  0.0233     37.8  -122.   
## # ... with 131,791 more rows, and 1 more variable: LeafletPop <chr>

This data frame tells us for each tag on Stack Overflow, what is the total number of questions asked in the past year, and then for each city/metro area, what is the percent of that total that came from that city. The other columns (Latitude, Longitude, and LeafletPop) are what we are going to use for some mapping.

For example, we could look at just the R questions.

tag_city_questions %>%
    filter(Tag == "r")
## # A tibble: 522 x 8
##    Tag   Questions CityName      CountryName    Percent Latitude Longitude
##    <chr>     <int> <chr>         <chr>            <dbl>    <dbl>     <dbl>
##  1 r         59856 London        United Kingdom  0.0482     51.5    -0.126
##  2 r         59856 New York      United States   0.0399     40.7   -74.0  
##  3 r         59856 San Francisco United States   0.0273     37.8  -122.   
##  4 r         59856 Washington    United States   0.0270     38.9   -77.0  
##  5 r         59856 Amsterdam     Netherlands     0.0249     52.4     4.89 
##  6 r         59856 Boston        United States   0.0217     42.4   -71.1  
##  7 r         59856 Zurich        Switzerland     0.0201     47.4     8.55 
##  8 r         59856 Paris         France          0.0196     48.9     2.35 
##  9 r         59856 Bengaluru     India           0.0190     13.0    77.6  
## 10 r         59856 Toronto       Canada          0.0173     43.7   -79.4  
## # ... with 512 more rows, and 1 more variable: LeafletPop <chr>

These are some of the biggest cities in the world, with the largest developer populations. R is used proportionally more in some places than others, so the rankings of these cities will be different than, say, JavaScript.

Building a Leaflet map

For this app to share with the high school students, we wanted to have an interactive map so I turned to my standard option for that, Leaflet. How do we get started with a non-Shiny Leaflet map?


tag_city_questions %>%
    filter(Tag == "r") %>%
    leaflet(width = "100%") %>%
    addProviderTiles("CartoDB.Positron") %>%
    setView(lng = -90, lat = 40, zoom = 4) %>%
    addCircles(lng = ~Longitude, lat = ~Latitude, 
               color = NULL,
               radius = ~sqrt(Percent) * 5e5, popup = ~LeafletPop,
               fillColor = "blue", fillOpacity = 0.7)

An interactive map! With popup text and everything! But looking at a map like this with only one technology on it isn’t the most informative, because it mostly just shows us where developers live. You know, like that xkcd. What would be more interesting would be to compare multiple technologies at once, to be able to see where technologies are asked about proportionally more and less. To do that, let’s use Shiny.

Building a Shiny app

To make a Leaflet map for Shiny, it’s not too many more steps. The parts of the map that do not need to be redrawn from scratch with every user input go into a call to leafletOutput(), and then the parts of the map that do need to update with user input go into leafletProxy(). After a few more steps, the Shiny app is ready to go!


You can see the code that makes this app by visiting it and clicking on “Source Code” in the upper right.

This is quite interesting now. For example, you can see how anomalous the relative balance of TensorFlow to JavaScript is in the Bay Area compared to most other cities in the US. Lots of other tags to explore too!

One of the best things about working at Stack Overflow is the coworkers I have, and Ian and Kirti are great examples of that. I was really happy to get to help out with their high school coding outreach for this morning, and to build a fun Shiny app to boot. Let me know if you have any feedback or questions!

Posted on:
April 11, 2018
5 minute read, 1044 words
See Also:
Educational attainment in #TidyTuesday UK towns
Changes in #TidyTuesday US polling places
Empirical Bayes for #TidyTuesday Doctor Who episodes