# Educational attainment in #TidyTuesday UK towns

Let’s walk through the ML lifecycle from EDA to model development to deployment, using tidymodels, vetiver, and Posit Team.

# Changes in #TidyTuesday US polling places

Let’s use summarization and visualization to explore how the numbers of polling places have changed in the United States.

# Empirical Bayes for #TidyTuesday Doctor Who episodes

Which writers of Doctor Who episodes are rated the most highly? Let’s use empirical Bayes to find out.

# Logistic regression modeling for #TidyTuesday US House Elections

Today is Election Day in the United States, so let’s use logistic regression modeling to explore vote share in US House elections.

# Topic modeling for #TidyTuesday Taylor Swift lyrics

Learn how to fit and interpret an unsupervised text model for all of Taylor Swift’s ERAS.

# Where are #TidyTuesday haunted cemeteries compared to haunted schools?

Use tidy log odds to compare which US states are more likely to have haunted cemeteries or haunted schools.

# How often does Roy Kent say “F*CK”?

He’s here, he’s there, he’s every f*cking where, and we’re finding bootstrap confidence intervals.

# Evaluate multiple modeling approaches for #TidyTuesday spam email

Use workflowsets to evaluate multiple possible models to predict whether email is spam.

# Classification metrics for #TidyTuesday GPT detectors

Learn about different kinds of metrics for evaluating classification models, and how to compute, compare, and visualize them.

# What tokens are used more vs. less in #TidyTuesday place names?

Let’s use byte pair encoding tokenization along with Poisson regression to understand which tokens are used more often (or less often) in US place names.

# Predict the magnitude of #TidyTuesday tornadoes with effect encoding and xgboost

How well can we predict the magnitude of tornadoes in the US? Let’s use xgboost along with effect encoding to fit our model.

# Tune an xgboost model with early stopping and #TidyTuesday childcare costs

Can we predict childcare costs in the US using an xgboost model? In this blog post, learn how to use early stopping for hyperparameter tuning.

# Deploy a model on AWS SageMaker with vetiver

Learn how to train and deploy a model with R and vetiver on AWS SageMaker infrastructure.

# Use OpenAI text embeddings with #TidyTuesday horror movie descriptions

High quality text embeddings are becoming more available from companies like OpenAI. Learn how to obtain them and then use them for text analysis.

# Resampling to understand gender in #TidyTuesday art history data

Artists who are women are underrepresented in art history textbooks, and we can use resampling to robustly understand more about this imbalance.

# To downsample imbalanced data or not, with #TidyTuesday bird feeders

Will squirrels come eat from your bird feeder? Let’s fit a model both with and without downsampling to find out.

# High cardinality predictors for #TidyTuesday museums in the UK

Learn how to handle predictors with high cardinality using tidymodels for accreditation data on UK museums.

# Delete all your tweets using rtweet

Worried about how a certain social media platform is going and want to start removing yourself? Learn how to delete all your tweets.

# Find high FREX and high lift words for #TidyTuesday Stranger Things dialogue

New functionality in tidytext supports identifying high FREX and high lift words from topic modeling results.

# Predict the status of #TidyTuesday Bigfoot sightings

Learn how to use vetiver to set up different types of prediction endpoints for your deployed model.

# Use Docker to deploy a model for #TidyTuesday LEGO sets

After you train a model, you can use vetiver to prepare a Dockerfile and deploy your model in a flexible way.

# Sliding windows for #TidyTuesday rents in San Francisco

The slider package provides support for flexible sliding window aggregation, and we can use these kinds of sliding windows to analyze rents over time.

# Three ways to look at #TidyTuesday UK pay gap data

Use summarization, a single linear model, and bootstrapping to understand what economic activities involve a larger pay gap for women.

# Use resampling to understand #TidyTuesday drought in TX

The spatialsample package is gaining many new methods this summer, and we can use spatially aware resampling to understand how drought is related to other quantities across Texas.

# Predict #TidyTuesday NYT bestsellers

Will a book be on the NYT bestseller list a long time, or a short time? We walk through how to use wordpiece tokenization for the author names, and how to deploy your model as a REST API.

# Handling model coefficients for #TidyTuesday collegiate sports

Understand how much money colleges spend on sports using linear modeling and bootstrap intervals.

# Poisson regression for #TidyTuesday counts of R package vignettes

The tidymodels framework provides extension packages for specialized tasks such as Poisson regression. Learn how to fit a zero-inflated model for understanding how R package releases are related to number of vignettes.
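
As a quick sketch of the idea behind that post (notation mine): a zero-inflated Poisson model mixes a point mass at zero with an ordinary Poisson count model, so "structural" zeros are handled separately from zeros the Poisson itself can generate.

```latex
P(Y = 0) = \pi + (1 - \pi)\, e^{-\lambda}, \qquad
P(Y = k) = (1 - \pi)\, \frac{\lambda^{k} e^{-\lambda}}{k!} \quad (k \geq 1)
```

Here $\pi$ is the probability of a structural zero and $\lambda$ is the Poisson mean; each component can be modeled with its own set of predictors.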

# Inference for #TidyTuesday aircraft and rank of Tuskegee airmen

The infer package is part of tidymodels and provides an expressive statistical grammar. Understand how to use infer, and celebrate Black History Month by learning more about the Tuskegee airmen.

# Predict ratings for #TidyTuesday board games

Use custom feature engineering for board game categories, tune an xgboost model with racing methods, and use explainability methods for deeper understanding.

# Text predictors for #TidyTuesday chocolate ratings

Get started with feature engineering for text data, transforming text to be used in machine learning algorithms.

# Topic modeling for #TidyTuesday Spice Girls lyrics

Learn how to train, explore, and understand an unsupervised topic model for text data.

# Predicting viewership for #TidyTuesday Doctor Who episodes

Using a tidymodels workflow can make many modeling tasks more convenient, but sometimes you want more flexibility and control of how to handle your modeling objects. Learn how to handle resampled workflow results and extract the quantities you are interested in.

# Spatial resampling for #TidyTuesday and the #30DayMapChallenge

Use spatial resampling to more accurately estimate model performance for geographic data.

# Predict #TidyTuesday giant pumpkin weights with workflowsets

Get started with tidymodels workflowsets to handle and evaluate multiple preprocessing and modeling approaches simultaneously, using pumpkin competitions.

# Multiclass predictive modeling for #TidyTuesday NBER papers

Tune and evaluate a multiclass model with lasso regularization for economics working papers.

# Dimensionality reduction for #TidyTuesday Billboard Top 100 songs

Songs on the Billboard Top 100 have many audio features. We can use data preprocessing recipes to implement dimensionality reduction and understand how these features are related.

# Fit and predict with tidymodels for #TidyTuesday bird baths in Australia

In this screencast, focus on some tidymodels basics such as how to put together feature engineering and a model algorithm, and how to fit and predict.

# Modeling human/computer interactions on Star Trek from #TidyTuesday with workflowsets

Learn how to evaluate multiple feature engineering and modeling approaches with workflowsets, predicting whether a person or the computer spoke a line on Star Trek.

# Predict housing prices in Austin TX with tidymodels and xgboost

More xgboost with tidymodels! Learn about feature engineering to incorporate text information as indicator variables for boosted trees.

# Supervised Machine Learning for Text Analysis in R is now complete

Our new book in the Chapman & Hall/CRC Data Science Series is now complete and available for preorder!

# Tune xgboost models with early stopping to predict shelter animal status

Early stopping can keep an xgboost model from overfitting.

# Use racing methods to tune xgboost models and predict home runs

Models like xgboost have many tuning hyperparameters, but racing methods can help identify parameter combinations that are not performing well.

# Predict which #TidyTuesday Scooby Doo monsters are REAL with a tuned decision tree model

Which Scooby Doo monsters are REAL?! Walk through how to tune and then choose a decision tree model, as well as how to visualize and evaluate the results.

# Create a custom metric with tidymodels and NYC Airbnb prices

Predict prices for Airbnb listings in NYC with a data set from a recent episode of SLICED, with a focus on two specific aspects of this model analysis: creating a custom metric to evaluate the model and combining both tabular and unstructured text data in one model.

# Class imbalance and classification metrics with aircraft wildlife strikes

Handling class imbalance in modeling affects classification metrics in different ways. Learn how to use tidymodels to subsample for class imbalance, and how to estimate model performance using resampling.

# Partial dependence plots with tidymodels and DALEX for #TidyTuesday Mario Kart world records

Tune a decision tree model to predict whether a Mario Kart world record used a shortcut, and explore partial dependence profiles for the world record times.

# Predict availability in #TidyTuesday water sources with random forest models

Walk through a tidymodels analysis from beginning to end to predict whether water is available at a water source in Sierra Leone.

# Estimate change in #TidyTuesday CEO departures with bootstrap resampling

Are more CEO departures involuntary now than in the past? We can use tidymodels' bootstrap resampling and generalized linear models to understand change over time.

# Which #TidyTuesday Netflix titles are movies and which are TV shows?

Use tidymodels to build features for modeling from Netflix description text, then fit and evaluate a support vector machine model.

# Which #TidyTuesday post offices are in Hawaii?

Use tidymodels to predict post office location with subword features and a support vector machine model.

# Dimensionality reduction of #TidyTuesday United Nations voting patterns

Explore country-level UN voting with a tidymodels approach to unsupervised machine learning.

# Bootstrap confidence intervals for #TidyTuesday Super Bowl commercials

Estimate how commercial characteristics like humor and patriotic themes change with time using tidymodels functions for bootstrap confidence intervals.

# Getting started with k-means and #TidyTuesday employment status

Use tidy data principles to understand which kinds of occupations are most similar in terms of demographic characteristics.

# Understand your models with #TidyTuesday inequality in student debt

Explore results of models with convenient tidymodels functions.

# Learn tidytext with my new learnr course

I am happy to announce that this free, open source, interactive course on text mining with tidy data principles is now published!

# Explore art media over time in the #TidyTuesday Tate collection dataset

Check residuals and other model diagnostics for regression models trained on text features, all with tidymodels functions.

# Predicting injuries for Chicago traffic crashes

Download up-to-date city data from Chicago’s open data portal and predict whether a traffic crash involved an injury with a bagged tree model.

# Upcoming changes to tidytext: threat of COLLAPSE

The current development version of tidytext has changes that may affect your analyses.

# Tune random forests for #TidyTuesday IKEA prices

Use tidymodels scaffolding functions for getting started quickly with commonly used models like random forests.

# Tune and interpret decision trees for #TidyTuesday wind turbines

Use tidymodels to predict capacity for Canadian wind turbines with decision trees.

# Predicting class membership for the #TidyTuesday Datasaurus Dozen

Which of the Datasaurus Dozen are easier or harder for a random forest model to identify? Learn how to use multiclass evaluation metrics to find out.

# Modeling #TidyTuesday NCAA women’s basketball tournament seeds

Tune a hyperparameter and then understand how to choose the best value afterward, using tidymodels for modeling the relationship between expected wins and tournament seed.

# Handle class imbalance in #TidyTuesday climbing expedition data with tidymodels

Use tidymodels for feature engineering steps like imputing missing data and subsampling for class imbalance, and build predictive models to predict the probability of survival for Himalayan climbers.

# Introducing our new book, Tidy Modeling with R

An initial version of the first eleven chapters is available today! Look for more chapters to be released in the near future.

# Train and analyze many models for #TidyTuesday crop yields

Learn how to use tidyverse and tidymodels functions to fit and analyze many models at once.

# Build a #TidyTuesday predictive text model for The Last Airbender

Use text features and tidymodels to predict the speaker of individual lines from the show, and learn how to compute model-agnostic variable importance for any kind of model.

# Get started with tidymodels and #TidyTuesday Palmer penguins

Build two kinds of classification models and evaluate them using resampling.

# Supervised Machine Learning for Text Analysis in R

Announcing our new book, to be published in the Chapman & Hall/CRC Data Science Series!

# Bagging with tidymodels and #TidyTuesday astronaut missions

Learn how to use bootstrap aggregating to predict the duration of astronaut missions.

# The Bechdel test and the X-Mansion with tidymodels and #TidyTuesday

Explore data from the Claremont Run Project on Uncanny X-Men with bootstrap resampling.

# Impute missing data for #TidyTuesday voyages of captive Africans with tidymodels

Understand more about the forced transport of African people using the Slave Voyages database.

# PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes

Use tidymodels for unsupervised dimensionality reduction.

# tidylo is now on CRAN! 🎉

Measure how the frequency of some feature differs across some group or set, using the weighted log odds.

# Tune XGBoost with tidymodels and #TidyTuesday beach volleyball

Learn how to tune hyperparameters for an XGBoost classification model to predict wins and losses.

# Learn tidymodels with my supervised machine learning course

I am happy to announce that a new version of my free, online, interactive course has been published!

# Multinomial classification with tidymodels and #TidyTuesday volcano eruptions

Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast demonstrates how to implement multiclass or multinomial classification using this week’s #TidyTuesday dataset on volcanoes. 🌋 Here is the code I used in the video, for those who prefer reading instead of or in addition to video. Our modeling goal is to predict the type of volcano from this week’s #TidyTuesday dataset based on other volcano characteristics like latitude, longitude, tectonic setting, etc.

# Sentiment analysis with tidymodels and #TidyTuesday Animal Crossing reviews

A lot has been happening in the tidymodels ecosystem lately! There are many possible projects we on the tidymodels team could focus on next; we are interested in gathering community feedback to inform our priorities. If you are interested in sharing your opinion on next steps in tidymodels development, please take this short survey. Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models.

# Modeling #TidyTuesday GDPR violations with tidymodels

This is an exciting week for us on the tidymodels team; we launched tidymodels.org, a new central location with resources and documentation for tidymodels packages. There is a TON to explore and learn there! 🚀 You can check out the official blog post for more details. Today, I’m publishing here on my blog another screencast demonstrating how to use tidymodels. This is a good video for folks getting started with tidymodels, using this week’s #TidyTuesday dataset on GDPR violations.

# PCA and the #TidyTuesday best hip hop songs ever

Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m exploring a different part of the tidymodels framework; I’m showing how to implement principal component analysis via recipes with this week’s #TidyTuesday dataset on the best hip hop songs of all time as determined by a BBC poll of music critics. Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

# Bootstrap resampling with #TidyTuesday beer production data

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on beer production to show how to use bootstrap resampling to estimate model parameters. Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

# Tuning random forest hyperparameters with #TidyTuesday trees data

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using a #TidyTuesday dataset from earlier this year on trees around San Francisco to show how to tune the hyperparameters of a random forest model and then use the final best model. Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

# LASSO regression using tidymodels and #TidyTuesday data for The Office

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on The Office to show how to build a lasso regression model and choose regularization parameters! Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

# Preprocessing and resampling using #TidyTuesday college data

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first getting started to how to tune machine learning models. Today, I’m using this week’s #TidyTuesday dataset on college tuition and diversity at US colleges to show some data preprocessing steps and how to use resampling! Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

# Hyperparameter tuning and #TidyTuesday food consumption

Last week I published a screencast demonstrating how to use the tidymodels framework and specifically the recipes package. Today, I’m using this week’s #TidyTuesday dataset on food consumption around the world to show hyperparameter tuning! Here is the code I used in the video, for those who prefer reading instead of or in addition to video. Our modeling goal here is to predict which countries are Asian countries and which are not, based on their patterns of food consumption in the eleven categories from the #TidyTuesday dataset.

# #TidyTuesday hotel bookings and recipes

Last week I published my first screencast showing how to use the tidymodels framework for machine learning and modeling in R. Today, I’m using this week’s #TidyTuesday dataset on hotel bookings to show how to use one of the tidymodels packages recipes with some simple models! Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

# #TidyTuesday and tidymodels

This week I started my new job as a software engineer at RStudio, working with Max Kuhn and other folks on tidymodels. I am really excited about tidymodels because my own experience as a practicing data scientist has shown me some of the areas for growth that still exist in open source software when it comes to modeling and machine learning. Almost nothing has had the kind of dramatic impact on my productivity that the tidyverse and other RStudio investments have had; I am enthusiastic about contributing to that kind of user-focused transformation for modeling and machine learning.

# Modeling salary and gender in the tech industry

One of the biggest projects I have worked on over the past several years is the Stack Overflow Developer Survey, and one of the most unique aspects of this survey is the extensive salary data that is collected. This salary data is used to power the Stack Overflow Salary Calculator, and has been used by various folks to explore how people who use spaces make more than those who use tabs, whether that’s just a proxy for open source contributions, and more.

# Opioid prescribing habits in Texas

A paper I worked on was just published in a medical journal. This is quite an odd thing for me to be able to say, given my academic background and the career path I have had, but there you go! The first author of this paper is a long-time friend of mine working in anesthesiology and pain management, and he obtained data from the Texas Prescription Drug Monitoring Program (PDMP) about controlled substance prescriptions from April 2015 to 2018.

# (Re)Launching my supervised machine learning course

Today I am happy to announce a new(-ish), free, online, interactive course that I have developed, Supervised Machine Learning: Case Studies in R! 💫 Predictive modeling, or supervised machine learning, is a powerful tool for using data to make predictions about the world around us. Once you understand the basic ideas of supervised machine learning, the next step is to practice your skills so you know how to apply these techniques wisely and appropriately.

# Practice using lubridate… THEATRICALLY

I am so pleased to now be an RStudio-certified tidyverse trainer! 🎉 I have been teaching technical content for decades, whether in a university classroom, developing online courses, or leading workshops, but I still found this program valuable for my own professional development. I learned a lot that is going to make my teaching better, and I am happy to have been a participant. If you are looking for someone to lead trainings or workshops in your organization, you can check out this list of trainers to see who might be conveniently located for you!

# Introducing tidylo

Today I am so pleased to introduce a new package for calculating weighted log odds ratios, tidylo. Often in data analysis, we want to measure how the usage or frequency of some feature, such as words, differs across some group or set, such as documents. One statistic often used to find these kinds of differences in text data is tf-idf. Another option is to use the log odds ratio, but the log odds ratio alone does not account for sampling variability.
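
As a sketch of the statistic (notation mine, following Monroe, Colaresi, and Quinn’s "Fightin’ Words" approach that tidylo builds on): with $y_w^{(i)}$ the count of feature $w$ in group $i$ and $n^{(i)}$ the group total, the log odds ratio between groups $i$ and $j$, and its standardized (weighted) form with prior counts $\alpha_w$, look roughly like

```latex
\delta_w^{(i-j)} = \log\frac{y_w^{(i)}}{n^{(i)} - y_w^{(i)}}
                 - \log\frac{y_w^{(j)}}{n^{(j)} - y_w^{(j)}}, \qquad
z_w^{(i-j)} = \frac{\delta_w^{(i-j)}}{\sqrt{\operatorname{Var}\!\left(\delta_w^{(i-j)}\right)}},
\quad
\operatorname{Var}\!\left(\delta_w^{(i-j)}\right) \approx
\frac{1}{y_w^{(i)} + \alpha_w} + \frac{1}{y_w^{(j)} + \alpha_w}
```

Dividing by the estimated standard error is what accounts for sampling variability; rare features with extreme raw log odds ratios get shrunk toward zero.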

# Reordering and facetting for ggplot2

I recently wrote about the release of tidytext 0.2.1, and one of the most useful new features in this release is a couple of helper functions for making plots with ggplot2. These helper functions address a class of challenges that often arises when dealing with text data, so we’ve included them in the tidytext package. Let’s work through an example To show how to use these new functions, let’s walk through a more general example that does not deal with results that come from unstructured, free text.

# Fixing your mistakes: sentiment analysis edition

Today tidytext 0.2.1 is available on CRAN! This new release of tidytext has a collection of nice new features:

- Bug squashing! 🐛
- Improvements to error messages and documentation 📃
- Switching from broom to generics for lighter dependencies
- Addition of some helper plotting functions I look forward to blogging about soon

An additional change is significant and may be felt by you, the user, so I want to share a bit about it.

# Relaunching the qualtRics package

Note: cross-posted with the rOpenSci blog. rOpenSci is one of the first organizations in the R community I ever interacted with, when I participated in the 2016 rOpenSci unconf. I have since reviewed several rOpenSci packages and been so happy to be connected to this community, but I have never submitted or maintained a package myself. All that changed when I heard the call for a new maintainer for the qualtRics package.

# Writing a letter to DataCamp

Since 2017 I have been an instructor for DataCamp, the VC-backed online data science education platform. What this means is that I am not an employee, but I have developed content for the company as a contractor. I have two courses there, one on text mining and one on practical supervised machine learning. About two weeks ago, DataCamp published a blog post outlining an incident of sexual misconduct at the company.

# Read all about it! Navigating the R Package Universe

In the most recent issue of the R Journal, I have a new paper out with coauthors John Nash and Spencer Graves. Check out the abstract: Today, the enormous number of contributed packages available to R users outstrips any given user’s ability to understand how these packages work, their relative merits, or how they are related to each other. We organized a plenary session at useR!2017 in Brussels for the R community to think through these issues and ways forward.

# Feeling the rstudio::conf ❤️

I am heading home from my third year of attending rstudio::conf! If you weren’t there, watch for the videos to be released so you can check out the talks; I know I will do the same so I can see the talks I was forced to miss by scheduling constraints. I love this conference, and once again this year, the organizers have succeeded in building an impactful, valuable, inclusive conference.

# Text classification with tidy data principles

I am an enthusiastic proponent of using tidy data principles for dealing with text data. This kind of approach offers a fluent and flexible option not just for exploratory data analysis, but also for machine learning for text, both unsupervised and supervised. I haven’t written much about supervised machine learning for text, i.e. predictive modeling, using tidy data principles, so let’s walk through an example workflow for a text classification task.

# Word associations from the Small World of Words

Do you subscribe to the Data is Plural newsletter from Jeremy Singer-Vine? You probably should, because it is a treasure trove of interesting datasets arriving in your email inbox. In the November 28 edition, Jeremy linked to the Small World of Words project, and I was entranced. I love stuff like that, all about words and how people think of them. I have been mulling around a blog post ever since, and today I finally have my post done, so let’s see what’s up!

# TensorFlow, Jane Austen, and Text Generation

I remember the first time I saw a deep learning text generation project that was truly compelling and delightful to me. It was in 2016 when Andy Herd generated new Friends scenes by training a recurrent neural network on all the show’s episodes. Herd’s work went pretty viral at the time. Back then I dabbled a bit with Andrej Karpathy’s tutorials for character-level RNNs; his work and tutorials undergird a lot of the kind of STUNT TEXT GENERATION work we see in the world.

# Training, evaluating, and interpreting topic models

At the beginning of this year, I wrote a blog post about how to get started with the stm and tidytext packages for topic modeling. I have been doing more topic modeling in various projects, so I wanted to share some workflows I have found useful for training many topic models at one time, evaluating topic models and understanding model diagnostics, and exploring and interpreting the content of topic models.

# Amazon Alexa and Accented English

Earlier this spring, one of my data science friends here in SLC got in contact with me about some fun analysis. My friend Dylan Zwick is a founder at Pulse Labs, a voice-testing startup, and they were chatting with the Washington Post about a piece on how devices like Amazon Alexa deal with accented English. The piece is published today in the Washington Post and turned out really interesting! Let’s walk through the analysis I did for Dylan and Pulse Labs.

# Punctuation in literature

This morning I was scrolling through Twitter and noticed Alberto Cairo share this lovely data visualization piece by Adam J. Calhoun about the varying prevalence of punctuation in literature. I thought, “I want to do that!” It also offers me the opportunity to chat about a few of the new options available for tokenizing in tidytext via updates to the tokenizers package. Adam’s original piece explores how punctuation is used in nine novels, including my favorite Pride and Prejudice.

# Public Data Release of Stack Overflow’s 2018 Developer Survey

Note: Cross-posted with the Stack Overflow blog. Starting today, you can access the public data release for Stack Overflow’s 2018 Developer Survey. Over 100,000 developers from around the world shared their opinions about everything from their favorite technologies to job preferences, and this data is now available for you to analyze yourself. This year, we are partnering with Kaggle to publish and highlight this dataset. This means you can access the data both here on our site and on Kaggle Datasets, and that on Kaggle, you can explore the dataset using Kernels.

# Understanding PCA using Stack Overflow data

This year, I have given some talks about understanding principal component analysis using what I spend day in and day out with, Stack Overflow data. You can see a recording of one of these talks from rstudio::conf 2018. When I have given these talks, I’ve focused a lot on understanding PCA. This blog post walks through how I implemented PCA and how I made the plots I used in my talk.

# Stack Overflow questions around the world

I am so lucky to work with so many generous, knowledgeable, and amazing people at Stack Overflow, including Ian Allen and Kirti Thorat. Both Ian and Kirti are part of biweekly sessions we have at Stack Overflow where several software developers join me in practicing R, data science, and modeling skills. This morning, the two of them went to a high school outreach event in NYC for students who have been studying computer science, equipped with Stack Overflow ✨ SWAG ✨, some coding activities based on Stack Overflow internal tools and packages, and a Shiny app that I developed to share a bit about who we are and what we do.

# The game is afoot! Topic modeling of Sherlock Holmes stories

In a recent release of tidytext, we added tidiers and support for building Structural Topic Models from the stm package. This is my current favorite implementation of topic modeling in R, so let’s walk through an example of how to get started with this kind of modeling, using The Adventures of Sherlock Holmes. You can watch along as I demonstrate how to start with the raw text of these short stories, prepare the data, and then implement topic modeling in this video tutorial!

# tidytext 0.1.6

I am pleased to announce that tidytext 0.1.6 is now on CRAN! Most of this release, as well as the 0.1.5 release which I did not blog about, was for maintenance, updates to align with API changes from tidytext’s dependencies, and bug fixes. I just spent a good chunk of effort getting tidytext to pass R CMD check on older versions of R despite the fact that some of the packages in tidytext’s Suggests require recent versions of R.

# Tidy word vectors, take 2!

A few weeks ago, I wrote a post about finding word vectors using tidy data principles, based on an approach outlined by Chris Moody on the StitchFix tech blog. I’ve been pondering how to improve this approach, and whether it would be nice to wrap up some of these functions in a package, so here is an update! Like in my previous post, let’s download half a million posts from the Hacker News corpus using the bigrquery package.

# New sports from random emoji

I love emoji ❤️ and I love xkcd, so this recent comic from Randall Munroe was quite a delight for me. I sat there, enjoying the thought of these new sports like horse hole and multiplayer avocado and I thought, “I can make more of these in just the barest handful of lines of code”. This is largely thanks to the emo package by Hadley Wickham, which if you haven’t installed and started using yet, WHY NOT?

# Word Vectors with tidy data principles

Last week I saw Chris Moody’s post on the Stitch Fix blog about calculating word vectors from a corpus of text using word counts and matrix factorization, and I was so excited! This blog post illustrates how to implement that approach to find word vector representations in R using tidy data principles and sparse matrices. Word vectors, or word embeddings, are typically calculated using neural networks; that is what word2vec is.
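The count-based approach from that post can be sketched in a few lines of base R: count word co-occurrences, convert the counts to pointwise mutual information, and factorize with an SVD. (The post itself uses half a million Hacker News posts and sparse matrices; this toy corpus and the window choice here are just for illustration.)

```r
# Toy sketch of count-based word vectors: co-occurrence counts -> PMI -> SVD.
corpus <- c("cats chase mice", "dogs chase cats", "mice fear cats")
tokens <- strsplit(corpus, " ")

# Count word-word co-occurrences, using each line as the context window
words <- sort(unique(unlist(tokens)))
co <- matrix(0, length(words), length(words), dimnames = list(words, words))
for (doc in tokens) {
  for (w1 in doc) for (w2 in doc) if (w1 != w2) co[w1, w2] <- co[w1, w2] + 1
}

# Pointwise mutual information, floored to avoid log(0), then a truncated SVD
total <- sum(co)
pmi <- log(pmax(co * total / (rowSums(co) %o% colSums(co)), 1e-12))
vecs <- svd(pmi)$u[, 1:2]   # each row is now a 2-dimensional word vector
rownames(vecs) <- words
```

Words that appear in similar contexts end up with similar rows of `vecs`; in the real version, the sparse PMI matrix and a truncated SVD (e.g. via irlba) keep this tractable at corpus scale.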

# From Power Calculations to P-Values: A/B Testing at Stack Overflow

Note: cross-posted with the Stack Overflow blog. If you hang out on Meta Stack Overflow, you may have noticed news from time to time about A/B tests of various features here at Stack Overflow. We use A/B testing to compare a new version to a baseline for a design, a machine learning model, or practically any feature of what we do here at Stack Overflow; these tests are part of our decision-making process.

# Mapping ecosystems of software development

I have a new post on the Stack Overflow blog today about the complex, interrelated ecosystems of software development. On the data team at Stack Overflow, we spend a lot of time and energy thinking about tech ecosystems and how technologies are related to each other. One way to get at this idea of relationships between technologies is tag correlations, how often technology tags at Stack Overflow appear together relative to how often they appear separately.

# tidytext 0.1.4

I am pleased to announce that tidytext 0.1.4 is now on CRAN! This release of our package for text mining using tidy data principles has an excellent collection of delightfulness in it. First off, all the important functions in tidytext now support non-standard evaluation through the tidyeval framework.

```r
library(janeaustenr)
library(tidytext)
library(dplyr)

input_var <- quo(text)
output_var <- quo(word)

data_frame(text = prideprejudice) %>%
  unnest_tokens(!!output_var, !!input_var)

## # A tibble: 122,204 x 1
##    word
##    <chr>
##  1 pride
##  2 and
##  3 prejudice
##  4 by
##  5 jane
##  6 austen
##  7 chapter
##  8 1
##  9 it
## 10 is
## # …
```

# Sentiment analysis using tidy data principles at DataCamp

NOTE: Read more here about why I no longer recommend taking my courses at DataCamp. I’ve been developing a course at DataCamp over the past several months, and I am happy to announce that it is now launched! The course is Sentiment Analysis in R: the Tidy Way and I am excited that it is now available for you to explore and learn from. This course focuses on digging into the emotional and opinion content of text using sentiment analysis, and it does this from the specific perspective of using tools built for handling tidy data.

# Understanding gender roles in movies with text mining

I have a new visual essay up at The Pudding today, using text mining to explore how women are portrayed in film. The R code behind this analysis is publicly available on GitHub. I was so glad to work with the talented Russell Goldenberg and Amber Thomas on this project, and many thanks to Matt Daniels for inviting me to contribute to The Pudding. I’ve been a big fan of their work for a long time!

# Seeking guidance in choosing and evaluating R packages

At useR!2017 in Brussels last month, I contributed to an organized session focused on navigating the 11,000+ packages on CRAN. My collaborators on this session and I recently put together an overall summary of the session and our goals, and now I’d like to talk more about the specific issue of learning about R packages and deciding which ones to use. John and Spencer will write more soon about the two other issues of our focus:

# Navigating the R Package Universe

Earlier this month, I, along with John Nash, Spencer Graves, and Ludovic Vannoorenberghe, organized a session at useR!2017 focused on discovering, learning about, and evaluating R packages. You can check out the recording of the session. There are more than 11,000 packages on CRAN, and R users must approach this abundance of packages with effective strategies to find what they need and choose which packages to invest time in learning how to use.

# Text Mining of Stack Overflow Questions

Note: Cross-posted with the Stack Overflow blog. This week, my fellow Stack Overflow data scientist David Robinson and I are happy to announce the publication of our book Text Mining with R with O’Reilly. We are so excited to see this project out in the world, and so relieved to finally be finished with it! Text data is being generated all the time around us, in healthcare, finance, tech, and beyond; text mining allows us to transform that unstructured text data into real insight that can increase understanding and inform decision-making.

# Using tidycensus and leaflet to map Census data

Recently, I have been following the development and release of Kyle Walker’s tidycensus package. I have been filled with amazement, delight, and well, perhaps another feeling…

> There should be a word for “the regret felt when an R 📦, which would have saved untold hours of your life, is released”… #rstats 🤔 https://t.co/2THN4MwedO
> — Mara Averick (@dataandme) May 31, 2017

But seriously, I have worked with US Census data a lot in the past and this package

# tidytext 0.1.3

I am pleased to announce that tidytext 0.1.3 is now on CRAN! In this release, my collaborator David Robinson and I have fixed a handful of bugs, added tidiers for LDA models from the mallet package, and updated functions for changes to quanteda’s API. You can check out the NEWS for more details on changes. One enhancement in this release is the addition of the Loughran and McDonald sentiment lexicon of words specific to financial reporting.

# Mining CRAN DESCRIPTION Files

Text analysis and more

# Gender Roles with Text Mining and N-grams

Tidy data principles and 19th century novels

# How Do You Discover R Packages?

Take our survey to share your experiences

# Scraping CRAN with rvest

Packages, versions, and archiving

# What Programming Languages Are Used Most on Weekends?

An analysis using StackLite, a Kaggle dataset of Stack Overflow questions and tags

# Women in the 2016 Stack Overflow Survey

An analysis of last year’s survey and a Shiny app

# Text Mining in R: A Tidy Approach

I spoke on approaching text mining tasks using tidy data principles at rstudio::conf yesterday. I was so happy to have the opportunity to speak and the conference has been a great experience. If you want to catch up on what has been going on at rstudio::conf, Karl Broman put together a GitHub repo of slides and Sharon Machlis has been live-blogging the conference at Computerworld. A highlight for me was Andrew Flowers' talk on data journalism and storytelling; I don’t work in data journalism but I think I can apply almost everything he said to how I approach what I do.

# Reddit Responds to the Election

Text mining of one day’s submissions on Reddit

# Measuring Gobbledygook

Readability in text using tidy data principles

# Mapping Election Results in Utah

A three-way race for Utah’s electoral votes

# Tidy Text Mining with R

Introducing a new release of tidytext, and a new book!

# Singing the Bayesian Beginner Blues

Song Lyrics Frequency and Empirical Bayes Estimation

# Song Lyrics Across the United States

Using tidytext and U.S. Census data to analyze song lyrics

# We Are Not Very Evenly Distributed

A Shiny App for Visualizing the U.S. Population

# Something Strange in the Neighborhood

Using Leaflet to Map Ghost Sightings

# Return of the NEISS Data

Injuries Caused by Consumer Products, with a Shiny App

# Fatal Police Shootings Across the U.S.

A flexdashboard exploring the Washington Post’s data set on police shootings

# Term Frequency and tf-idf Using Tidy Data Principles

A new release for the tidytext package

# A Beginner’s Guide to Travis-CI for R

The Blind Leading the Blind

# The Life-Changing Magic of Tidying Text

An R package for text mining using tidy data principles

# How I Learned to Stop Worrying and Love R CMD Check

Resources for a First CRAN Submission

# Who Came to Vote in Utah’s Caucuses?

Mapping Voter Turnout

# I Went to ROpenSci Unconference and All I Got Were These Lousy Hex Stickers

Just Kidding – IT WAS AMAZING

# Trump Losing and Feeling the Bern in Utah

Plotting Utah Caucus Results

# If I Loved Natural Language Processing Less, I Might Be Able to Talk About It More

An R Package for Jane Austen’s Complete Novels

# You Must Allow Me To Tell You How Ardently I Admire and Love Natural Language Processing

Because of Elizabeth Bennet, I Relearned What a Fourier Transform Is

# My Baby Boomer Name Might Have Been “Debbie”

Comparing Baby Names Across Years Based on Popularity, plus a Shiny App

# Your Floor Is the Most Dangerous Thing In Your House

Analyzing Injuries Caused by Consumer Products

# A Tall Drink of Water

Mapping Water Use in Salt Lake City, a Shiny App, and a Census Tract Package

# Death Comes to Us All

Causes of Death, Age Adjusted Mortality, and gganimate

# Connecting Religion and Demographics

Choropleth Maps and Correlations

# More Fun with Choropleth Maps

Exploring Iowa Ahead of the Caucus

# Water World

Exploring and Predicting Water Use in Salt Lake City

# Health Care Indicators in Utah Counties

Correlation Coefficients, a Shiny App, Principal Component Analysis, and Clustering

# This Is the Place, Apparently

Demographics and Choropleth Maps of My Home State

# Joy to the World, and also Anticipation, Disgust, Surprise…

In my previous blog post, I analyzed my Twitter archive and explored some aspects of my tweeting behavior. When do I tweet, how often do I retweet people, do I use hashtags? These are examples of one kind of question, but what about the actual verbal content of my tweets, the text itself? What kinds of questions can we ask and answer about the text in some programmatic way? This is what is called natural language processing, and I’ll give a first shot at it here.

# Ten Thousand Tweets

I started learning the statistical programming language R this past summer, and discovering Hadley Wickham’s data visualization package ggplot2 has been a joy and a revelation. When I think back to how I made all the plots for my astronomy dissertation in the early 2000s (COUGH SUPERMONGO COUGH), I feel a bit in awe of what ggplot2 can do and how easy and, might I even say, delightful it is to use.