# Scraping CRAN with rvest

I am one of the organizers for a session at useR 2017 this coming July that will focus on discovering and learning about R packages. How do R users find packages that meet their needs? Can we make this process easier? As somebody who is relatively new to the R world compared to many, this is a topic that resonates with me and I am happy to be part of the discussion. I am working on this session with John Nash and Spencer Graves, and we hope that some useful discussion and results come out of the session.

In preparation for this session, I wanted to look at the distribution of R packages by date, number of versions, and so on. There were some great plots going around when CRAN passed the 10,000 package mark, but most of the code used to make those plots involves packages and idioms I am less familiar with, so here is an rvest and tidyverse centered version of those analyses!

## Scraping CRAN

The first thing we need to do is get all the packages that are currently available on CRAN. Let’s use rvest to scrape the page that lists all the packages currently on CRAN. It also has some other directories besides packages so we can use filter to remove the things that don’t look like R packages.
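Here is a minimal sketch of what that scraping step could look like. To stay runnable offline, it parses a small made-up HTML snippet standing in for the CRAN listing page; the snippet contents and the column names `package` and `version` are assumptions for illustration, not the real page or the code originally used for this post.

```r
library(rvest)
library(dplyr)
library(stringr)
library(tibble)

# Made-up stand-in for the CRAN package listing page (illustration only).
# For the live page, a URL would be passed to read_html() instead.
listing_html <- '<html><body>
<a href="?C=N;O=D">Name</a>
<a href="/src/">Parent Directory</a>
<a href="A3_1.0.0.tar.gz">A3_1.0.0.tar.gz</a>
<a href="Archive/">Archive/</a>
<a href="PACKAGES.gz">PACKAGES.gz</a>
<a href="abc_2.1.tar.gz">abc_2.1.tar.gz</a>
</body></html>'

# Pull the text of every link on the page.
links <- read_html(listing_html) %>%
  html_nodes("a") %>%
  html_text()

# Keep only links that look like package tarballs, then split
# "name_version.tar.gz" into its two parts.
pkgs <- tibble(link = links) %>%
  filter(str_detect(link, "\\.tar\\.gz$")) %>%
  mutate(
    package = str_extract(link, "^[^_]+"),
    version = str_match(link, "_(.+)\\.tar\\.gz$")[, 2]
  ) %>%
  select(package, version)

pkgs
```

The filter on the `.tar.gz` suffix is what drops the sort links, the parent directory, and the other non-package entries.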

So those are the currently available packages!

Now let’s turn to the archive. Let’s do a similar operation.
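A similar sketch for the archive, assuming the archive index lists one subdirectory per package (links ending in `/`). Again the HTML is a made-up stand-in so the example runs without a network connection.

```r
library(rvest)
library(dplyr)
library(stringr)
library(tibble)

# Made-up stand-in for the archive index page (illustration only).
archive_html <- '<html><body>
<a href="?C=N;O=D">Name</a>
<a href="/src/contrib/">Parent Directory</a>
<a href="A3/">A3/</a>
<a href="README">README</a>
<a href="zoo/">zoo/</a>
</body></html>'

archive_links <- read_html(archive_html) %>%
  html_nodes("a") %>%
  html_text()

# Keep only entries whose link text ends in "/", i.e. package directories,
# and strip the trailing slash to get the package name.
archives <- tibble(link = archive_links) %>%
  filter(str_detect(link, "/$")) %>%
  mutate(package = str_replace(link, "/$", "")) %>%
  select(package)

archives
```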

That is good, but now we need to get more detailed information for packages that have been archived at least once to get the date they originally were released and how many versions they have had.

## Visiting every page in the archive

Let’s set up a function for scraping an individual page for a package and apply that to every page in the archive. This step takes A WHILE because it queries a web page for every package in the CRAN archive. I’ve set this up with map from purrr; it is one of my favorite ways to organize tasks these days.
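As a rough sketch of that pattern: a small function reads one page and returns its table, and `map` applies it to every package, storing the results in a list-column. The helper name `read_page`, the `src` column, and the two tiny HTML pages are hypothetical; in a live run `src` would hold each package's archive URL, since `read_html` accepts either a URL or literal HTML.

```r
library(rvest)
library(dplyr)
library(purrr)
library(tibble)

# Two made-up archive pages, keyed by package name (illustration only).
pages <- c(
  A3 = '<html><body><table>
<tr><th>Name</th><th>Last modified</th><th>Size</th></tr>
<tr><td>A3_0.9.1.tar.gz</td><td>2013-02-07 09:00</td><td>45K</td></tr>
<tr><td>A3_0.9.2.tar.gz</td><td>2013-03-26 18:58</td><td>45K</td></tr>
</table></body></html>',
  abc = '<html><body><table>
<tr><th>Name</th><th>Last modified</th><th>Size</th></tr>
<tr><td>abc_1.0.tar.gz</td><td>2010-10-05 11:04</td><td>3.0M</td></tr>
</table></body></html>'
)

# Hypothetical helper: read one page and return its first table.
read_page <- function(src) {
  read_html(src) %>%
    html_node("table") %>%
    html_table()
}

# One row per package; `page` is a list-column of data frames.
archives <- tibble(package = names(pages), src = unname(pages)) %>%
  mutate(page = map(src, read_page))

archives$page[[1]]
```

Keeping each scraped table next to its package name in a list-column makes the later extraction steps simple `map` calls over `page`.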

What do these pages look like?

This is exactly what we need: the dates that the packages were released and how many times they have been released. Let’s use mutate and map again to extract these values.
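A sketch of that extraction, assuming each scraped table has `Name` and `Last modified` columns with timestamps like `2013-02-07 09:00`. The helper names and the toy data are made up for illustration. Because the timestamps are in year-month-day order, the smallest string is also the earliest date.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)
library(lubridate)

# Toy stand-in for the scraped list-column (illustration only).
archives <- tibble(
  package = c("A3", "abc"),
  page = list(
    tibble(
      Name = c("Parent Directory", "A3_0.9.1.tar.gz", "A3_0.9.2.tar.gz"),
      `Last modified` = c("", "2013-02-07 09:00", "2013-03-26 18:58")
    ),
    tibble(
      Name = c("abc_1.0.tar.gz", "abc_1.1.tar.gz", "abc_1.2.tar.gz"),
      `Last modified` = c("2010-10-05 11:04", "2011-05-10 17:13", "2012-08-14 15:10")
    )
  )
)

# Hypothetical helpers: only rows that are tarballs count as releases.
is_release <- function(page) str_detect(page[["Name"]], "\\.tar\\.gz$")
first_stamp <- function(page) min(page[["Last modified"]][is_release(page)])

archives <- archives %>%
  mutate(
    # Earliest timestamp per package, parsed to a Date.
    first_release = as_date(ymd_hm(map_chr(page, first_stamp))),
    # Number of tarballs sitting in the archive for each package.
    n_archived = map_int(page, ~ sum(is_release(.x)))
  )

select(archives, package, first_release, n_archived)
```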

## Putting it together

Now it’s time to join the data from the currently available packages and the archives.

• Packages that are in archives but not pkgs are no longer on CRAN.
• Packages that are in pkgs but not archives only have one CRAN release.
• Packages that are in both dataframes have had more than one CRAN release.

Sounds like a good time to use anti_join and inner_join.
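The three cases above can be sketched with toy data like this. The column names `n_archived`, `status`, and `releases` are illustrative, and the release counts assume that a package still on CRAN has one current version on top of whatever is in its archive.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for the two data frames (illustration only).
pkgs <- tibble(package = c("A3", "abc", "newpkg"))
archives <- tibble(
  package = c("A3", "abc", "oldpkg"),
  n_archived = c(2L, 3L, 1L)
)

# In archives but not pkgs: no longer on CRAN.
no_longer_on_cran <- anti_join(archives, pkgs, by = "package")

# In pkgs but not archives: only one CRAN release.
single_release <- anti_join(pkgs, archives, by = "package")

# In both: more than one CRAN release.
multiple_releases <- inner_join(pkgs, archives, by = "package")

# Stack the three groups into one data frame with a release count.
all_pkgs <- bind_rows(
  mutate(no_longer_on_cran, status = "archived", releases = n_archived),
  mutate(single_release, status = "available", releases = 1L),
  mutate(multiple_releases, status = "available", releases = n_archived + 1L)
)

all_pkgs
```

`anti_join` keeps the rows of its first argument that have no match in the second, so swapping the argument order is what distinguishes the first two cases.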

## Plotting results

Let’s look at some results now.
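One way to plot the growth in packages over time is a cumulative count by first release date. This is only a sketch with made-up dates; the `first_release` column name is an assumption.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Made-up first-release dates (illustration only).
all_pkgs <- tibble(
  package = c("a", "b", "c", "d", "e"),
  first_release = as.Date(c(
    "2005-03-01", "2009-07-15", "2012-01-20", "2012-11-02", "2016-06-30"
  ))
)

# Order by first release and number the rows to get a running total.
cumulative <- all_pkgs %>%
  arrange(first_release) %>%
  mutate(n_packages = row_number())

p <- ggplot(cumulative, aes(first_release, n_packages)) +
  geom_line() +
  labs(x = "Date of first release", y = "Cumulative number of packages")

p
```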

There we go! That is similar to the results we all saw going around when CRAN passed 10,000 packages, which is good.

What about the number of archived vs. available packages?
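A sketch of that comparison, assuming a `status` column that marks each package as archived or available (toy data, illustrative names):

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy data (illustration only).
all_pkgs <- tibble(
  package = c("a", "b", "c", "d", "e"),
  status = c("available", "available", "archived", "available", "archived")
)

# One row per status with the number of packages in column `n`.
status_counts <- count(all_pkgs, status)

p <- ggplot(status_counts, aes(status, n)) +
  geom_col() +
  labs(x = NULL, y = "Number of packages")

p
```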

And lastly, let’s look at the distribution of number of releases for each package.
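A histogram with one bin per release count is a simple way to look at this. The sketch below uses made-up numbers and an assumed `releases` column.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy release counts (illustration only).
all_pkgs <- tibble(
  package = c("a", "b", "c", "d", "e", "f"),
  releases = c(1L, 1L, 2L, 5L, 1L, 3L)
)

# Tabulate how many packages have each number of releases.
release_counts <- count(all_pkgs, releases)

# binwidth = 1 gives one bar per integer number of releases.
p <- ggplot(all_pkgs, aes(releases)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Number of releases", y = "Number of packages")

p
```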

## The End

It is pretty ironic that I worked on this code and wrote this post because I wanted to do an analysis using different packages than the ones used in the original scripts shared. That is exactly part of the challenge facing all of us as R users now that there is such a diversity of tools out there! I hope that our session at useR this summer provides some clarity and perspective for attendees on these types of issues. The R Markdown file used to make this blog post is available here. Bob Rudis has let me know that there are easier ways to get the data that I used for these plots, and I am very happy to hear about that or other feedback and questions!