Covid19-related blog posts and the R packages they use

As my first write-up explains, I collect blog posts about Covid19 that use R to help people who want to analyze coronavirus data with R. This follow-on article looks at the R packages used by the bloggers in my ever-growing collection, so that researchers can learn from the code of others.

You can read more about the data set and its history in the first post, which focuses on dates of first publication, the jobs of the bloggers, and the countries where they live. These findings will be updated periodically as more qualifying posts come to light.

Here we look at the R packages used.

For those interested in the sausage-making, I laboriously copied each of the blog posts into separate Word files and reviewed them for R packages that the authors said that they used.

It turns out that some of the authors do not include code, or they direct the reader to a repository such as GitHub. Where the code was available, I copied it (or portions of it) into the Word file. Even when the code is public, it is not uniform: some programmers install packages in a group, so I had to insert “library(” before the package names; some use require(); and some do neither but use the double-colon operator, e.g., ggplot2::annotate(). I untangled and standardized all these irregularities so that a regex statement could extract the R packages.
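For readers curious what such a regex extraction might look like, here is a minimal sketch in base R. It is an illustration under stated assumptions, not the exact statement I ran: the `code` vector is a made-up stand-in for the text copied from one post, and the helper name `extract_group` is hypothetical. The sketch covers library(), require(), and the double-colon form, and it does not handle packages installed in a group.

```r
# Hypothetical lines standing in for code copied from one blog post
code <- c(
  "library(dplyr)",
  "require(zoo)",
  "library(\"lubridate\")",
  "p <- p + ggplot2::annotate(\"text\", x = 1, y = 1, label = \"a\")"
)

# Package named inside library( or require(, with an optional opening quote
load_pat <- "(?:library|require)\\(\\s*[\"']?([A-Za-z][A-Za-z0-9.]*)"
# Package named in front of the double-colon operator
colon_pat <- "([A-Za-z][A-Za-z0-9.]*)::"

# Hypothetical helper: find every match of `pattern`, then keep only
# the first capture group (the package name) from each match
extract_group <- function(pattern, lines) {
  hits <- unlist(regmatches(lines, gregexpr(pattern, lines, perl = TRUE)))
  sub(pattern, "\\1", hits, perl = TRUE)
}

# Combine both sources, drop duplicates, and sort alphabetically
packages <- sort(unique(c(extract_group(load_pat, code),
                          extract_group(colon_pat, code))))
print(packages)
```

Keeping the two patterns separate makes it easy to see which convention a given blogger followed, at the cost of scanning the text twice.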

My next step was to categorize each package into one of eleven broad groups. The groups appear in the legends of the plots below. The plots omit the tidyverse and ggplot2 packages because they were ubiquitous.

Of the 193 packages identified in the 127 blog posts collected as of a few days ago (quite a few of those posts do not provide code), 92 packages (48%) were used by only a single blogger. Another 45 packages were used by just two bloggers. The plot below shows the remaining packages, starting at three uses: each package name is stacked above the number of times bloggers used it, with names listed alphabetically up from the x-axis and spaced one unit apart on the y-axis. The height of each column of names therefore indicates how many packages were used that many times. For example, the zoo package (used for rolling averages) appears five times, in a column with a dozen other packages that also appeared five times.
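The tallying and stacking described above can be sketched in a few lines of base R. The `usage` data frame below is invented purely for illustration, and counting a package only once per post is an assumption of this sketch rather than a documented rule of the data set. The last line simply rechecks the 48% figure from the counts given in the text.

```r
# Hypothetical data: one row per (blog post, package) mention
usage <- data.frame(
  post    = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  package = c("zoo", "zoo", "lubridate", "zoo", "sf",
              "zoo", "lubridate", "sf", "plotly"),
  stringsAsFactors = FALSE
)

# Assumption: a package counts once per post, however often it is mentioned
usage <- unique(usage)

# Number of posts that used each package
per_package <- table(usage$package)
counts <- data.frame(package = names(per_package),
                     n = as.integer(per_package),
                     stringsAsFactors = FALSE)

# Sort by count, then alphabetically, and stack names one unit apart:
# y is the position of each package within its count column
counts <- counts[order(counts$n, counts$package), ]
counts$y <- ave(seq_along(counts$n), counts$n, FUN = seq_along)
print(counts)

# Share of packages used by a single blogger, from the figures in the text
share_single <- round(100 * 92 / 193)
print(share_single)
```

Plotting `n` on the x-axis and `y` on the y-axis, with the package name as a text label, would give a column of names whose height equals the number of packages sharing that count.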

Please write me at Rees(at)ReesMorrison(dot)com if you have corrections to the data set or know other blog posts that ought to be included in it.

An enthusiast of the four genre who likes to write (and use R software)