Data sources in Covid19 posts that use R, and mathematical techniques used in the posts

Jul 8, 2020

This post builds on my two previous posts about blog posts that address the Covid19 pandemic with R. The first post describes my project and provides background information about the bloggers, their roles, and their countries; the second pulls together the frequencies of the R packages used in the posts. Here, I identify the mathematical techniques that the bloggers used.

Identifying the math techniques covered by this post was, without a doubt, quite subjective. Sometimes the text highlighted the mathematical underpinnings, while other times the code indicated them. Quite a few of the posts do not employ any particular mathematics of note, and I probably missed or mischaracterized some of the techniques that were put to work.

In any case, the preliminary compilation laid out in the table below aims to help researchers and analysts who have a particular interest in a mathematical technique find posts that make use of it. If anyone would like to learn more about a blog that uses a specific technique, please contact me.

Let’s turn from math to data sets. The more data sources the R community knows for purposes of exploring the coronavirus pandemic, the better their insights will be. Perhaps data mashups will shed new light; perhaps inconsistencies between data sets will yield improvements; perhaps better indices can be constructed. For that reason, as with mathematical techniques, I extracted from the ever-growing collection of qualifying blog posts information about their data sources.

For the most part, the bloggers identified the data source or sources they drew on for their analyses. However, as you can imagine, the formats and names given for those references vary widely. As I read the blog posts, I tried both to identify the sources and to standardize their names. More of that clean-up continued in R as I wrestled the information into a form that could be cleanly reported. Even so, I did not identify all the U.S. sources consistently. For example, the USDA should be called the U.S. Department of Agriculture.
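A minimal sketch of that kind of name clean-up in base R follows. The raw labels, the lookup table, and every variable and function name are hypothetical, made up for illustration; this is not the script behind the table, only one plausible way to standardize labels and count them.

```r
# Hypothetical raw labels, as they might be jotted down while reading posts
raw_sources <- c("JHU", "Johns Hopkins University", "johns hopkins ",
                 "USDA", "U.S. Department of Agriculture", "ECDC")

# Hypothetical lookup table: lower-case variant -> standardized name
lookup <- c(
  "jhu"                            = "Johns Hopkins University",
  "johns hopkins"                  = "Johns Hopkins University",
  "johns hopkins university"       = "Johns Hopkins University",
  "usda"                           = "U.S. Department of Agriculture",
  "u.s. department of agriculture" = "U.S. Department of Agriculture"
)

# Map each label to its standardized name; labels without a
# mapping are kept as typed (minus surrounding whitespace)
standardize_source <- function(x, lookup) {
  key <- tolower(trimws(x))
  std <- unname(lookup[key])
  ifelse(is.na(std), trimws(x), std)
}

clean_sources <- standardize_source(raw_sources, lookup)

# Frequency of each standardized source, most common first
source_counts <- sort(table(clean_sources), decreasing = TRUE)
print(source_counts)
```

A lookup table like this keeps every naming decision in one visible place, so an inconsistency such as the USDA label can be fixed by adding a single entry rather than by editing the raw notes.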

To be honest, the table below represents a first cut at the results. One takeaway is the notable absence of the U.S. CDC. The other is the dominance of Johns Hopkins, which early, comprehensively, and consistently has set the standard for Covid19 data collection and dissemination to the public.

(As a side note, preparing these tables gave me an opportunity to try out the gt package.)


Written by Rees Morrison