COVID-19 and R Coding Terms in Blog Posts, by Topic

Rees Morrison
Jul 29, 2020

Summary: In a collection of blog posts that use R code and refer to COVID-19, the percentages of each post’s words drawn from R and COVID-19 vocabularies don’t yield clear insights, but the direction of the text mining effort may have promise.

This article aims to expand the analysis of blog posts that use R to explore COVID-19 data. Previous articles address the timing of posts since February, the R packages employed, the math used, as well as the countries of the bloggers and their work roles.

With 205 posts collected so far, this article draws on text-mining tools to consider the posts and their use of two vocabularies. The first vocabulary collects common (and distinctive) terms used by R programmers in their code (“aes”, “char”, “col”, “dbl”, “dplyr”, “element_blank”, “element_text”, “filter”, “function”, “geom_col”, “geom_line”, “geom_point”, “ggplot”, “ggplot2”, “group_by”, “hjust”, “ifelse”, “labs”, “library”, “max”, “na”, “nrow”, “package”, “parms”, “paste”, “paste0”, “plot”, “subtitle”, “theme”, “tidyverse”). Not having found a study of the frequency of terms used by the R community, I fashioned my own — accepting the inherent subjectivity and methodological challenges that entails.

The second vocabulary brings in the distinctive terminology of COVID-19. Starting with a set of such words that Prof. Kieran Healy put together, I added more as I read the posts. Drawing on nothing other than my personal sense of how common the terms are in ordinary use, I categorized each one as “basic” (38 of the words), “intermediate” (26 words), or “advanced” (11); in this article, those loose distinctions don’t matter as I combined all 75 terms together.

Here are the “basic” terms as they currently stand: “bodies”, “cases”, “contagion”, “corona”, “COVID”, “Covid19”, “covid-19”, “detection”, “death”, “died”, “doctor”, “emergency”, “epidemic”, “exposed”, “health”, “hospital”, “immune”, “immunity”, “infect”, “intensive care”, “liquid”, “lockdown”, “mask”, “nurse”, “pandemic”, “recovered”, “reproduction”, “sanitizers”, “shelter”, “social distance”, “spread”, “stay at home”, “swab”, “test”, “tracing”, “transmission”, “vaccine”, “virus”.

Using the tidytext package, I unnested the individual tokens in each blog post and dropped common stop words. Then I calculated the percentage of R terms and combined COVID-19 terms in each post.
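The calculation described above can be sketched roughly as follows. This is a minimal Python illustration rather than the tidytext code behind the article: the two vocabularies are cut down to a handful of terms, the stop-word list is a tiny stand-in, and the names `vocabulary_shares`, `R_TERMS`, `COVID_TERMS`, and the sample post are all hypothetical.

```python
import re

# Hypothetical miniature vocabularies; the article's real lists are much longer.
R_TERMS = {"ggplot", "dplyr", "library", "geom_line", "filter"}
COVID_TERMS = {"cases", "pandemic", "virus", "lockdown", "death"}
# A tiny stand-in for a stop-word list (an assumption, not the list tidytext ships).
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "by", "with", "i"}


def tokenize(text):
    """Lowercase the text and split it into tokens of letters, digits, and underscores."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def vocabulary_shares(text):
    """Return the percentage of non-stop-word tokens that fall in each vocabulary."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    total = len(tokens)
    if total == 0:
        return {"r_pct": 0.0, "covid_pct": 0.0}
    r_hits = sum(t in R_TERMS for t in tokens)
    covid_hits = sum(t in COVID_TERMS for t in tokens)
    return {"r_pct": 100 * r_hits / total, "covid_pct": 100 * covid_hits / total}


# A made-up sentence standing in for one blog post.
post = "I used library ggplot and dplyr to plot the cases of the virus by country"
shares = vocabulary_shares(post)
print(shares)
```

One limitation of single-word tokens: multi-word terms such as “intensive care” or “stay at home” never match, and a hyphenated term such as “covid-19” is split in two by this tokenizer. A fuller version would need phrase matching for those.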

The scatter plot shows each post according to its percentages for the two vocabularies. For example, on the far right one post devoted a little under 0.2% of its words to R vocabulary and even less than that to COVID-19 terms (close to 0.04%). The smoothed regression line shows that posts with more R terms also tended to use more COVID-19 terms, both measured as a percentage of total words. It may be that the code chunks embedded in the posts (or copied in by me from GitHub repositories) swell the percentage of R terms unjustifiably.

Do those percentages of the two special vocabularies correspond to the primary subjects of the blog posts? An earlier article in this series explains my assignment of a primary topic to each blog post. The second plot adds that factor by shaping each point according to the post’s primary topic. To help with legibility, this plot drops four topics that had few posts (CompSci (5), Economics (7), Sociology (7) and Text Mining (2)).

About the only apparent pattern is that epidemiology posts use a higher percentage of words from the vocabulary for COVID-19.

In retrospect and to be blunt, this effort fell far short of usefulness. The set of distinctive terms used by R programmers needs much more consideration and precision. Also, an analysis such as this would need to address R terms such as “code”, “false”, “mutate”, “scale”, and “true” that also have real-life uses; I excluded them here. Similarly, the pandemic-related terms need to be either stemmed or lemmatized so that variants of each term are identified more accurately. Plus, the set almost certainly overlooks some vocabulary of the pandemic (“antibody” and “pathogen” come to mind as I write this article).
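To illustrate why stemming would matter, here is a rough Python sketch, not anything used in the analysis above. It strips one common English suffix from each word, which is a crude stand-in for an established stemmer or lemmatizer. The suffix list, the function names, the four-term vocabulary, and the sample sentence are all assumptions made for the example.

```python
import re

# Suffixes checked in order, longest first. This is a crude approximation,
# not the Porter algorithm or any published stemmer.
SUFFIXES = ("ions", "ion", "ing", "ed", "es", "s", "e")


def crude_stem(word):
    """Strip at most one suffix, leaving a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A hypothetical subset of the "basic" COVID-19 terms.
COVID_TERMS = {"infect", "vaccine", "mask", "test"}
STEMMED_TERMS = {crude_stem(term) for term in COVID_TERMS}


def count_exact(text):
    """Count tokens that equal a vocabulary term exactly."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t in COVID_TERMS for t in tokens)


def count_stemmed(text):
    """Count tokens whose crude stem matches the stem of a vocabulary term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(crude_stem(t) in STEMMED_TERMS for t in tokens)


# A made-up sentence full of inflected forms.
sample = "Infected patients were tested; infections fell after masks and vaccines arrived"
exact = count_exact(sample)
stemmed = count_stemmed(sample)
print(exact, stemmed)
```

In this sample, exact matching finds none of the inflected forms, while the stemmed comparison catches “infected”, “tested”, “infections”, “masks”, and “vaccines”. A real stemmer would handle many more cases, and a lemmatizer would return dictionary forms rather than truncated stems.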

Exacerbating the semantic ambiguity or incompleteness of this analysis, the assignment of primary topics introduces a considerable degree of subjectivity. Disappointingly, when you put it all together, it’s hard to conclude much at all from the effort.

Still, the aspirations of this work-in-process have encouraged me to write about it and perhaps someone will suggest a more fruitful and tighter analysis.
