Word and Character Count of Blog Posts on Covid19, with R

Rees Morrison
Aug 27, 2020

Continuing my series on blog posts that combine COVID-19 data and R, this piece looks at the length of the posts. The length of a blog post, as measured by the number of characters in it, gives an indication of how deeply the post delves into its topic. Someone looking for guidance and ideas might plausibly assume that the more words the blogger writes about a topic, the better the post explores it. If a post explains the data sources it uses or sophisticated mathematical techniques, even more length would be expected.

Let’s look at the number of characters in the 215 posts collected so far in this series.

Using the base R function nchar, we found a median of 9,339 characters for the posts. The first quartile is 5,404 characters and the third quartile is 17,403. Because nchar counts punctuation and spaces, discounting both suggests that the median post contains roughly 1,500 words (the average English word is about 4.7 characters long). Alternatively, at a rule of thumb of approximately 143 words per thousand characters, the median post would have approximately 1,335 words. The tidytext package would generate yet another word count, but for this article the broader description is sufficient. See the technical note at the end if the vagaries of word counts pique your interest.
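The summary figures above can be reproduced along the following lines. This is only a minimal sketch, not the code behind the article: the posts vector is a hypothetical stand-in for the collected post texts, chars_to_words is a helper invented for illustration, and the 143-words-per-1,000-characters factor is the rule of thumb quoted above.

```r
# Hypothetical stand-in for the collected posts: one string per post.
posts <- c(
  "R makes it easy to chart COVID-19 case counts.",
  "This post fits an SIR model to the reported data and plots the result.",
  "A short note."
)

# nchar() counts every character, including spaces and punctuation.
char_counts <- nchar(posts)

# First quartile, median, and third quartile of the character counts.
char_summary <- quantile(char_counts, probs = c(0.25, 0.5, 0.75))
print(char_summary)

# Rough conversion from characters to words (assumed rule of thumb:
# about 143 words per 1,000 characters). Hypothetical helper.
chars_to_words <- function(n_chars, words_per_thousand = 143) {
  n_chars * words_per_thousand / 1000
}

# Applied to the article's median of 9,339 characters.
print(round(chars_to_words(9339)))  # 1335
```

With the real data, posts would hold the full text of each of the 215 posts, and char_summary would return the three quartile figures reported above.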

To portray the distribution of characters, here are four plots generated with the function InspectVariable from the DataVisualizations package.

In the upper left, a histogram shows that the bulk of the posts have 20,000 characters or fewer.

The plot on the right shows a Pareto Density Estimation (PDE) for the number of characters. According to the package, “PDE consists of a kernel density estimator representing the relative likelihood of a given continuous random data. The parameters of the kernels are auto-adopted to the data using an information theoretic optimum on skewed distributions.” For more on PDE, see this article.

The lower left plot shows a Normal Q-Q plot. A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another, as described here. If both sets of quantiles came from the same distribution, the points should form a line that’s roughly straight. Here the theoretical normal distribution is on the x-axis and the distribution of numbers of characters is on the y-axis. The points bulge downward, which suggests that the character counts are skewed.

The lower right plot is a box plot, which shows the distribution of characters in yet another way. Quite a few lengthy posts stand out as outliers. To the right of the box plot is a plot that shows we have no missing data in the character counts.
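For readers who want to try similar views without installing DataVisualizations, here is a rough base R analogue of the four panels. It is a sketch under stated assumptions, not the InspectVariable output: the character counts are simulated (hypothetical, right-skewed data), and the density panel uses base R's default Gaussian kernel density rather than Pareto Density Estimation.

```r
# Hypothetical, right-skewed character counts standing in for the real data.
set.seed(42)
char_counts <- round(rlnorm(215, meanlog = log(9000), sdlog = 0.8))

# Draw to a null device so the sketch also runs non-interactively;
# drop the pdf(NULL) and dev.off() lines to see the plots on screen.
pdf(NULL)
old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid of panels

hist(char_counts, main = "Histogram", xlab = "Characters per post")
plot(density(char_counts), main = "Kernel density (Gaussian, not PDE)")
qqnorm(char_counts, main = "Normal Q-Q plot")
qqline(char_counts)              # reference line through the quartiles
boxplot(char_counts, horizontal = TRUE, main = "Box plot")

par(old_par)
invisible(dev.off())

# Points the box plot flags as outliers (beyond 1.5 * IQR from the box).
outliers <- boxplot.stats(char_counts)$out
print(length(outliers))

# Number of missing values, the base R counterpart of the missing-data panel.
print(sum(is.na(char_counts)))
```

On skewed data like this, the Q-Q points curve away from the reference line and the box plot flags the long posts as outliers, which is the pattern described above.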

One might surmise that the length of a post corresponds roughly to the occupation of the blogger: professors and other academics probably wax loquacious more often than do corporate researchers. To test our hypothesis, the graphic that follows displays the number of characters in blog posts according to the blogger’s core role on the y-axis. Indeed, academics write much longer posts at times, as several of their dots extend far to the right. As the subtitle states, the red diamonds indicate the median number of characters in the posts for each core role.
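A comparison of this kind can be sketched in base R as follows. The data frame is hypothetical (the role labels and counts are invented for illustration), and the plot is a plain strip chart, not the graphic used in the article.

```r
# Hypothetical data: one row per post, with the blogger's core role.
posts <- data.frame(
  role  = c("Academic", "Academic", "Academic",
            "Corporate", "Corporate", "Student"),
  chars = c(21000, 9500, 14000, 6000, 8000, 5000)
)

# Median number of characters per core role.
role_medians <- tapply(posts$chars, posts$role, median)
print(role_medians)

# Strip chart: characters on the x-axis, one row of dots per role on the y-axis.
pdf(NULL)  # null device; drop this line and dev.off() to plot on screen
stripchart(chars ~ role, data = posts, method = "jitter", pch = 16,
           xlab = "Characters per post")

# Overlay each role's median as a red diamond on its own row.
points(role_medians, seq_along(role_medians), pch = 18, col = "red", cex = 2)
invisible(dev.off())
```

Both tapply and stripchart order the roles alphabetically, which is why the medians line up with the rows of dots.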

Adherents of fan plots, such as the one below, maintain that they are a better alternative to pie charts. Like pie charts, they represent the relative sizes of values. The traditional pie chart is difficult for most people to interpret, because people are not well trained to judge and compare angles. This fan plot was generated by the fan.plot function of the plotrix package. It shows that the longer-winded bloggers, the academics, also account for slightly more than half the bloggers in this set.
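A fan plot along these lines can be sketched as below. The counts per role are hypothetical, chosen only so that academics make up slightly more than half, and the call assumes that the plotrix function is fan.plot with its labels and main arguments. The plot is drawn only if plotrix is installed.

```r
# Hypothetical counts of bloggers per core role (not the article's data).
role_counts <- c(Academic = 6, Corporate = 3, Other = 2)

# Share of bloggers in each role.
role_share <- role_counts / sum(role_counts)
print(round(role_share, 2))

# Draw the fan plot only when the plotrix package is available.
if (requireNamespace("plotrix", quietly = TRUE)) {
  pdf(NULL)  # null device; drop this line and dev.off() to plot on screen
  plotrix::fan.plot(role_counts,
                    labels = names(role_counts),
                    main = "Bloggers by core role")
  invisible(dev.off())
}
```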

In closing, this data set keeps growing and still needs some wrangling. A few posts collected early on contain comments to the post, which should not be included in the number of characters or words. Also, for quite a few posts, I copied into the Word file the R code saved by the blogger on GitHub. Where a post has extensive R code, that drives up the character count, but adds a different textual dimension than prose explaining a topic.

Technical Note on Word Counts: We looked at the first blog post that we have discovered in English that addresses both COVID-19 and R. Holger Von Jouanne-Diedrich published on Feb. 4, 2020 on his Learning Machines blog a piece entitled “Epidemiology: How contagious is Novel Coronavirus [2019-nCoV]?”. When the post is read from Microsoft Word into a data frame, R’s nchar function counts 6,386 characters (spaces and punctuation included; about 915 words at 143 per 1,000 characters). According to Microsoft Word, the post has 1,112 words. The unnest_tokens function of the tidytext package finds 1,358 “words”, because it counts individual numbers as words, as well as R code and code line numbers. It is beyond the scope or interest of this article to reconcile the three “word” counts.
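To see why word counts diverge, it helps to tokenize the same text under different rules. The sketch below uses base R only and a hypothetical snippet that mixes prose, a number, and a line of R code. It does not reproduce Microsoft Word's counter or tidytext's unnest_tokens; it only illustrates that the definition of a "word" changes the total.

```r
# Hypothetical snippet mixing prose, a number, and a line of R code.
text <- "R0 is about 2.2, so cases double quickly. x <- c(1, 2, 3)"

# Count 1: characters, including spaces and punctuation.
n_chars <- nchar(text)

# Count 2: whitespace-delimited tokens ("2.2," and "<-" each count once).
ws_tokens <- strsplit(text, "\\s+")[[1]]

# Count 3: runs of letters and digits (punctuation splits tokens,
# so "2.2" becomes two tokens and "<-" disappears).
alnum_tokens <- regmatches(text, gregexpr("[[:alnum:]]+", text))[[1]]

print(n_chars)               # 57
print(length(ws_tokens))     # 13
print(length(alnum_tokens))  # 14
```

Even on one short line the two token rules disagree, and the gap widens in posts that contain long stretches of code and numbers.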
