Analyze My Writing

Wikipedia vs. Project Gutenberg

In this snippet we reuse some older data to compare the lexical density of Wikipedia articles
(informative writing) and Project Gutenberg e-texts (general prose).

Using Wikipedia's random article feature,
we fed a random sample of 150 entire Wikipedia articles (separately, of course) into our homepage
to determine the lexical density of each article.
This yielded an average lexical density (as estimated by this website) of 56.46%.
The distribution is shown below.

Figure generated at stats.blue.
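
To make the procedure above concrete, here is a minimal sketch of one way to estimate lexical density and average it over a sample. It is not the method used by this website: the tiny `FUNCTION_WORDS` list, the tokenizer, and the function names are all assumptions made for illustration. It simply treats every word outside the list as a lexical (content) word.

```python
import re

# ASSUMPTION: a deliberately tiny, hypothetical list of function words.
# A real analyzer would use a part-of-speech tagger or a much longer list.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to",
    "for", "with", "by", "from", "is", "are", "was", "were", "be", "been",
    "it", "this", "that", "he", "she", "they", "we", "you", "i", "not", "as",
}


def lexical_density(text):
    """Return the percentage of words that are not in FUNCTION_WORDS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no words")
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return 100.0 * len(lexical) / len(tokens)


def mean_density(texts):
    """Average the per-text densities (each text is scored separately)."""
    densities = [lexical_density(t) for t in texts]
    return sum(densities) / len(densities)
```

Each article is scored on its own and the scores are then averaged, which mirrors feeding the articles in separately rather than pooling their words.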


Next, we analyzed a random sample of 70 "first chapters" of e-texts from Project Gutenberg.
Our sample was taken by drawing random e-books from
Project Gutenberg.
For each text, the webpage was refreshed and the first English-language text to appear in the random list
was chosen. The first chapter (or other natural text subdivision) was then analyzed
for lexical density using our homepage.

We adhered to the following practices when taking the sample:

1) Publications such as lists, recipe books, and poetry were not considered.

2) Texts for which an author could not be attributed were also not considered
(for example, folk tales of unknown origin and authorship).

3) Short stories (which we defined to be 20,000 words or fewer) were considered in their entirety,
regardless of whether or not the text was subdivided into chapters or otherwise.

4) Prefaces and other text preceding the main body of work were not considered.

5) If a novel or novella-length text had no clear subdividing structure, we did not consider the text.

6) Texts which were English translations were not considered.
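
The rules above can be sketched as a simple eligibility filter. This is only an illustration: the sample was selected by hand, and the `Candidate` record, its field names, and the genre labels are hypothetical. Rule 4 (skipping prefaces) is assumed to be applied when the text itself is extracted, so it does not appear here.

```python
from dataclasses import dataclass
from typing import Optional

SHORT_STORY_MAX_WORDS = 20_000                   # threshold from rule 3
EXCLUDED_GENRES = {"list", "recipes", "poetry"}  # hypothetical labels for rule 1


@dataclass
class Candidate:
    """Hypothetical metadata for one randomly drawn e-text."""
    language: str
    genre: str
    author: Optional[str]
    is_translation: bool
    word_count: int
    has_subdivisions: bool


def is_eligible(c):
    """Apply rules 1, 2, 5, and 6, plus the English-language requirement."""
    if c.language != "English":
        return False
    if c.genre in EXCLUDED_GENRES:       # rule 1
        return False
    if not c.author:                     # rule 2
        return False
    if c.is_translation:                 # rule 6
        return False
    is_short = c.word_count <= SHORT_STORY_MAX_WORDS
    if not is_short and not c.has_subdivisions:  # rule 5
        return False
    return True


def portion_to_analyze(c):
    """Rule 3: short stories are read whole; otherwise use the first chapter."""
    if c.word_count <= SHORT_STORY_MAX_WORDS:
        return "entire text"
    return "first chapter"
```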

This yielded an average lexical density (as estimated by this website) of 49.03%.
The results are shown below.

Figure generated at stats.blue.


We analyzed our samples using online statistical software at stats.blue.
Using the well-known $t$-procedures, we infer the following:

We are 99% confident that the true average lexical density^{*}
of Wikipedia articles lies somewhere between 54.90% and 58.02%.

We are 99% confident that the true average lexical density^{*}
of Project Gutenberg e-texts lies somewhere between 48.36% and 49.71%.
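
For reference, the textbook one-sample $t$-interval is shown below. We assume this is the form of the procedure used here. The symbols are not taken from the analysis itself: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $t^{*}_{n-1}$ the critical value for 99% confidence with $n-1$ degrees of freedom.
$$
\bar{x} \pm t^{*}_{n-1}\,\frac{s}{\sqrt{n}}
$$
As a quick consistency check, the midpoint of each reported interval should match the sample mean, and half its width is the margin of error:
$$
\begin{array}{lll}
\mbox{Wikipedia:} & \frac{54.90 + 58.02}{2} = 56.46, & \frac{58.02 - 54.90}{2} = 1.56 \\
\mbox{Project Gutenberg:} & \frac{48.36 + 49.71}{2} = 49.035, & \frac{49.71 - 48.36}{2} = 0.675 \\
\end{array}
$$
Both midpoints agree with the reported means of 56.46% and 49.03% up to rounding.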

We see that the 99% confidence intervals do not overlap, so we ran a two-sample $t$-test:
$$
\begin{array}{ll}
H_0: & \mbox{The mean lexical densities are equal.} \\
H_a: & \mbox{The mean lexical density of Wikipedia articles is greater than that of Project Gutenberg e-texts.} \\
\end{array}
$$
The $p$-value is 0 to three decimal places (i.e., the $p$-value is very close to zero).
In plain language, assuming the true means really are equal,
the chance of seeing a difference at least as large as ours is virtually nil.
This is exceedingly strong evidence that Wikipedia articles are in general more lexically dense
(as estimated by this website) than
Project Gutenberg e-texts.
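
As a rough cross-check, the test statistic can be approximated from the summary figures alone. This sketch rests on assumptions not stated in the analysis: it uses Welch's unequal-variance form (the form of the test actually run is not stated), it recovers each standard error from the reported 99% interval using approximate table values for the critical $t$, and it uses a normal approximation for the one-sided $p$-value. The function names are ours.

```python
import math

# Summary figures quoted in the text.
MEAN_WIKI, CI_WIKI = 56.46, (54.90, 58.02)   # n = 150
MEAN_GUT, CI_GUT = 49.03, (48.36, 49.71)     # n = 70

# ASSUMPTION: approximate two-sided 99% critical values of Student's t
# for 149 and 69 degrees of freedom, read from standard tables.
T_CRIT_WIKI = 2.609
T_CRIT_GUT = 2.649


def standard_error_from_ci(ci, t_crit):
    """Recover the standard error from an interval's half-width."""
    lower, upper = ci
    return (upper - lower) / 2.0 / t_crit


def welch_t(mean_a, se_a, mean_b, se_b):
    """Welch's two-sample t statistic from means and standard errors."""
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


def upper_tail_normal(z):
    """P(Z > z) for a standard normal; a large-sample stand-in for t."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


se_wiki = standard_error_from_ci(CI_WIKI, T_CRIT_WIKI)
se_gut = standard_error_from_ci(CI_GUT, T_CRIT_GUT)
t_stat = welch_t(MEAN_WIKI, se_wiki, MEAN_GUT, se_gut)
p_approx = upper_tail_normal(t_stat)

print(f"t statistic: {t_stat:.2f}")
print(f"approximate one-sided p-value: {p_approx:.1e}")
```

Under these assumptions the statistic comes out well above 10, which is consistent with a $p$-value that rounds to 0 at three decimal places.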

Our results indicate that Wikipedia articles are more lexically dense than
Project Gutenberg e-texts, with a difference in sample means of 7.43 percentage points.
They suggest that lexical density may be higher in informative writing than general prose,
but of course to verify such a general claim would require a lot more planning, data collection, and research.

Article on this Website: Lexical Density.

Johansson, V. (2008), Lexical diversity and lexical density in speech and writing: a developmental perspective, *Working Papers* 53, 61-79.

Ure, J. (1971), Lexical density and register differentiation. In G. Perren and J.L.M. Trim (eds), Applications of Linguistics, London: Cambridge University Press. 443-452.

© Analyze My Writing