The One Million Word Corpus

Vivian Cook Spelling Stats

The Spelling Stats pages mostly draw on a corpus of about one million words of running text (1,014,931 word tokens; 4,714,131 characters excluding spaces), consisting of twentieth century English novels, reports, newspapers and articles, listed below. Frequencies counted over running text (tokens) are not the same as frequencies based on lists of distinct word forms (types), e.g. Venezky (1970), 20,000; Carney (1992), 26,408; Gontijo et al (2003), 160,395 separate word forms. Like all available corpora, this one exaggerates the importance of the printed text of books, newspapers etc at the expense of the hard-to-obtain ephemeral texts of everyday life. It uses texts that were conveniently to hand in digital form. Though they are not representative of the range of English as a whole, spelling is so ubiquitous that the effects found are still meaningful. By comparison, code-breakers needed only a hundred words or so to establish letter frequencies in coded text (Gaines 1938).
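The difference between counting tokens and counting types can be illustrated with a short Python sketch. It is not the procedure used to build the Spelling Stats figures: the tokenising rule (runs of letters, with internal apostrophes allowed) and the names tokenize and corpus_stats are assumptions made only for this example.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed rule for this sketch: a word is a run of letters,
    # optionally with internal apostrophes (e.g. "don't").
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

def corpus_stats(text):
    tokens = tokenize(text)                      # every running word
    type_freq = Counter(tokens)                  # distinct word forms and their counts
    letter_freq = Counter(ch for ch in text.lower() if ch.isalpha())
    chars_no_spaces = sum(1 for ch in text if not ch.isspace())
    return {
        "tokens": len(tokens),
        "types": len(type_freq),
        "chars_no_spaces": chars_no_spaces,
        "type_freq": type_freq,
        "letter_freq": letter_freq,
    }

sample = "The cat sat on the mat. The dog sat too."
stats = corpus_stats(sample)
print(stats["tokens"], stats["types"])           # 10 tokens, 7 types
print(stats["letter_freq"].most_common(3))
```

In the sample, "the" contributes three tokens but only one type, which is why a frequency based on running text weights common words far more heavily than one based on a word list.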

Author          Title                                  Genre      Word Count
A. Christie     The Mysterious Affair at Styles        Fiction        57,106
D.H. Lawrence   Sons and Lovers                        Fiction       160,972
H. Harrison     Deathworld                             Fiction        58,039
N Y Times       Online articles                        Newspaper      25,296
Guardian        Online articles                        Newspaper      30,194
E. Hemingway    The Old Man and the Sea                Fiction        26,597
n.a.            I-phone Manual                         Manual         50,595
n.a.            Languages: the Next Generation         Report         42,535
P.G. Wodehouse  Love among the Chickens                Fiction        50,796
U. Sinclair     The Jungle                             Fiction       150,753
E. O'Neill      Anna Christie                          Play           25,296
D. Graddol      English Next                           Report         36,952
J. Galsworthy   Awakening and to Let                   Fiction        96,387
M. Reynolds     Adaptation                             Fiction        23,594
J. Joyce        Portrait of the Artist as a Young Man  Fiction        86,011
n.a.            Common European Framework              Report         93,808
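
As a quick arithmetic check, the word counts listed above can be added up and compared with the corpus total of 1,014,931 words given in the description. The Python sketch below does this and also groups the counts by genre; the variable names are chosen only for this example, and the figures are copied from the list above.

```python
from collections import defaultdict

# (source, genre, word count) copied from the list of texts
texts = [
    ("The Mysterious Affair at Styles", "Fiction", 57106),
    ("Sons and Lovers", "Fiction", 160972),
    ("Deathworld", "Fiction", 58039),
    ("N Y Times online articles", "Newspaper", 25296),
    ("Guardian online articles", "Newspaper", 30194),
    ("The Old Man and the Sea", "Fiction", 26597),
    ("I-phone Manual", "Manual", 50595),
    ("Languages: the Next Generation", "Report", 42535),
    ("Love among the Chickens", "Fiction", 50796),
    ("The Jungle", "Fiction", 150753),
    ("Anna Christie", "Play", 25296),
    ("English Next", "Report", 36952),
    ("Awakening and to Let", "Fiction", 96387),
    ("Adaptation", "Fiction", 23594),
    ("Portrait of the Artist as a Young Man", "Fiction", 86011),
    ("Common European Framework", "Report", 93808),
]

total_words = sum(count for _, _, count in texts)

# Sum the word counts for each genre
by_genre = defaultdict(int)
for _, genre, count in texts:
    by_genre[genre] += count

print(total_words)                        # 1014931, matching the stated corpus size
for genre, count in sorted(by_genre.items()):
    print(f"{genre:10} {count:>9,} {count / total_words:6.1%}")

# Mean characters per word, using the character count stated in the description
total_chars = 4_714_131
print(round(total_chars / total_words, 2))
```

The genre breakdown makes the earlier caveat concrete: fiction supplies well over half of the running words, so the corpus leans heavily towards printed novels.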