The One Million Word Spelling Corpus
The Spelling Stats pages mostly draw on a corpus of about one million words of printed running text (1,014,931 word tokens; 4,714,131 characters excluding spaces), consisting of twentieth-century English novels, reports, newspapers and articles, listed below. Frequencies counted over running text (tokens) are not the same as frequencies based on word frequency lists (types), e.g. Venezky (1970) with 20,000, Carney (1992) with 26,408 and Gontijo et al. (2003) with 160,395 separate word forms. Like all available corpora, this one exaggerates the importance of the printed text of books, newspapers etc. at the expense of the hard-to-obtain ephemeral texts of everyday life. It uses texts that were conveniently to hand in digital form. Though they are not representative of the range of English as a whole, spelling is so ubiquitous that the effects are still meaningful. Code-breakers, after all, needed only a hundred words or so to establish letter frequencies in coded text (Gaines, 1938).
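The difference between token-based and type-based counts can be illustrated with a minimal Python sketch. The sample sentence, the function names `tokenize` and `letter_frequencies`, and the simple `[a-z]+` tokenizer are assumptions made for illustration only; they are not the procedure used to build the Spelling Stats figures, and the tokenizer would, for instance, split words at apostrophes.

```python
from collections import Counter
import re


def tokenize(text):
    """Lower-case the text and return its alphabetic words in running order."""
    return re.findall(r"[a-z]+", text.lower())


def letter_frequencies(words):
    """Return each letter's share of all the letters in the given words."""
    counts = Counter("".join(words))
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}


# Hypothetical sample text, not taken from the corpus.
sample = "the cat sat on the mat and the dog sat on the log"

tokens = tokenize(sample)      # running text: every occurrence counts
types = sorted(set(tokens))    # word list: each distinct form counts once

token_freq = letter_frequencies(tokens)
type_freq = letter_frequencies(types)

for letter in "th":
    print(letter, round(token_freq[letter], 3), round(type_freq[letter], 3))
```

In this toy sample the letter h is relatively more frequent in the token count than in the type count, because "the" is counted on every occurrence in running text but only once in a word list.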
Author | Title | Genre | Word Count |
A. Christie | The Mysterious Affair at Styles | Fiction | 57,106 |
D.H. Lawrence | Sons and Lovers | Fiction | 160,972 |
H. Harrison | Deathworld | Fiction | 58,039 |
N Y Times | Online articles | Newspaper | 25,296 |
Guardian | Online articles | Newspaper | 30,194 |
E. Hemingway | The Old Man and the Sea | Fiction | 26,597 |
n.a. | iPhone Manual | Manual | 50,595 |
n.a. | Languages: the Next Generation | Report | 42,535 |
P.G. Wodehouse | Love among the Chickens | Fiction | 50,796 |
U. Sinclair | The Jungle | Fiction | 150,753 |
E. O'Neill | Anna Christie | Play | 25,296 |
D. Graddol | English Next | Report | 36,952 |
J. Galsworthy | Awakening and To Let | Fiction | 96,387 |
M. Reynolds | Adaptation | Fiction | 23,594 |
J. Joyce | Portrait of the Artist as a Young Man | Fiction | 86,011 |
n.a. | Common European Framework | Report | 93,808 |
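As a quick consistency check, the word counts in the table can be totalled and grouped by genre. The following is a sketch only: the `rows` list is transcribed by hand from the table above, and the variable names are illustrative.

```python
from collections import defaultdict

# (title, genre, word count), transcribed from the table above.
rows = [
    ("The Mysterious Affair at Styles", "Fiction", 57106),
    ("Sons and Lovers", "Fiction", 160972),
    ("Deathworld", "Fiction", 58039),
    ("N Y Times online articles", "Newspaper", 25296),
    ("Guardian online articles", "Newspaper", 30194),
    ("The Old Man and the Sea", "Fiction", 26597),
    ("I-phone Manual", "Manual", 50595),
    ("Languages: the Next Generation", "Report", 42535),
    ("Love among the Chickens", "Fiction", 50796),
    ("The Jungle", "Fiction", 150753),
    ("Anna Christie", "Play", 25296),
    ("English Next", "Report", 36952),
    ("Awakening and to Let", "Fiction", 96387),
    ("Adaptation", "Fiction", 23594),
    ("Portrait of the Artist as a Young Man", "Fiction", 86011),
    ("Common European Framework", "Report", 93808),
]

total = sum(count for _, _, count in rows)

# Sum the word counts for each genre.
by_genre = defaultdict(int)
for _, genre, count in rows:
    by_genre[genre] += count

print(f"Total: {total:,}")
for genre, count in sorted(by_genre.items(), key=lambda item: -item[1]):
    print(f"{genre:<10} {count:>9,} {count / total:6.1%}")
```

The sixteen counts sum to 1,014,931, matching the total quoted for the corpus, and fiction accounts for roughly 70% of the words.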