Concordancing: finding out about words

Vivian Cook  
SLA Topics 

Concordancing is a way of using the computer to find out how words behave. It is used by dictionary makers, language teachers and others to find out how often words occur and what other words they go with and to work out their meanings from the sentences in which they occur. Researchers in children’s language see what words children are using; researchers in second language acquisition find out L2 learners’ mistakes; non-native students of English find out aspects of words practically rather than by looking them up in a dictionary. It is used informally on many pages of this book.

To concordance vocabulary, you first need a set of texts in electronic form: this is your ‘corpus’. A corpus of written language is now easy to collect over the internet from sites like Project Gutenberg with complete novels, from newspaper archives, and the like. Anybody can make their own database of written language; any digital text you can lay your hands on will do, whether a novel by P.G. Wodehouse or the speeches of Tony Blair, both used in this book. It is more difficult to get a corpus of spoken language as this means first recording it, then laboriously writing it down – according to one rule of thumb an hour of tape takes eighteen hours to transcribe. There are rather few transcripts of spoken English available, though they do make up part of large corpora such as the British National Corpus.

Then you need a computer program to carry out the analysis. Google can be used for a simple count of words: feed in immunosurveillance and it lists 58,900 pages; feed in phone and it lists 933 million; so we know something of their relative frequency. But of course Google is counting pages, not words themselves: a word may be used many times on a single page. Google also has many pages in languages other than English, as I found to my cost when I tried to google til as a spelling mistake for till and discovered it was a frequent Scandinavian word.

A program that is specially designed to do this is called a concordancer. Essentially this counts words and works out which words occur together. Some concordancers are free on the internet – search for Compleat Lexical Tutor if you’re interested; others cost reasonable sums like WordSmith, used in this book; the state of the art for professional dictionary makers is Sketch Engine, also available for a comparatively small amount. After you’ve mastered how the program works, you can start asking it questions about your corpus.

P.G. Wodehouse’s 1919 book My Man Jeeves, downloadable from Project Gutenberg, can serve as a test corpus. The novel is 51,431 running words long and has 5,017 different words – information instantly provided by the concordancer.
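As a rough sketch of what lies behind these two figures, the snippet below counts running words (tokens) and different words (types) in a text. The tokenising rule, the function names and the sample sentence are assumptions made for illustration; WordSmith and other concordancers apply their own rules about hyphens, apostrophes and numbers, so their totals will differ somewhat from anything this produces.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words (a simple, assumed scheme)."""
    # Letters, optionally joined by an internal apostrophe, as in girl's.
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())

def corpus_size(text):
    """Return (running words, different words) for a text."""
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))

# Invented sample sentence, not taken from the novel.
sample = "The girl said the girl's aunt was in. The aunt was out."
running, different = corpus_size(sample)
print(running, different)  # 12 8
```

To try it on a whole novel, read the downloaded text file into a string and pass it to corpus_size in place of the sample.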

The easiest question to settle is which words are most frequent in the novel. The top twenty words are in fact:

            the I to a and of you it was he in that said had me on for at his with

In other words the top twenty are ordinary structure words of English, which are very frequent in any piece of written English.
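A frequency list of this kind takes only a few lines to produce. The sketch below uses Python's standard Counter; the function name top_words and the sample sentence are invented for illustration, and the tokenising rule is the same simplifying assumption as before rather than what any particular concordancer does.

```python
import re
from collections import Counter

def top_words(text, n=20):
    """Return the n most frequent words in the text with their counts."""
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(tokens).most_common(n)

# Invented sample; a real run would pass in the full text of a novel.
sample = "the cat and the dog and the bird"
for word, count in top_words(sample, n=2):
    print(word, count)
# the 3
# and 2
```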

But it is also possible to focus on a particular word. Let us take girl, prominent in any P.G. Wodehouse novel. Sure enough girl occurs 50 times, girls 5 times and girl’s 6 times, 61 times in total. So girl effectively occurs once every 869 words. In the 100 million running word British National Corpus the rate for girl is, however, one in 3942 words. P.G. Wodehouse is thus using the word girl 4.5 times more than general English. Of course the word might be common to this style of writing or to this period of English rather than an idiosyncrasy of this author’s style.
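The arithmetic behind such comparisons is simple division. The sketch below recomputes the BNC rate from the 100 million running words and the 25,366 occurrences of girl and girls reported later in this piece, then sets it against the Wodehouse rate quoted above. The function name one_in is made up for the example.

```python
def one_in(total_words, occurrences):
    """Number of running words per occurrence of a word."""
    return total_words / occurrences

# BNC: 100 million running words, 25,366 examples of girl and girls.
bnc_rate = one_in(100_000_000, 25_366)
print(round(bnc_rate))  # 3942

# Wodehouse's quoted rate is one in 869 words, so the comparison is:
wodehouse_rate = 869
print(round(bnc_rate / wodehouse_rate, 1))  # 4.5
```

Stating rates as one in so many words, rather than as raw counts, is what makes texts of very different lengths comparable.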

How does P.G. Wodehouse actually use the word girl? The most useful information that a concordancer provides is a list of all the examples of girl in the corpus along with their surrounding context. Here are the first five occurrences of girl in the novel. The test word girl appears centred in the line. It does take a while to make sense of such displays.


1 y to stop young Gussie marrying a girl on the vaudeville stage, and I
2 apartment one afternoon, shooing a girl in front of him, and said, "Ba
3 so scared Mr. Wooster," said the girl. "We were hoping that you mig
4 orts." "Thank you, sir." The girl made an objection. "But I'm
5 tter of Gussie and the vaudeville girl was still fresh in my mind.

This is called a KWIC (Key Word In Context) display. At first it looks strange, giving about six words before and after the key word girl, cut off in mid-word and mid-sentence. But these chunks often have all the information you need to study the word. If you need more, you can expand any example. For instance No 5 becomes in full:

And the recollection of my Aunt Agatha's attitude in the matter of Gussie and the vaudeville girl was still fresh in my mind.
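A KWIC display can be sketched in a few lines: find each occurrence of the key word, cut a fixed number of characters from either side, and line the results up. The function name kwic and the width parameter are assumptions for this illustration, not the workings of WordSmith or any other program. Note that the simple word-boundary match used here also picks up girl's, but not girls.

```python
import re

def kwic(text, keyword, width=35):
    """Return one numbered line per occurrence of keyword, with up to
    `width` characters of context on each side, cut off mid-word."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for number, match in enumerate(pattern.finditer(flat), start=1):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the key words line up in a column.
        lines.append(f"{number:>3} {left:>{width}}{match.group()}{right}")
    return lines

sentence = ("And the recollection of my Aunt Agatha's attitude in the matter "
            "of Gussie and the vaudeville girl was still fresh in my mind.")
for line in kwic(sentence, "girl", width=20):
    print(line)
```

Expanding an example is then just a matter of calling the function again with a larger width.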

Examples 1 and 2 show that girl goes with the article a, which may be all you need if you are looking for proof that it is a certain kind of noun, namely ‘countable’. Example 1 also shows that girl can be the object of the verb marry within a phrase young Gussie marrying a girl, a high frequency collocation.

The information from a single example of the word girl seems banal. But when it is multiplied by the 61 examples in the novel, it tells you more, and the 25,366 examples of girl and girls in the BNC tell you still more.

What words come near to girl in the text? Taking ten words before and after every occurrence of girl and leaving out function words like the, the words that often occur in the vicinity of girl are: stage, quarrel, man, vaudeville, love, married, engaged, pretty – a fairly good impression of young women’s activities in P.G. Wodehouse books. For a dictionary maker, this gives a good idea what girl meant in this kind of book at a particular period of time.

As a comparison from the same decade, let us take Virginia Woolf’s 1915 novel The Voyage Out, 137,530 words long with 9,542 different words. Already the fact that it has nearly double the number of different words of My Man Jeeves suggests it is more demanding on the reader. Forms of girl occur 48 times or once every 2,865 words, far less often than in P.G. Wodehouse. Here are the first five examples of girl in Woolf.

1 hat her boy was like her and her  girl like Ridley. As for brains t
2 was a bore; Rachel was an unlicked  girl, no doubt prolific of confide
3 ke a child and come cringing to a girl because she wanted to sit wher
4 a child was for her health; as a girl and a young woman was for what
5 ch, -- and as it happened, the only  girl she knew well was a religious

Nothing about vaudeville here! The contexts for girl give a quite different impression from Wodehouse.

To make this more solid, we can compare the proportion of times the two authors use a word. P.G. Wodehouse uses absolutely and pretty ten times as often as Virginia Woolf, boy eight times as often, girl five times as often and old 3.5 times as often. In reverse Virginia Woolf uses people and Mrs nine times more often, men and women five times more often and world four times more often. So, while P.G. Wodehouse writes about boys and girls, Virginia Woolf writes about men and women.
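Comparisons like these rest on relative rather than raw frequency, since the two novels are of very different lengths. A minimal sketch, assuming the same simple tokenising as a concordancer might use and with invented function names and sample texts:

```python
import re
from collections import Counter

def relative_freq(text):
    """Map each word to its share of all the running words in the text."""
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}

def usage_ratio(text_a, text_b, word):
    """How many times as often text_a uses word as text_b does.
    Returns None when the word never occurs in text_b."""
    freq_a = relative_freq(text_a).get(word, 0.0)
    freq_b = relative_freq(text_b).get(word, 0.0)
    return freq_a / freq_b if freq_b else None

# Two invented miniature 'texts', not the novels themselves.
text_a = "boy girl boy girl"
text_b = "men women girl people men women people world"
print(usage_ratio(text_a, text_b, "girl"))  # 4.0
```

Returning None for a word missing from the second text is one possible design choice; it avoids dividing by zero but means such words need separate handling.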

Some of the differences the computer throws up may be trivial. Others raise questions one wouldn’t otherwise have thought of. Why for instance does P.G. Wodehouse use the pronoun I three times as often as Virginia Woolf? Is it just that his characters spend their time in light badinage about each other or is it a more profound aspect of their worldviews?

And this is only the tip of the iceberg. Vast numbers of statistical comparisons can be produced at the click of a mouse. Concordancers provide an instant way of comparing frequencies between any pair of texts that you can enter in digital form. They are also crucial in studying the grammatical patterns of the language, which are outside the scope of this book. Sketch Engine for instance goes far beyond simple counting: a word sketch it produced for the word impression tells you it is usually the object of the verb make, as in to make an impression, that it goes with adjectives such as lasting and misleading, and that it occurs in phrases such as an impression of objectivity/strength/progress. Nor is it just dictionary makers who find concordancing useful. Other users include researchers looking for mistakes made by users of English, literary critics establishing who wrote a text, and forensic linguists trying to test someone’s guilt in a court of law.