Concordancing is a way of using the computer to find out how words behave. It is used by
dictionary makers, language teachers and others to find out how often words
occur and what other words they go with and to work out their meanings from the
sentences in which they occur. Researchers in children’s language see what
words children are using; researchers in second language acquisition find out L2
learners’ mistakes; non-native students of English find out aspects of words
practically rather than by looking them up in a dictionary. It is used
informally on many pages of this book.
To concordance vocabulary, you first need a set of texts in electronic form: this
is your ‘corpus’. A corpus of written language is now easy to collect over
the internet from sites like Project Gutenberg with complete novels, from
newspaper archives, and the like. Anybody can make their own database of written
language; any digital text you can lay your hands on will do, whether a novel by
P.G. Wodehouse or the speeches of Tony Blair, both used in this book. It is more
difficult to get a corpus of spoken language as this means first recording it,
then laboriously writing it down – according to one rule of thumb an hour of
tape takes eighteen hours transcribing. There are rather few transcripts of
spoken English available, though they do make up part of large corpora such as
the British National Corpus.
Next, you need a computer program to carry out the analysis. Google can be used for a
simple count of words: feed in immunosurveillance
and it lists 58,900 pages; feed in phone
and it lists 933 million; so we know something of their relative frequency. But
of course Google is counting pages, not words themselves: a word may be used
many times on a single page. Google also has many pages in languages other than
English, as I found to my cost when I tried to google til as a spelling mistake for till
and discovered it was a frequent Scandinavian word.
A program that is specially designed to do this is called a concordancer.
Essentially this counts words and works out which words occur together. Some
concordancers are free on the internet (search for Compleat Lexical Tutor if you’re interested); others,
like WordSmith, used in this book, cost reasonable sums;
the state of the art for professional dictionary makers is Sketch
Engine, also available for a comparatively small amount. After you’ve
mastered how the program works, you can start asking it questions about your
corpus.
Wodehouse’s 1919 book My Man Jeeves, downloadable
from Project Gutenberg, can be a test case of a corpus. The novel is 51,431
running words long and has 5,017 different words – information instantly
provided by the concordancer.
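For readers who want to see what lies behind these two figures, here is a minimal Python sketch of counting running words and different words. It is not how WordSmith does it: the function names and the rule for what counts as a word (letters, optionally joined by an apostrophe) are my own assumptions, and a different rule would give slightly different totals for the same novel.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out words (letters, with internal apostrophes kept)."""
    # Assumed word rule: runs of letters, optionally joined by ' or ’ (as in girl's).
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())

def token_and_type_counts(text):
    """Return (running words, different words) for a text."""
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))

# A made-up sentence, not a quotation from the novel.
sample = "The girl said the girl's aunt had seen the girls."
running, different = token_and_type_counts(sample)
print(running, different)
```

To try it on a whole novel, read the downloaded text file into a string and pass that string to token_and_type_counts.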
The easiest question to settle is which words are most frequent in the novel. The top
twenty words are in fact:
the I to a and of you it was he in
that said had me on for at his with
In other words, the top twenty are ordinary structure words of English, which are
very frequent in any piece of written English.
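A frequency list of this kind can be sketched in a few lines of Python using the standard library's Counter. This is only an illustration under my own assumptions about what counts as a word; the function name top_words and the sample sentence are invented for the purpose.

```python
from collections import Counter
import re

def top_words(text, n=20):
    """Return the n most frequent words in the text with their counts."""
    # Assumed word rule: runs of letters, optionally joined by an apostrophe.
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(tokens).most_common(n)

# A made-up sentence, not a quotation from the novel.
sample = "the man and the girl and the boy saw the girl"
print(top_words(sample, 3))
```

Even in this tiny sample the structure words the and and come out on top, just as they do in the novel.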
But it is also possible to focus on a particular word. Let us take girl,
prominent in any P.G. Wodehouse novel. Sure enough girl
occurs 50 times, girls 5 times and girl’s
6 times, 61 times in total. So girl
effectively occurs once every 869 words. In the 100 million running word British
National Corpus the rate for girl is,
however, one in 3942 words. P.G. Wodehouse is thus using the word girl
4.5 times more than general English. Of course the word might be common to this
style of writing or to this period of English rather than an idiosyncrasy of
this author’s style.
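The arithmetic behind this comparison is simple enough to check by hand or in Python. The sketch below uses only figures quoted in this section: the 100 million running words of the BNC, its 25,366 examples of girl and girls, and the rate of one in 869 words for My Man Jeeves. The function name one_in is my own.

```python
def one_in(total_words, occurrences):
    """Express a frequency as 'once every N words'."""
    return total_words / occurrences

# BNC figures as quoted in the text.
bnc_rate = one_in(100_000_000, 25_366)

# Rate for My Man Jeeves as quoted in the text (once every 869 words).
jeeves_rate = 869

# How many times more often the word occurs in the novel than in the BNC.
ratio = bnc_rate / jeeves_rate
print(round(bnc_rate), round(ratio, 1))
```

A smaller 'once every N' figure means a more frequent word, which is why the BNC rate is divided by the novel's rate rather than the other way round.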
How does P.G. Wodehouse actually use the word girl?
The most useful information that a concordancer provides is a list of all the
examples of girl in the corpus along
with their surrounding context. Here are the first five occurrences of girl
in the novel. The test word girl
appears centred in the middle of the line. It does take a while to make sense of
the display.
|1||y to stop young Gussie marrying a||girl||on the vaudeville stage, and I|
|2||apartment one afternoon, shooing a||girl||in front of him, and said, "Ba|
|3||so scared Mr. Wosster," said the||girl.||"We were hoping that you mig|
|4||orts." "Thank you, sir." The||girl||made an objection. "But I'm|
|5||tter of Gussie and the vaudeville||girl||was still fresh in my mind.|
This is called a KWIC display (Key Word In Context). At first it looks strange,
giving about six words before and after the key word girl, cut off in mid-word and mid-sentence. But these chunks often have all the information you need to study
the word. If you need more, you can expand any example. For instance No 5
becomes in full:
the recollection of my Aunt Agatha's attitude in the matter of Gussie and the vaudeville girl was still fresh in my mind.
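To show how a KWIC display comes about, here is a rough Python sketch. It is not the method of any particular concordancer: the function name kwic and the choice of 30 characters of context on each side are assumptions of mine, which is why the lines are cut off in mid-word just as in the display above.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per occurrence, with the keyword starting at column `width`."""
    flat = " ".join(text.split())  # collapse line breaks and repeated spaces
    # \b matches at word edges, so this also finds girl inside girl's.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up underneath each other.
        lines.append(f"{left:>{width}}{match.group()}{right}")
    return lines

sample = ("the recollection of my Aunt Agatha's attitude in the matter of "
          "Gussie and the vaudeville girl was still fresh in my mind.")
for line in kwic(sample, "girl"):
    print(line)
```

Expanding an example, as with No 5 above, amounts to calling the same function with a larger width.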
Examples 1 and 2 show that girl goes with the
article a, which may be all you need
if you are looking for proof that it is a certain kind of noun, namely
‘countable’. Example 1 also shows that girl
can be the object of the verb marry
within the phrase young Gussie marrying a
girl, a high frequency collocation.
The information from this example of the word girl
seems banal. But when it is multiplied by the 61 examples in the novel, it tells
you more, and the 25,366 examples of girl
and girls in the BNC tell you still more.
What words come near to girl in the text?
Taking ten words before and after every occurrence of girl and leaving out function words like the, the words that often occur in the vicinity of girl
are: stage, quarrel, man, vaudeville, love, married, engaged, pretty – a
fairly good impression of young women’s activities in P.G. Wodehouse books.
For a dictionary maker, this gives a good idea what girl meant in this kind of book at a particular period of time.
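The idea of looking ten words either side of girl can be sketched in Python as follows. This is an illustration only: the function name collocates is mine, the list of function words is a tiny stand-in for the much longer list a real program would use, and a neighbour that falls inside two overlapping windows is simply counted twice.

```python
from collections import Counter
import re

# A deliberately short, illustrative list of function words to leave out.
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "was", "is",
                  "he", "she", "i", "you", "it", "that", "with", "for", "at",
                  "his", "her"}

def collocates(text, keyword, window=10):
    """Count content words within `window` words before or after each keyword."""
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        counts.update(w for w in neighbours
                      if w not in FUNCTION_WORDS and w != keyword)
    return counts

# A made-up sentence, not a quotation from the novel.
sample = ("He was engaged to a pretty girl on the vaudeville stage "
          "and the girl was pretty")
print(collocates(sample, "girl", window=3).most_common(3))
```

Calling most_common on the result gives the words that turn up most often in the vicinity of the keyword.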
For a comparison from the same decade, let us take Virginia Woolf’s 1915 novel The
Voyage Out, 137,530 words long with 9,542 different words. Already the fact
that it has double the number of individual words of My Man Jeeves suggests it is more demanding on the reader. Forms of girl
occur 48 times or once every 2,865 words, far less than in P.G. Wodehouse. Here
are the first five examples of girl in The Voyage Out:
|1||hat her boy was like her and her||girl||like Ridley. As for brains t|
|2||was a bore; Rachel was an unlicked||girl,||no doubt prolific of confide|
|3||ke a child and come cringing to a||girl||because she wanted to sit wher|
|4||a child was for her health; as a||girl||and a young woman was for what|
|5||ch, -- and as it happened, the only||girl||she knew well was a religious|
Nothing about vaudeville here! The contexts
for girl give a quite different
impression from Wodehouse.
To make this more solid, we can compare the proportion of times the two authors use
a word. P.G. Wodehouse uses absolutely
and pretty ten times as often as
Virginia Woolf, boy eight times as
often, girl five times as often and old
3.5 times as often. Conversely, Virginia Woolf uses people
and Mrs nine times more often, men
and women five times more often and world
four times more often. So, while P.G. Wodehouse writes about boys and girls,
Virginia Woolf writes about men and women.
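Because the two novels differ in length, such comparisons have to be made on proportions rather than raw counts. Here is a minimal Python sketch of that step; the function names are my own and the two sample texts are invented, so the resulting figure says nothing about either author.

```python
from collections import Counter
import re

def relative_frequencies(text):
    """Map each word to its share of all the running words in the text."""
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}

def frequency_ratio(word, text_a, text_b):
    """How many times as often `word` occurs in text_a as in text_b.

    Returns None if the word never occurs in text_b, since no ratio exists.
    """
    freq_a = relative_frequencies(text_a).get(word, 0.0)
    freq_b = relative_frequencies(text_b).get(word, 0.0)
    if freq_b == 0:
        return None
    return freq_a / freq_b

# Two made-up texts standing in for two novels of different lengths.
text_a = "the girl met the boy and the girl laughed"
text_b = "the men and the women saw the girl at the edge of the world"
print(frequency_ratio("girl", text_a, text_b))
```

Swapping the two texts gives the comparison in the other direction, as with people and Mrs above.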
Some of the differences the computer throws up may be trivial. Others raise questions
one wouldn’t otherwise have thought of. Why for instance does P.G. Wodehouse
use the pronoun I three times as often
as Virginia Woolf? Is it just that his characters spend their time in light
badinage about each other or is it a more profound aspect of their worldviews?
But this is only the tip of the iceberg. Vast numbers of statistical comparisons can
be produced at the click of a mouse. Concordancers provide an instant way of
comparing frequencies between any pair of texts that you can enter in digital
form. They are also crucial in studying the grammatical patterns of the
language, which lie outside the scope of this book.
Sketch Engine for instance goes far beyond simple counting: a word sketch it
produced for the word impression tells
you it is usually an object of the verb to make an impression, that it goes with adjectives such as lasting
and misleading, and that it occurs in phrases such as an
impression of objectivity/strength/progress. Nor is it just dictionary
makers who find it useful. Other users include researchers looking for mistakes
made by users of English, literary critics establishing who wrote a text, or
forensic linguists trying to test someone’s guilt in a court of law.