Designing a BASIC Parser for CALL
CALICO Journal, 6, 1, 1988, pp. 50-67.
Writing a parser to give syntactic representations of students' input in BASIC is justified by the limited scope of the grammar required and by the need to integrate with other CALL materials. BAP (BASIC parser) is described in terms of input processing, which normalizes the sentence; word matching, which matches it against key structure words and against its own lexicons; and phrase structure parsing, which examines the constituent structure and sub-categorization features of the sentence. A technique is described, based on DCG parsing in PROLOG, that attempts to consume the words of the sentence one at a time by rule; it is exemplified through simplified versions of its rules for parsing sentence type and the noun phrase, and through examples of its sub-categorization tests for properties of lexical entries. It is suggested that such a parser can form a useful component in many CALL activities.
The overall goal is to create a parser that can provide syntactic representations of the language used by students in the first year of learning English - the typical syntax and vocabulary covered in a beginner's coursebook - and one that can support the student in a range of Computer-Assisted Language Learning (CALL) exercises. Most parsers are written in languages such as LISP and PROLOG because of the greater ease with which they handle natural language compared to BASIC. Why then was it thought necessary to design one in BASIC? A CALL parser deals with simplified language since the beginner has a very limited range of structures at his or her disposal; the complexities that need more sophisticated computer languages are not required. Furthermore, a CALL parser has to deal with a range of syntactic forms produced by students that are systematically deviant from the target language; the parser must be practical and robust, without the frailty of many otherwise more comprehensive systems.
A pragmatic reason for BASIC is that it has characteristically been used for CALL programs for English as a Foreign Language. The microcomputers used in EFL either do not support other languages, or they support them in limited versions that do not make usable programs feasible. For example, the BBC B microcomputer, used as the main language teaching computer in schools, universities and language colleges in Britain and in British Council centres overseas, has a fast and sophisticated BASIC but lends itself to MICRO-PROLOG or LISP with difficulty. As the purpose of this parser was to fit existing practice rather than to be a stand-alone, it was necessary to use BASIC not only to interface with the actual programs used for teaching English but also so that teachers should be able to have access to the program to make their own adaptations. Another less significant reason was the comparative ease of screen presentation and control at the level required for student use; more sophisticated languages typically have far less friendly interfaces with the student or teacher. Finally, the intention was to have something that was practical and usable within a short development period. This resulted in a fairly unsophisticated but robust program that was not simply an academic demonstration of linguistic niceties but worked within the needs of the beginning student. The use of BASIC also meant developing a particular style of parsing which can be called BAP (BASIC Parsing); while the limited aims of the parser allow the program to work, beyond a certain level each increase in parsing ability becomes uneconomic in terms of effort and programming complexity. The first version of this parser was a program written in BBC BASIC for the BBC B microcomputer. The second version was written in QuickBASIC for an IBM PC Compatible.
The essential assumption is that CALL does not require a parser of great sophistication. The problems that limit full parsers do not apply to CALL parsers because the type of language in which they arise simply does not occur. For example, consider the opening of a sentence such as "Have the students met..."; a conventional parser might start by considering that this is likely to be a question such as "Have the students met the regulations?"; if, however, it discovers the sentence stops there as "Have the students met!" it has to abandon its whole analysis and start again with the supposition that it is dealing with an imperative. The influential Marcus parser (Marcus, 1980) used a look-ahead buffer with three constituents precisely to overcome this problem. But this kind of syntactic complexity either does not arise in beginner's English in the first place or is peripheral to the students' needs in any case. The students are in many ways restricted users of language; the limitations of parsing techniques may be no obstacle to their use. Because of the finite needs of the student and of the grammar, many problems that would need complex solutions in a full parser can be dealt with on an ad hoc basis, as we shall see.
The first aim is to cover the syntax used in the first year of learning English. On the one hand, this simplifies the parser by excluding structures that are not usually taught in this period, such as conditional sentences and passives; on the other hand, this means including structures that are normal in language teaching because they are needed for everyday discourse but are not usually covered by parsers. These structures include minor sentences which range from the short answers "Yes," "OK," or "He did," or prepositional phrases "At six," to the short questions "Do you?" and so on. The content of this grammar does not need sophisticated linguistic theories but should be within the traditional structurally-oriented grammars used in language teaching, for ease of use by student and by teacher. The vocabulary similarly can be restricted to the thousand or so words introduced in a typical first year.
A second aim is to make the parser adaptable to a range of teaching exercises. The parser should be an aid available upon student request, wherever the student happens to be in the program. It must be possible to integrate the parser with drill exercises, in which the student's responses can be fully analyzed; with text exercises where the student feeds in an essay and gets sentence by sentence help; with fill-in grammatical exercises where it analyzes the student's answer; or with functional dialogues, where it must assess appropriateness to speech function and to discourse position, albeit with some supplementation in each application.
Third, it also needs to be able to cope with the types of mistakes that L2 learners of English produce and to note them or to make suitable comments about them, if requested. Thus it has to be able to deal with rough but systematic input that represents a stage in the learner's progress and does not necessarily conform to the native speaker's version of English. For example, a sentence such as "What you like?" cannot simply be discarded by the parser as ungrammatical but needs to be marked as lacking the proper word order and agreement. In a sense then it needs to incorporate a simple model of the learner.
Fourth, it should provide information on any word in its lexicon as to syntactic category and sub-classification, and a translation into the student's language, again on request.
The main aim is not to use the parser itself as a type of exercise but to let the students fall back on it as an aid whenever they want to; it is neutral between teaching methods. Rather than forcing the students to use the parser, it is there whenever they require it, rather like a program such as SIDEKICK. The initiative is passed to the students to fit the parser into their own learning styles rather than imposing it on them willy-nilly.
Input Processing in BAP
As an illustration let us assume that the student has typed in a typical sentence such as:
"Where is the London train?"
This might be part of a functional dialogue, a fill-in exercise, or whatever. BAP first needs to convert the input into a string of words in a uniform case. After eliminating sentence-final punctuation marks and ensuring the sentence ends with a space, the entire string is put into lower case letters by adding 32 to any ASCII character between 65 and 90, yielding:

where is the london train
Finally the string is converted into a numbered sequence of words by assigning the characters between spaces to a word. Because of the restriction to first year English, complexities such as acronyms like "UNO" arise in such small numbers that they can be handled individually. Thus the end-product of input processing is:
        Item 1   Item 2   Item 3   Item 4   Item 5
word    where    is       the      london   train
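The input processing stage can be sketched as follows; this is an illustrative Python rendering of the steps just described (the original is in BBC/QuickBASIC), with function and variable names of my own.

```python
def normalize(sentence):
    """Strip sentence-final punctuation, lower-case, and number the words."""
    s = sentence.strip()
    while s and s[-1] in ".!?":          # eliminate sentence-final punctuation
        s = s[:-1]
    # BAP lower-cases by adding 32 to ASCII codes 65-90; chr/ord mirrors that step
    s = "".join(chr(ord(c) + 32) if 65 <= ord(c) <= 90 else c for c in s)
    s += " "                             # ensure the sentence ends with a space
    # assign the characters between spaces to numbered words
    words, current = [], ""
    for c in s:
        if c == " ":
            if current:
                words.append(current)
                current = ""
        else:
            current += c
    return words

print(normalize("Where is the London train?"))
# ['where', 'is', 'the', 'london', 'train']
```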
Word Matching in BAP
BAP requires a routine to establish a part of speech for each word in the sentence. This is achieved by a straightforward matching routine. The basic division is between closed-class structure words, such as "a," "to," and "might," which are listed in the program itself, and open-class lexical words, such as "train" and "see," which are stored in the lexicon. A list of structure words has been built up from previous keyword matching programs described in Cook (1988), amounting to some 220 items, divided into modifiers, pronouns, prepositions, auxiliaries, coordinators, adverbs, wh- words, negatives, and phatics (such as "yes" and "hello"); these are divided into sub-categories such as positive or negative auxiliaries, demonstrative versus determiner modifiers, or object versus subject pronouns. This is the only part of the program that represents a comprehensive part of English rather than a restricted subset; in some ways it is a luxury to have items such as "despite," "thus," or "seldom" in the program. Each word in the sentence is first matched against the keywords and, if a match is found, assigned to the appropriate category. If no match has been found, the program searches three lexicons for a match, namely Noun, Verb, and Adjective. For each of these it normalizes the word by stripping off irrelevant morphological endings such as "ing" or "s," reducing certain double consonants to single consonants, e.g. "stopping" to "stop," and so on; there are also checks against the irregular past tenses such as "went" and irregular past participles such as "gone"; the small number of such irregular items needed at beginner's level are listed in the program. When it matches a lexical item, it copies from the lexicon the syntactic category, the syntactic subclass, and the translation. If no match has been found a second strategy is employed. In the case of Nouns, BAP searches the relevant lexicon once again with the initial letter capitalized, e.g. 
"london" is restored to "London" to match proper nouns. If it still does not find a match, BAP asks the user whether the word is correctly spelled; if the word is indeed spelled correctly, it creates a new lexical entry, requesting the user to supply the basic information for the item. The new lexical entry in BAP consists of the word itself, with initial capital letter for proper names, a syntactic sub-category for each word, e.g. for Nouns, giving C (countable), U (uncountable), H (human), and P (proper), and a literal translation, the specimen language being French. The form of the sentence is now:
            Item 1    Item 2    Item 3    Item 4    Item 5
word        where     is        the       London    train
category    wh-word   aux       det       noun      noun
subclass    -         -         -         proper    count
translation -         -         -         Londres   train
The syntactic category of "is," at present "aux," will need later adaptation since it is functioning as a copula in this sentence. As the lexicon deals only with the few hundred words in the beginner's vocabulary, it is searched in linear order; more efficient means of searching may be necessary when larger lexicons are used. The lexicon is open-ended in that the teacher or student may add new items when necessary. In the IBM QuickBASIC version this is done through APPEND; on the BBC version it involves opening a dummy file to which the existing file is copied
and to which the new item is added, and then substituting this for the original file. The word-matching process can be seen in the following flow diagram.
The matching process then yields a sequence of words, each with a corresponding syntactic category, sub-categorization, and translation.
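The two-stage matching - keywords first, then the lexicons with morphological stripping and the capitalized retry - can be sketched in Python. The tiny word lists here are illustrative stand-ins for BAP's 220 structure words and full lexicons, and the function names are my own.

```python
KEYWORDS = {"where": "wh-word", "is": "aux", "the": "det"}
NOUN_LEXICON = {"train": ("count", "train"), "London": ("proper", "Londres")}
IRREGULAR_PAST = {"went": "go", "gone": "go"}   # listed individually in BAP

def strip_endings(word):
    """Normalize a word by removing endings like -ing or -s and reducing
    doubled consonants, e.g. 'stopping' -> 'stop'."""
    if word in IRREGULAR_PAST:
        return IRREGULAR_PAST[word]
    if word.endswith("ing"):
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2]:   # stopp -> stop
            word = word[:-1]
    elif word.endswith("s"):
        word = word[:-1]
    return word

def match(word):
    """Return (category, subclass, translation) for a word, or None."""
    if word in KEYWORDS:                 # closed-class structure words first
        return (KEYWORDS[word], "-", "-")
    stem = strip_endings(word)
    if stem in NOUN_LEXICON:
        sub, trans = NOUN_LEXICON[stem]
        return ("noun", sub, trans)
    # second strategy for nouns: retry with an initial capital letter
    cap = stem[0].upper() + stem[1:]
    if cap in NOUN_LEXICON:
        sub, trans = NOUN_LEXICON[cap]
        return ("noun", sub, trans)
    return None                          # BAP would now prompt the user

print([match(w) for w in ["where", "is", "the", "london", "train"]])
```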
Phrase Structure Parsing in BAP
The immediate aim for the actual syntactic parsing was to parallel the workings of a DCG (Definite Clause Grammar) parser written in the grammar rules of PROLOG. Fuller descriptions of such parsers can be found in Cook (1987) and in Pereira and Shieber (1987). Such parsers take the familiar form of rewrite rules, such as:
1. sentence --> noun_phrase, verb_phrase.
2. noun_phrase --> noun.
3. noun_phrase --> determiner, noun.
4. verb_phrase --> verb.
5. verb_phrase --> verb, noun_phrase.
Given an input, for example, the sentence "Men sing," this parser starts with the goal of finding a sentence:

sentence?
According to rule 1 this breaks down into the sub-goals of looking for a noun-phrase and a verb-phrase:

noun_phrase? verb_phrase?
Proceeding from the left, rule 2 breaks the goal of noun_phrase? down into the sub-goal of looking for a noun:

noun?
It checks its lexicon to see whether "men" counts as a noun; if this is in fact the case, it has met the goal of noun? and the goal of noun_phrase? It can now consume one word of the sentence and go on to see if the new word, "sing," conforms to the remaining goal in rule 1:

verb_phrase?
This breaks down in rule 4 into the sub-goal:

verb?
If "sing" is a verb, the parser has met the goal of verb? and the goal of verb_phrase?; both sub-goals (noun_phrase? and verb_phrase?) of sentence? have been met and so the main goal of finding a sentence also succeeds.
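This word-consuming style can be sketched as a small Python analogue of rules 1-5 (not BAP's actual BASIC code). Each routine takes a position in the sentence and returns the new position on success or None on failure; alternatives are tried in order, and the sentence goal succeeds only when all the words have been consumed.

```python
LEX = {"men": "noun", "sing": "verb", "like": "verb",
       "the": "determiner", "songs": "noun"}

def word(cat, words, i):
    """Consume one word at position i if its category matches."""
    if i < len(words) and LEX.get(words[i]) == cat:
        return i + 1
    return None

def noun_phrase(words, i):
    j = word("noun", words, i)             # rule 2: noun_phrase --> noun
    if j is not None:
        return j
    j = word("determiner", words, i)       # rule 3: determiner, noun
    if j is not None:
        return word("noun", words, j)
    return None

def verb_phrase(words, i):
    j = word("verb", words, i)
    if j is None:
        return None
    if j == len(words):                    # rule 4: verb_phrase --> verb
        return j
    return noun_phrase(words, j)           # rule 5: verb, noun_phrase

def sentence(s):
    words = s.lower().split()
    i = noun_phrase(words, 0)              # rule 1: noun_phrase, verb_phrase
    if i is None:
        return False
    i = verb_phrase(words, i)
    return i == len(words)                 # goal met only if all words consumed

print(sentence("Men sing"))                # True
print(sentence("The men like songs"))      # True
print(sentence("Sing men"))                # False
```

Note that `verb_phrase` peeks at whether the sentence is finished before choosing between rules 4 and 5; this avoids true backtracking, a simplification in the spirit of BAP's entry tests rather than a faithful DCG.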
When the rule it is using fails, the parser goes on to the next appropriate rule in the program. Let us take a new sentence to see how this operates:
"The chairman will resign."
The parser fails to find a noun using rule 2, so it goes on to the next rule mentioning noun, namely rule 3, and asks:

determiner? noun?
It sees if "the" matches something on the determiner list, succeeds, checks if "chairman" is on the noun list, succeeds, and now can meet the goal of noun_phrase? Similarly, when it tries for a verb-phrase "will resign" and does not succeed with rule 4 in finding a verb, it goes on to rule 5 and asks:

verb? noun_phrase?
both of which sub-goals succeed. It has processed the sentence by passing on to the next appropriate rule whenever it fails; only if it cannot find an appropriate rule does the parse fail. The parsing is top-to-bottom in that it looks for the highest level constituents first, and it is left-to-right in that it consumes the sentence from beginning to end. The parser does not have syntactic transformations or movement of elements in the sentence from one place to another; questions such as "Who did you see?" are not related by movement to sentences such as "You saw who?" Instead it has two rules, one for questions and one for statements, which are tried in succession. In a sense, this reflects an adaptation of the approach found in Generalised Phrase Structure Grammar (GPSG) (Gazdar et al, 1985), a form of syntactic description that has been widely used for parsing. GPSG claims that the relationships between such structures are best handled through metarules that assert that if a sentence of type X is grammatical so is a sentence of type Y. Such enumerations of alternative rules are equivalent to a metarule, a practical solution to the difficulty of actually incorporating metarules in a parser.
BAP simulates, rather than duplicates, this style of DCG parsing in BASIC, starting with the top-down procedure, and left-to-right direction. The goal of BAP is to consume all the words in the sentence. When BAP has consumed all the words, it has found a possible sentence of English; if it has only partially consumed the words of the sentence, it has not found an acceptable sentence. The program can be seen as a way of imposing restrictions on the way words are consumed that mirror the restrictions of English. The crucial BAP procedures are a test of whether all the words have been consumed, called allgone?; a simulated backtracking technique called SETO; a word-consumption procedure called EAT; and sets of individual requirements for particular aspects of phrase structure with the overall title of TESTS.
We can see these in operation by looking at the top level of BAP: the division into sentence types. Simplifying slightly, the basic sentence types employed are inversion questions (starting with an auxiliary), wh- questions (starting with a question word), declaratives (a noun-phrase followed by a verb-phrase), imperatives (verb initial), and minor sentences which include verbless sentences such as "John" and "In Paris" and short form questions and statements such as "Did he?" and "He did." This can be put as the following rules:
Types of sentence
(i) sentence --> inversion question
(ii) sentence --> wh- question
(iii) sentence --> declarative
(iv) sentence --> imperative
(v) sentence --> minor
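These five alternatives can be sketched as a Python model of BAP's top-level control loop (the original is BASIC; the phrase routines here are crude stand-ins and the names are my own apart from SETO, EAT, and allgone?, which follow the article). Each routine has an entry test on the opening categories, the word pointer restarts from zero between attempts (SETO), words are consumed one at a time (EAT), and a parse succeeds only if every word has been consumed (allgone?).

```python
def np(cats, pos):
    """Consume a minimal noun-phrase: optional determiner, then noun(s) or pronoun."""
    if pos < len(cats) and cats[pos] == "det":
        pos += 1
    if pos < len(cats) and cats[pos] in ("noun", "pronoun"):
        pos += 1
        while pos < len(cats) and cats[pos] == "noun":   # "the London train"
            pos += 1
        return pos
    return None

def vp(cats, pos):
    """Consume a minimal verb-phrase: optional auxiliaries, then a verb."""
    while pos < len(cats) and cats[pos] == "aux":
        pos += 1
    if pos < len(cats) and cats[pos] == "verb":
        return pos + 1
    return None

def sentence_type(cats):
    """cats: the category sequence from word matching."""
    routines = [
        ("inversion question", ("aux",), (np, vp)),
        ("wh- question", ("wh-word", "aux"), (np,)),
        ("declarative", (), (np, vp)),
        ("imperative", ("verb",), (np,)),
    ]
    for label, prefix, phrases in routines:
        pos = 0                                      # SETO: start afresh
        if tuple(cats[:len(prefix)]) != prefix:      # entry test
            continue
        pos = len(prefix)                            # EAT the opening words
        for phrase in phrases:
            pos = phrase(cats, pos)
            if pos is None:
                break
        if pos is not None and pos == len(cats):     # allgone?
            return label
    if "verb" not in cats and "aux" not in cats:     # verbless minor sentence
        return "minor"
    return "unknown type"

print(sentence_type(["wh-word", "aux", "det", "noun", "noun"]))
```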
The program goes through the stages seen in the following diagram. BAP uses entry tests to eliminate certain possibilities; it does not try for a minor sentence for instance if there is no verb in the sentence. The initial goal ("inversion question?") represents the appropriate entry test, i.e., whether the first word is an auxiliary; the capital letter items such as INVQ refer to specific parsing routines; allgone? tests if all the words have been consumed. If BAP does not succeed in meeting a goal of a certain sentence type, SETO restores all the relevant variables to zero and passes on to another sentence type. When the parser has managed to consume all the words using a particular path, it exits the system carrying the relevant message about the sentence type it has found; if it gets through the whole system without consuming all the words, the message is "unknown type." Each sentence type is then expanded into various elements, mostly phrases but sometimes other constituents, i.e.:
(vi) declarative --> noun-phrase verb-phrase
(vii) inversion question --> auxiliary noun-phrase verb-phrase.
To test the sub-goal:

declarative?
the program sets two sub-goals, the first of which is:

noun_phrase?
This brings us to the phrase level in BAP, which includes parsing routines for noun-phrase, verb-phrase, and prepositional-phrase. Let us take the noun-phrase as an illustration of how one section works in some detail. When the rule invokes a noun_phrase? goal, an entry test sees if the current word is an adjective, wh- word, determiner, noun, modifier or quantifier so that it can enter the noun-phrase section. A noun-phrase may range from a single pronoun to complex combinations of determiners, demonstratives, relative clauses, and prepositional phrases. Some of the possibilities recognized by BAP are:
(viii) noun-phrase --> premodifier adjective head postmodifier
(ix) premodifier --> demonstrative "this" noun
(x) premodifier --> determiner "the" noun
(xi) premodifier --> pronoun "his" noun
(xii) head --> pronoun "he"
(xiii) head --> noun "John"
(xiv) postmodifier --> relative clause noun "who came in"
(It should perhaps be noted that the rules allow for zero premodifiers and postmodifiers, not shown here, i.e., noun phrases can consist simply of a head noun or pronoun.)
As with sentence types, once BAP has started the noun-phrase, it attempts to consume all the words until it reaches EXIT; if it is successful, it has found a noun-phrase. As soon as it has identified a particular part of the noun-phrase, it carries out specific TESTS, for example, whether it is the right determiner for a proper noun. Then it EATs the word and goes on to the next.
Let us take a noun-phrase to see how this works:
"a man who is tired of London."
The parser first carries out an entry test that decides that the determiner "a" is a suitable opening to a noun-phrase. The first sub-goal of the noun-phrase routine is:

premodifier?
This in turn leads to:

demonstrative?
This goal is not met, so it falls through to:

determiner?
which is met by the determiner "a"; this leads to a set of TESTS relevant to determiners, in this case whether the next word starts with a vowel and whether, if a noun, it is countable and singular; after this the current word is EATen and attention passes to the next word "man." This invokes the sub-goal:

adjective?
which is unsuccessful, and then:

pronoun?
As this is not successful, the next sub-goal is invoked:

noun?
"Man" is indeed a noun, so relevant TESTS are carried out to see if it is a countable noun and therefore permissible for "a," and so on; the word is EATen, and "who" becomes the current word. The next sub-goal is:

postmodifier?
An entry test accepts "who" as a possible opening for a relative clause, which is then treated in a separate routine. If the program gets to EXIT, the goal of finding a noun-phrase has been satisfied; if it has not found a possible noun-phrase head in an appropriate position, it fails the NP system and goes on to the next main goal.
Having identified the word as being EATable, the parser needs to test it for the various limitations on its occurrence, i.e., its
sub-categorization features. These TESTS involve a knowledge of the immediate environment - usually the properties of the last word and the next word. Take the determiner TESTS: if the determiner is "the," it checks whether the part of speech of the following word is noun or adjective; if it is indeed a noun, the test checks to see if its sub-categorization is proper, to exclude "the Paris" or "the Peter" (to be more accurate this only applies if the following noun is a head rather than a modifier, so as not to exclude "the London train"); at beginner's level the exceptions such as "The Hague" or "the Netherlands" are unlikely to occur; those that do can be covered on an ad hoc basis. Alternatively, if the determiner is "a" or "an," it checks whether the following noun is countable, to avoid "a money." The subtleties of when "a money" is grammatical are unnecessary at this level, with the exception of the unavoidable need at the beginner's level to cover "a beer" and similar requests using apparent uncountables with indefinite articles. TESTS also checks whether the next word after "a" starts with a consonant or vowel. The few obvious exceptions at beginner's level can be built in individually, such as "a university" and "an hour."
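The determiner TESTS just described can be sketched as follows; this is an illustrative Python stand-in for the BASIC routines, and the exception list is a tiny sample, not BAP's.

```python
VOWEL_EXCEPTIONS = {"university": False, "hour": True}   # True = takes "an"

def determiner_tests(det, next_word, next_cat, next_sub, next_is_head=True):
    """Return error messages for the determiner; empty means it may be EATen."""
    errors = []
    if det == "the":
        # exclude "the Paris", but allow "the London train", where the
        # proper noun is a modifier rather than the head
        if next_cat == "noun" and next_sub == "proper" and next_is_head:
            errors.append("THE not used before a proper noun")
    elif det in ("a", "an"):
        if next_cat == "noun" and next_sub == "uncount":
            errors.append("A/AN not used before an uncountable noun")
        takes_an = VOWEL_EXCEPTIONS.get(next_word, next_word[0] in "aeiou")
        if det == "a" and takes_an:
            errors.append("AN needed before a vowel sound")
        elif det == "an" and not takes_an:
            errors.append("A needed before a consonant sound")
    return errors

print(determiner_tests("a", "apple", "noun", "count"))
print(determiner_tests("the", "Paris", "noun", "proper"))
print(determiner_tests("the", "London", "noun", "proper", next_is_head=False))
```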
There are similar sub-categorization tests involving features of the Verb. Verbs are stored with a description of what they may be followed by, i.e., single objects as in "I like melons," double objects as in "I gave him it," prepositional phrases as in "I went to the shops," -ing nominalizations such as "He likes going to the cinema," and so on. When the Verb is about to be EATen, the local environment is tested to see whether these requirements are met.
And so on for the other specific TESTS. The aim is to anticipate the studentsí possible mistakes rather than to reject the sentence outright. Mostly the TESTS require information about the last word and the next word, sometimes the next word but one, incorporated in the EAT procedure. This seems a tighter requirement than that of the Marcus parser whose buffer contains information about the next three constituents which may be any number of words.
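The verb sub-categorization test can be sketched in the same hedged Python style: the frame names ("np", "np_np", "pp", "ing") and the mini-lexicon are my own shorthand for the descriptions stored with each verb, and the environment classification is deliberately crude.

```python
VERB_FRAMES = {
    "like": {"np", "ing"},      # "I like melons", "He likes going ..."
    "give": {"np", "np_np"},    # "I gave him it"
    "go":   {"pp"},             # "I went to the shops"
}

def observed_frame(following_cats):
    """Crudely classify the local environment after the verb."""
    if not following_cats:
        return "none"
    first = following_cats[0]
    if first == "prep":
        return "pp"
    if first == "ing-verb":
        return "ing"
    if first in ("det", "noun", "pronoun"):
        rest = following_cats[1:]
        if first == "pronoun" and rest and rest[0] in ("det", "noun", "pronoun"):
            return "np_np"                      # double object: "gave him it"
        return "np"
    return "other"

def verb_test(verb, following_cats):
    """True if the local environment satisfies the verb's stored requirements."""
    return observed_frame(following_cats) in VERB_FRAMES.get(verb, {"none"})

print(verb_test("go", ["prep", "det", "noun"]))   # "went to the shops"
print(verb_test("go", ["det", "noun"]))           # *"go the station"
```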
Since BASIC is not set up to use features in the same way as PROLOG, BAP essentially has to use dimensioned variables to indicate various grammatical features, the three most prominent being number agreement, case, and tense.
Number Agreement. Each noun-phrase and each verb-phrase has to be marked for singular or plural and a routine provided for establishing which goes with which. Take:
"A man likes books."
Establishing the number of the two noun-phrases means finding "a man" is singular and "books" is plural. Chiefly, this requires testing for final "s" but also requires a list of the exceptions such as "people" or "news" that may be found at this level. Tests also have to be made of students' propensity for omitting plural "s"; of the use of "a" with a singular noun, and for the agreement of "this/that" and "these/those" with the head noun. Finding the number of the verb-phrase again rests chiefly on the presence of "s" on the main verb to denote singular, anything else being plural; number is associated with the first auxiliary in a complex phrase such as "is going."
The actual agreement of number between noun-phrase and verb-phrase is done by a reiterative matching process: noun-phrase 1 is matched against verb-phrase 1 so that in the sentence:
"The man likes the girl."
the singular number for noun-phrase 1 "the man" is matched against the singular number for verb-phrase 1 "likes." Some important exceptions to this simple reiterative process at the beginner's level are:
- "There" subjects such as "There are some books in here," where the matching needs to be between verb-phrase 1 and noun-phrase 2.
- Relative clauses such as "The man the girl saw liked the books," where the agreeing phrases are not adjacent.
- Coordinated noun-phrases such as "The man and the girl like books," where two singular noun-phrases need to be given plural number jointly.
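The agreement routine, with the "there" and coordination exceptions above handled by flags, can be sketched in Python (an illustrative stand-in for the BASIC code; the exception lists are samples only).

```python
PLURAL_EXCEPTIONS = {"people": "plural", "news": "singular"}

def noun_number(head):
    """Chiefly a test for final 's', plus the listed exceptions."""
    if head in PLURAL_EXCEPTIONS:
        return PLURAL_EXCEPTIONS[head]
    return "plural" if head.endswith("s") else "singular"

def verb_number(first_item):
    """Number rests on the first auxiliary or main verb: final 's' means
    singular, anything else plural (present tense only, simplified)."""
    return "singular" if first_item.endswith("s") else "plural"

def check_agreement(np_heads, vp_firsts, there_subject=False, coordinated=False):
    """Match noun-phrase i against verb-phrase i; return error messages."""
    errors = []
    for i, vp in enumerate(vp_firsts):
        if there_subject:
            np_num = noun_number(np_heads[i + 1])   # match VP 1 with NP 2
        elif coordinated:
            np_num = "plural"                       # "the man and the girl"
        else:
            np_num = noun_number(np_heads[i])
        if np_num != verb_number(vp):
            errors.append("Singular subject needs a singular verb"
                          if np_num == "singular"
                          else "Plural subject needs a plural verb")
    return errors

print(check_agreement(["man"], ["likes"]))   # []
print(check_agreement(["man"], ["like"]))    # flagged
```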
Tense. The features of the verb-phrase have to be coherent. Mostly this is a question of checking for permissible sequences of items. If the main verb ends in -ing, there must be a preceding auxiliary with a form of "be" constituting a present or past continuous. If the main verb is a past participle, there must be a preceding auxiliary with "be" to make up a passive, or a preceding auxiliary with "have" to make up present or past perfect. Because of the well-known student mistakes of the form "Do he liked?" or "Did he went?", BAP must note that only one item in each verb-phrase can be marked for tense, and that this must be the first item, whether auxiliary or verb.
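These sequence checks can be sketched as follows; the form labels ('tensed', 'base', 'ing', 'pastpart') are my own shorthand, not BAP's.

```python
BE_FORMS = ("am", "is", "are", "was", "were", "be", "been")
HAVE_FORMS = ("have", "has", "had")

def check_verb_phrase(items):
    """items: (word, form) pairs in order, auxiliaries first,
    e.g. [('did', 'tensed'), ('go', 'base')]."""
    errors = []
    words = [w for w, f in items]
    forms = [f for w, f in items]
    if forms[0] != "tensed":
        errors.append("The first item of the verb phrase must carry the tense")
    if "tensed" in forms[1:]:
        errors.append("Only one item can be marked for tense")   # *"Did he went?"
    if forms[-1] == "ing" and not any(w in BE_FORMS for w in words[:-1]):
        errors.append('An -ing verb needs a form of "be" before it')
    if forms[-1] == "pastpart" and not any(w in BE_FORMS + HAVE_FORMS
                                           for w in words[:-1]):
        errors.append('A past participle needs "have" or "be" before it')
    return errors

print(check_verb_phrase([("did", "tensed"), ("went", "tensed")]))
print(check_verb_phrase([("is", "tensed"), ("going", "ing")]))
```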
Case. In English, Case as a surface feature manifested in the morphological endings of nouns is not relevant except in the personal pronoun system. BAP handles this simply by testing for the correct case form at the appropriate point: for nominative case ("I," "he," etc.) after the initial noun-phrase in a declarative; for accusative case ("me," "him," and so on) after the noun-phrase in the verb-phrase or after the preposition in a prepositional-phrase; for genitive case ("my," "her," and so on) when a pronoun precedes a noun.
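A minimal sketch of this case test in Python ("you" and "it" do not change form and are omitted; "her" is deliberately in two sets):

```python
NOMINATIVE = {"I", "he", "she", "we", "they"}
ACCUSATIVE = {"me", "him", "her", "us", "them"}
GENITIVE   = {"my", "his", "her", "our", "their"}

def check_case(pronoun, position):
    """position: 'subject' (initial NP of a declarative), 'object' (NP inside
    the verb-phrase or after a preposition), 'before_noun' (genitive)."""
    wanted = {"subject": NOMINATIVE,
              "object": ACCUSATIVE,
              "before_noun": GENITIVE}[position]
    if pronoun in wanted:
        return None
    return pronoun + " is the wrong case form here"

print(check_case("him", "subject"))   # flagged: *"Him likes books"
print(check_case("me", "object"))     # None
```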
Finally the program has to store sets of error messages to inform the student if he or she has gone wrong. Each time a test fails, a message number is stored; at the end of the parse the appropriate message is produced. Students are able to select whether they simply want to know whether the sentence is right or wrong, or whether they want specific help with the syntax, the vocabulary, or a literal translation. If they have asked for all three, a sentence such as:
"An man go the station."
produces the following display:
"You have produced a comprehensible example of a declarative sentence with 3 mistakes:
AN only before a vowel
Singular subject needs a singular verb
GO needs to have "to" after it
man = homme   go = aller   station = gare
The terminology and content of these error messages clearly need adaptation in the light of the needs of a particular group of students; they might in fact be given in the student's own language. Or, indeed, the information might be supplied in different ways; for example, by displaying the error and asking the student to check which of four possible alternatives is the right error message, or by showing tree or box diagrams of the sentence structure, etc.
The BAP approach outlined above shows that it is indeed possible to parse beginner's level English in BASIC on microcomputers. The program is not particularly lengthy, the parsing section itself amounting in BBC BASIC to some 260 actual lines. Nor does the program take particularly long to parse a sentence - usually less than ten seconds, a major component being lexical search. Obviously there are inconveniences in using BASIC; small gains in syntactic accuracy beyond the level described here are achieved at the cost of increasingly complicated and spaghetti-like BASIC routines. Nevertheless, BAP works within the limits set for it.
The final question is what to do with it, which is the subject of a further article. Briefly the BBC B version has been used as:
- an independent stand-alone for students to use when they wish. In this case it has formed part of a suite of programs called Communicative Learning Environment, which also includes a dictionary and a rule-based spelling-to-phonetic-script converter.
- an aid in functional dialogue role-plays such as simulated conversations at railway stations. In this case it is expanded with a section that accepts only certain structures as appropriate to certain speech functions, e.g. inversion questions for polite requests.
- an aid in continuous text writing by the students, i.e., essays, letters, and so on. The parser takes each sentence from the text in turn and evaluates it, rather like the spelling checkers attached to word processors.
Further developments of BAP are intended to include its use in:
- structure drills, in which the student's answers can be parsed.
- fill-in or cloze style exercises, in which the appropriateness of the filled-in item can be tested.
- grammatical explanations, where, rather than laconic messages, fuller explanations can be provided or cross-referred to grammar books.
It would seem in principle that BAP is sufficiently flexible to be suitable for all these uses. In due course BAP will provide a parser module that CALL exercise writers in BASIC can exploit for their own needs and ends. But the point of BAP is, in a sense, to demonstrate that it is possible to utilize ideas from the field of Natural Language Processing in CALL. The fact that the parser only partially covers English grammar and does not attempt to tackle some of the crucial issues of mainstream parsing does not disqualify it from CALL use where the immediate needs of the beginning students require a parser that deals with obvious facts such as agreement or noun-phrase structure, and a limited vocabulary suitable for their own restricted uses, rather than for anything more sophisticated.
Cook, V.J. "Natural language processing as interface between CALL and computing," Fremdsprachen und Hochschule, AKS, 20, 1987, 17-26.
Cook, V.J. "Communicative teaching with the computer", English Language Teaching Journal, to appear 1988.
Gazdar, G., et al. Generalised Phrase Structure Grammar, Oxford, Blackwell, 1985.
Marcus, M.P. A Theory of Syntactic Recognition for Natural Language, MIT Press, 1980.
Pereira, F.C.N. and S.M. Shieber. PROLOG and Natural-Language Analysis, CSLI (Center for the Study of Language and Information), Stanford, 1987.