CS4:COMP_LN8 BUILDING A PARSER The aim of a parser is to construct a much more flexible and accessible form of the source program that the original text, or even the token stream output by the lexical analyser. It does this with the aid of a grammar describing the language; as a side effect of parsing syntax errors are detected. Parsers come in two main types called "top-down" and "bottom-up". Bottom-up parsers compare the lexical tokens, often in pairs, with the grammar detecting all valid occurrences or juxtapositions of these symbols. As more symbols are drawn in the number of valid alternatives is reduced until only one is left. Top-down parsers try all the alternatives of the grammar in succession trying to match them with the statement; if the end of the statement is reached then a valid construct has been found. Top-down parsers are claimed to be slower than bottom-up parsers because they may go some way down wrong paths before discovering an impasse. On the other hand they are simpler to write and work well in languages where all, or most, different constructs are introduced by a separate keyword. Algol is just such a language. Consequently I will concentrate on parsing Algol by top-down methods. For those interested in a more detailed treatment of the various methods of parsing there is a very clear description in the "Principles of Compiler Design" by Aho & Ullman. Grammar ------- Consider the following fragment of grammar:- P(DECLN) ::= (TYPE) (IDLIST) | (TYPE) 'array' (IDLIST) (Bplist) | etc P(TYPE) ::= 'integer' | 'real' | 'Boolean'; P(IDLIST)::= (NAME), (IDLIST) | (NAME); and the statement 'real' X,Y; The parser, having tried other phrases like statement and failed, is now trying DECLN. The steps are as follows:- Try DECLN Try Alt 1 scalar declaration Try phrase TYPE Try Alt 1 'integer' - fail Try Alt 2 'real' succeed Try phrase IDLIST Try ALT 1 Try name - succeed - X Check comma OK Try IDLIST Try ALT 1 Try name - succeed Y Check comma - fail Try ALT 2 Try name succeed Y IDLIST OK IDLIST OK SCALAR Declaration OK Decln OK - line parsed The result is conceptually a tree DECLN / / ALT/ 1 / TYPE-----IDLIST / / ALT/2 ALT/1 / / 'real' NAME (X)---IDLIST \ ALT\2 \ NAME (Y) However, all we are interested in is the alternatives matched, and these can be represented as an alternative list 1 2 1 ((X)) 2 ((Y)) (DECLN) type IDLIST name IDLIST name leaving vague for the moment the alternative number for as name. This example illustrates a weakness of top-down parsing:- it can not distinguish the alternative forms for IDLIST until the comma - it then backs up which involves reparsing name Y before succeeding. Had ID list been defined - P(IDLIST) ::= (IDLIST),(NAME) | (NAME) the problem is infinitely worse - we descend indefinitely in IDLIST. This form, known as left recursion, totally defeats the parser. However, we can improve the grammar:- P(IDLIST) ::= (NAME)(IDLISTA); P(IDLISTA)::= ,(NAME)(IDLISTA)|; This avoids the backing up by bringing the literal (or terminal) comma - the distinguishing marker - to the start of the phrase. However, this now leaves IDLIST with only 1 alternative, which is pointless, so we now revise phrase (DECLN) :- P(DECLN) ::= (TYPE)(NAME)(IDLISTA) | (TYPE)'ARRAY'(NAME)(IDLISTA)(BPLIST) | etc This has illustrated two techniques for massaging a grammar to give a good performance on a top-down parser. One can also improve performance by optimal ordering of the alternatives - phrase type will meet 'integer' much more often than 'real' hence the order. Optimum ordering is not simple since it is affected both by frequency of occurance and frequency of rejection. Assignment is the most frequent simple statement but it is not easily distinguished from a label, so how should they be ordered? The answers to this, and other questions, can be obtained by studying the grammar in ERCC07:STUALGPS. The simplest, though tedious, way to build a recursive decent parser is to write a separate function for each phrase - these functions call each other when the definition of one phrase is in the terms of another. Assuming the array TOKENS has the lexical tokens and array ALIST is the output alternative list one could write a function for phrase type as follows:- %integer %fn PTYPE (%integer %name TOKENPTR,ALPTR) %integer curr token curr token = TOKENS(TOKENPTR); ! current token %if curr token=238 %start; ! 238 is 'INTEGER' ALIST(ALPTR)=1; ! FIRST ALT FOUND TOKEN PTR=TOKEN PTR+1; ! TO NEXT TOKEN ALPTR=ALPR+1; ! TO RECEIVING NEXT %RESULT=1; ! success %finish %if curr token=239 %start; ! 239 is 'REAL' ALIST(ALPTR)=2; ! 2nd Alt found TOKEN PTR=TOKEN PTR+1 ALPTR=ALPTR+1 %result=1 %finish %if curr token=240 %start; ! 240 is 'BOOLEAN' ALIST(ALPTR)=3; ! 3rd ALT found TOKEN PTR=TOKEN PTR+1 ALPTR=ALPTR+1 %result=1 %finish %result=0; ! no valid alternative found %end ! token ptr etc unchanged. There are about 100 phrases in the grammar for Algol and writing 100 functions is hard work; a better scheme is to write an interpreter function which interprets a phrase definition. An preprocessor turns the grammar into a %const %short %integer %array GRAM (256:N). Literal tokens appear in GRAM as a token value; phrases as a pointer in GRAM which must be >=256. The entry for phrase TYPE would look like:- __________ ! ! ! 3 ! Type has three alternatives. !--------! ! 1 ! The first alternative has 1 entry. !--------! ! 238 ! The first alternative is token 'INTEGER' !--------! ! 1 ! The 2nd alternative has 1 entry !--------! ! 239 ! which is token 'REAL' etc. !--------! ! 1 ! !--------! ! 240 ! !________! It is easy to write an interpreter to work through this definition and have the same effect as function PTYPE coded above. The alternative can be represented by a zero length entry or otherwise. This method can not deal with some phrases, particularly names and constants; these have to be provided by hand crafted functions. The suggested grammar has 31 such hand crafted or Built in Phrases (BIPs). About half these are action phrases to generate a symbol table, or dictionary, while parsing. Three are frig phrases to avoid backing up (CHKLAB, CHKLPL & NOMORE); two syntax check names (NAME & ONAME) and two check and evaluate constants. TEXTTEXT deals with Algol strings while COMTEXT, ENDTEXT and LETTERSTRING deal with varieties of comments. Most of these BIPs make no entry in the tree - like literals they guide the parser. However, names have an alternative entered, the value being the number of the dictionary entry. Constants and strings are entered into the ALIST preceeded by a type and length entry. All valid Algol declarations and statements are alternatives of P(SS) and the parser starts with the pointer to this phrase which is provided by the grammar pre-processor. It should return with result=1 where upon the alternative list is linked into the workfile with pointers to its predecessor and successor. If it fails some error recovery is needed before the next statement is parsed. Some of the finer points of parsing must wait until the symbol table has been described.