CS4:COMP_LN8
                           BUILDING A PARSER


     The aim of a parser is to construct a much  more  flexible  and  accessible
form  of  the  source program that the original text,   or even the token stream
output by the lexical analyser.   It  does  this  with  the  aid  of  a  grammar
describing the language; as a side effect of parsing syntax errors are detected.

     Parsers come in two main types called "top-down" and "bottom-up". Bottom-up
parsers compare the lexical tokens,  often in pairs,  with the grammar detecting
all valid occurrences or juxtapositions of these symbols.   As more symbols  are
drawn in the number of valid alternatives is reduced until only one is left.

     Top-down parsers try all the alternatives  of  the  grammar  in  succession
trying to match them with the statement;  if the end of the statement is reached
then a valid construct has been found. Top-down parsers are claimed to be slower
than  bottom-up  parsers  because  they  may go some way down wrong paths before
discovering an impasse.   On the other hand they are simpler to write  and  work
well in languages where all,  or most,  different constructs are introduced by a
separate keyword. Algol is just such a language. Consequently I will concentrate
on  parsing Algol by top-down methods.   For those interested in a more detailed
treatment of the various methods of parsing there is a very clear description in
the "Principles of Compiler Design" by Aho & Ullman.

Grammar
-------

     Consider the following fragment of grammar:-

     P(DECLN) ::= (TYPE) (IDLIST) |
                  (TYPE) 'array' (IDLIST) (Bplist) | etc

     P(TYPE)  ::= 'integer' | 'real' | 'Boolean';
     P(IDLIST)::= (NAME), (IDLIST) | (NAME);

and the statement

     'real'  X,Y;

     The parser,   having tried other phrases like statement and failed,  is now
trying DECLN.
  The steps are as follows:-

   Try DECLN
           Try Alt 1 scalar declaration
              Try phrase TYPE
                 Try Alt 1 'integer' - fail
                 Try Alt 2 'real'      succeed
              Try phrase IDLIST
                 Try ALT 1
                     Try name  -  succeed - X
                     Check comma  OK
                     Try IDLIST
                         Try ALT 1
                             Try name - succeed Y
                             Check comma - fail
                         Try ALT 2
                             Try name succeed Y
                      IDLIST OK
              IDLIST OK
           SCALAR Declaration OK
  Decln OK  -  line parsed

The result is conceptually a tree

                  DECLN
                   / 
                  /
              ALT/ 1 
                /
               TYPE-----IDLIST
               /       / 
           ALT/2   ALT/1
             /       /
          'real'    NAME (X)---IDLIST
                                   \
                                 ALT\2
                                     \
                                    NAME (Y)

     However,   all we are interested in is the alternatives matched,  and these
can be represented as an alternative list

         1       2       1      ((X))       2       ((Y))
      (DECLN)   type   IDLIST   name      IDLIST    name

leaving vague for the moment the alternative number for as name.

     This example illustrates a  weakness  of  top-down  parsing:-  it  can  not
distinguish  the alternative forms for IDLIST until the comma - it then backs up
which involves reparsing name Y before succeeding. Had ID list been defined -

          P(IDLIST) ::= (IDLIST),(NAME) | (NAME)

the problem is  infinitely worse -  we descend  indefinitely in  IDLIST.
This form, known as left recursion, totally defeats the parser.

  However, we can improve the grammar:-

         P(IDLIST) ::= (NAME)(IDLISTA);
         P(IDLISTA)::= ,(NAME)(IDLISTA)|<empty>;

     This avoids the backing up by bringing the literal (or  terminal)  comma  -
the distinguishing marker - to the start of the phrase. However, this now leaves
IDLIST with only 1 alternative,   which is pointless,   so we now revise  phrase
(DECLN) :-

         P(DECLN) ::= (TYPE)(NAME)(IDLISTA) |
                      (TYPE)'ARRAY'(NAME)(IDLISTA)(BPLIST) | etc

     This has illustrated two techniques for massaging a grammar to give a  good
performance  on a top-down parser.   One can also improve performance by optimal
ordering of the alternatives - phrase type will meet 'integer' much  more  often
than 'real' hence the order. Optimum ordering is not simple since it is affected
both by frequency of occurance and frequency of rejection.   Assignment  is  the
most  frequent simple statement but it is not easily distinguished from a label,
so how should they be ordered?  The answers to this, and other questions, can be
obtained by studying the grammar in ERCC07:STUALGPS.

     The simplest,  though tedious, way to build a recursive decent parser is to
write a separate function for each phrase - these functions call each other when
the definition of one phrase is in the terms of another.

     Assuming the array TOKENS has the lexical tokens and  array  ALIST  is  the
output alternative list one could write a function for phrase type as follows:-

   %integer %fn PTYPE (%integer %name TOKENPTR,ALPTR)
   %integer curr token
   curr token = TOKENS(TOKENPTR);        ! current token
   %if curr token=238  %start;           ! 238 is 'INTEGER'
      ALIST(ALPTR)=1;                    ! FIRST ALT FOUND
      TOKEN PTR=TOKEN PTR+1;             ! TO NEXT TOKEN
      ALPTR=ALPR+1;                      ! TO RECEIVING NEXT
     %RESULT=1;                          ! success
   %finish
   %if curr token=239 %start;            ! 239 is 'REAL'
      ALIST(ALPTR)=2;                    ! 2nd Alt found
      TOKEN PTR=TOKEN PTR+1
      ALPTR=ALPTR+1
      %result=1
   %finish
   %if curr token=240 %start;            ! 240 is 'BOOLEAN'
       ALIST(ALPTR)=3;                   ! 3rd ALT found
       TOKEN PTR=TOKEN PTR+1
       ALPTR=ALPTR+1
       %result=1
   %finish
    %result=0;                           ! no valid alternative found
   %end                                  ! token ptr etc unchanged.

     There are about 100 phrases in  the  grammar  for  Algol  and  writing  100
functions  is  hard  work;   a better scheme is to write an interpreter function
which interprets a phrase definition.   An preprocessor turns the grammar into a
%const %short %integer %array GRAM (256:N).   Literal tokens appear in GRAM as a
token value;   phrases as a pointer in GRAM which must be >=256.   The entry for
phrase TYPE would look like:-
        __________
        !        !     
        !   3    !     Type has three alternatives.
        !--------!     
        !   1    !    The first alternative has 1 entry.  
        !--------!    
        !  238   !    The first alternative is token 'INTEGER'
        !--------!
        !   1    !    The 2nd alternative has 1 entry
        !--------!
        !  239   !    which is token 'REAL' etc.
        !--------!
        !   1    !
        !--------!
        !  240   !
        !________!


     It is easy to write an interpreter to work through this definition and have
the  same effect as function PTYPE coded above.   The alternative <empty> can be
represented by a zero length entry or otherwise.

     This method can not  deal  with  some  phrases,    particularly  names  and
constants;   these have to be provided by hand crafted functions.  The suggested
grammar has 31 such hand crafted or Built in Phrases (BIPs).   About half  these
are action phrases to generate a symbol table,   or dictionary,   while parsing.
Three are frig phrases to avoid backing up (CHKLAB, CHKLPL & NOMORE); two syntax
check names (NAME & ONAME) and two check and evaluate constants.  TEXTTEXT deals
with Algol strings while COMTEXT,   ENDTEXT and LETTERSTRING deal with varieties
of comments.

     Most of these BIPs make no entry in the tree - like literals they guide the
parser.  However,  names have an alternative entered, the value being the number
of the dictionary entry.   Constants and strings  are  entered  into  the  ALIST
preceeded by a type and length entry.

     All valid Algol declarations and statements are alternatives of  P(SS)  and
the  parser  starts  with  the  pointer  to this phrase which is provided by the
grammar pre-processor. It should return with result=1 where upon the alternative
list is linked into the workfile with pointers to its predecessor and successor.
If it fails some error recovery is needed before the next statement is parsed.

     Some of the finer points of parsing must wait until the  symbol  table  has
been described.