CS4:COMP  LN7

THE LEXICAL ANALYSER

The lexical analyser is that part of the compiler which reads in the original source text and arranges it in a form for the parser. It can be an almost nonexistent entity - just handing over the symbols as read - or it can be very complex and do much of the work of the parser - identifying keywords and building a dictionary of the names.

There are a number of reasons for having a separate lexical analyser:-

1. Handling characters is expensive - very expensive on some architectures. A separate scanner enables one to concentrate on doing the text handling efficiently.

2. The tokens handed over to the parser are potentially much more useful than the raw characters. This is particularly true of reserved-word languages.

3. Some languages come in different hardware representations. By having two or more scanners the parser is isolated from these problems.

4. Certain awkward problems, e.g. the Fortran DO 10 I=1...., may be resolved more easily at the lexical level.

The scanner needed for Algol is of the simpler variety; it will normally:-

- Discard spaces, tabs and newlines.
- Reduce the keywords to a single token.
- Resolve certain compound symbols, e.g. "(/" -> "[" and "**" -> "^".
- Close "open" strings.

It will not do some things commonly associated with a scanner:-

- Discard comments - Algol comments are difficult to recognise lexically.
- Build the names into a dictionary - Algol dictionaries are block related.

It is a very close decision whether the scanner should reduce two-character symbols like ":=" to a single token. However, top-down parsers are good at handling extended literals but poor at handling alternative literals. Consequently I have never bothered to reduce compound symbols.

Scanners are often written as a co-routine of the parser.

     __________          __________              __________
    !          !        !          !  <- next?  !          !
    !   Text   !........!  Scanner !            !  Parser  !
    !          !        !          !  token ->  !          !
     ----------          ----------              ----------

However, on a virtual memory machine it is easiest to scan the whole input program and place the output tokens into a byte array mapped onto the work file. Algol programs have no terminator - you will have to trap the end-of-file condition and add a special token to the end of the text. ASCII characters take values 0:127, so that leaves values 128-255 for compound tokens.

A scanner is normally programmed with the aid of a look-up table from 0:127 giving a "class" for each ASCII character:

Class 0 characters are discarded.
Class 1 characters are passed across unchanged.
Classes 2 and up are specials, and control passes to ad hoc code.

The only difficult classes are those for prime, which introduces a keyword, and open curly bracket, which introduces a string. (A sketch of this scan loop is given at the end of this note.)

To deal with a keyword, assemble the characters into an Imp string and look it up in an array. If it is found, enter the appropriate token into the text array (two tokens for items like 'LE', which becomes <=). A fast look-up scheme is important, but it is hardly worth hashing - separate by keyword length and, within those of the same length, order by frequency. Wichmann has frequency tables, but the order is begin & end, if & then, for & do, step & until, with the rest nowhere. If the keyword is not found, push it back character by character into the token array and let the parser fault it. (This look-up is also sketched below.)

Algol strings are strings of visible characters and are passed across complete in the token array, less any "invisibles".

Having written a scanner you will need an expander to reproduce lines with errors in a readable form.
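To make the class-table scan loop concrete, here is a minimal sketch. These notes assume an Imp compiler; the sketch is in C purely for illustration, and the terminator value 255 and the two stubbed-out special routines are assumptions, not the agreed encoding.

    #include <stdio.h>

    /* Character classes: 0 = discard, 1 = pass unchanged, 2 up = specials. */
    enum { DISCARD, PASS, PRIME, CURLY };

    #define TOK_EOF 255   /* assumed terminator token in the 128-255 range */

    static unsigned char class[128];

    static void init_classes(void)
    {
        int c;
        for (c = 0; c < 128; c++) class[c] = PASS;
        class[' '] = class['\t'] = class['\n'] = DISCARD;
        class['\''] = PRIME;   /* prime introduces a keyword        */
        class['{']  = CURLY;   /* open curly bracket opens a string */
    }

    /* Stubs for the ad hoc code - real versions are sketched below. */
    static int do_keyword(FILE *in, unsigned char *out) { (void)in; (void)out; return 0; }
    static int do_string (FILE *in, unsigned char *out) { (void)in; (void)out; return 0; }

    /* Scan the whole input into the token array 'out' and return its
       length.  On a virtual memory machine 'out' is the byte array
       mapped onto the work file. */
    static int scan(FILE *in, unsigned char *out)
    {
        int c, n = 0;
        init_classes();
        while ((c = getc(in)) != EOF) {
            switch (class[c & 127]) {
            case DISCARD: break;                              /* thrown away */
            case PASS:    out[n++] = (unsigned char)c; break; /* unchanged   */
            case PRIME:   n += do_keyword(in, out + n); break;
            case CURLY:   n += do_string (in, out + n); break;
            }
        }
        out[n++] = TOK_EOF;  /* Algol has no terminator: trap EOF, add one */
        return n;
    }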
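The length-then-frequency keyword look-up might be organised as below. The keyword set shown is only a fragment and the token values are invented; items like 'LE' that stand for a compound symbol such as <= would enter two characters rather than one token, and are omitted here for brevity.

    #include <string.h>

    /* Keyword tables separated by length; within a length, ordered by
       frequency (Wichmann: begin & end, if & then, for & do,
       step & until, the rest nowhere).  Token values are illustrative. */
    struct kw { const char *name; int token; };

    static const struct kw len2[] = {{"if",130},{"do",131},{"go",132},{0,0}};
    static const struct kw len3[] = {{"end",129},{"for",133},{"and",134},{0,0}};
    static const struct kw len4[] = {{"then",135},{"else",136},{"step",137},
                                     {"goto",138},{"true",139},{0,0}};
    static const struct kw len5[] = {{"begin",128},{"until",140},{"while",141},
                                     {"false",142},{0,0}};
    static const struct kw *const bylen[] = {0, 0, len2, len3, len4, len5};

    /* Return the token for 'name' (of length 'len'), or -1 if unknown,
       in which case the caller pushes the characters back into the
       token array and lets the parser fault it. */
    static int lookup(const char *name, int len)
    {
        const struct kw *p;
        if (len < 2 || len > 5 || (p = bylen[len]) == 0) return -1;
        for (; p->name; p++)
            if (memcmp(p->name, name, len) == 0) return p->token;
        return -1;
    }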
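The expander is essentially the scanner run backwards: walk the token array, print plain ASCII as it stands, and spell out each compound token. A sketch under the same invented token values:

    #include <stdio.h>

    /* Spellings for the compound tokens 128-255.  Only a fragment is
       shown; the table must mirror the keyword tables exactly. */
    static const char *spelling[256];

    static void init_spellings(void)
    {
        spelling[128] = "'begin' ";
        spelling[129] = "'end' ";
        spelling[130] = "'if' ";
        spelling[255] = "";        /* the terminator prints as nothing */
    }

    /* Reproduce tokens[0..n-1] as readable text, e.g. the line
       containing an error, or the whole program when debugging the
       scanner itself. */
    static void expand(const unsigned char *tokens, int n, FILE *out)
    {
        int i;
        for (i = 0; i < n; i++) {
            if (tokens[i] < 128)
                putc(tokens[i], out);              /* plain ASCII: as is  */
            else if (spelling[tokens[i]])
                fputs(spelling[tokens[i]], out);   /* compound: spell out */
        }
    }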
The expander is also useful when debugging the scanner. Don't forget your keyword-to-token correspondence must be agreed with the writers of the parser and grammar-compressing routines.
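One hypothetical way of keeping that agreement honest, shown here as a C header, is to define the token values exactly once and have the scanner, the expander, the parser and the grammar compressor all include the same file:

    /* tokens.h - single agreed definition of the compound token values.
       The names are invented; the point is that every routine shares
       this one file, so the correspondence cannot drift. */
    #ifndef TOKENS_H
    #define TOKENS_H

    enum token {
        TOK_BEGIN = 128,
        TOK_END,          /* 129 */
        TOK_IF,           /* 130 */
        /* ... one entry per keyword and compound symbol ... */
        TOK_EOF = 255     /* terminator added by the scanner at end of file */
    };

    #endif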