CS4:COMP  LN7

THE LEXICAL ANALYSER

The lexical analyser is that part of the compiler which reads in the original source text and arranges it in a form for the parser. It can be an almost nonexistent entity - just handing over the symbols as read - or it can be very complex and do much of the work of the parser - identifying keywords and building a dictionary of the names.

There are a number of reasons for having a separate lexical analyser:-

1. Handling characters is expensive - very expensive on some architectures. A separate scanner enables one to concentrate on doing the text handling efficiently.

2. The tokens handed over to the parser are potentially much more useful than the raw characters. This is particularly true of reserved-word languages.

3. Some languages come in different hardware representations. By having two or more scanners the parser is isolated from these problems.

4. Certain awkward problems, e.g. the Fortran DO 10 I=1...., may be resolved more easily at the lexical level.

The scanner needed for Algol is of the simpler variety; it will normally:-

- Discard spaces, tabs and newlines.
- Reduce the keywords to a single token.
- Resolve certain compound symbols, e.g. "(/" -> "[" and "**" -> "^".
- Close "open" strings.

It will not do some things commonly associated with a scanner:-

- Discard comments - Algol comments are difficult to recognise lexically.
- Build the names into a dictionary - Algol dictionaries are block related.

It is a very close decision whether the scanner should reduce two-character symbols like ":=" to a single token. However, top-down parsers are good at handling extended literals but poor at handling alternative literals. Consequently I have never bothered to reduce compound symbols.

Scanners are often written as a co-routine of the parser.

     __________          __________              __________
    !          !        !          !  <- next?  !          !
    !   Text   !........!  Scanner !            !  Parser  !
    !          !        !          !  token ->  !          !
     ----------          ----------              ----------

However, on a virtual memory machine it is easiest to scan the whole input program and place the output tokens into a byte array mapped onto the work file. Algol programs have no terminator - you will have to trap the end-of-file condition and add a special token to the end of the text. ASCII characters take values 0:127, so that leaves values 128-255 for compound tokens.

A scanner is normally programmed with the aid of a look-up table from 0:127 giving a "class" for each ASCII character:

Class 0 characters are discarded.
Class 1 characters are passed across unchanged.
Classes 2 and up are specials, and control passes to ad hoc code.

The only difficult classes are those for prime, which introduces a keyword, and open curly bracket, which introduces a string. (A sketch of this scan loop is given at the end of this note.)

To deal with a keyword, assemble the characters into an Imp string and look it up in an array. If it is found, enter the appropriate token into the text array (two tokens for items like 'LE', which becomes <=). A fast look-up scheme is important, but it is hardly worth hashing - separate by keyword length and, within those of the same length, order by frequency. Wichmann has frequency tables, but the order is begin & end, if & then, for & do, step & until, with the rest nowhere. If the keyword is not found, push it back character by character into the token array and let the parser fault it. (This look-up is also sketched below.)

Algol strings are strings of visible characters and are passed across complete in the token array, less any "invisibles".

Having written a scanner you will need an expander to reproduce lines with errors in a readable form.
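To make the class-table scan loop concrete, here is a minimal sketch. These notes assume an Imp compiler; the sketch is in C purely for illustration, and the terminator value 255 and the two stubbed-out special routines are assumptions, not the agreed encoding.

    #include <stdio.h>

    /* Character classes: 0 = discard, 1 = pass unchanged, 2 up = specials. */
    enum { DISCARD, PASS, PRIME, CURLY };

    #define TOK_EOF 255   /* assumed terminator token in the 128-255 range */

    static unsigned char class[128];

    static void init_classes(void)
    {
        int c;
        for (c = 0; c < 128; c++) class[c] = PASS;
        class[' '] = class['\t'] = class['\n'] = DISCARD;
        class['\''] = PRIME;   /* prime introduces a keyword        */
        class['{']  = CURLY;   /* open curly bracket opens a string */
    }

    /* Stubs for the ad hoc code - real versions are sketched below. */
    static int do_keyword(FILE *in, unsigned char *out) { (void)in; (void)out; return 0; }
    static int do_string (FILE *in, unsigned char *out) { (void)in; (void)out; return 0; }

    /* Scan the whole input into the token array 'out' and return its
       length.  On a virtual memory machine 'out' is the byte array
       mapped onto the work file. */
    static int scan(FILE *in, unsigned char *out)
    {
        int c, n = 0;
        init_classes();
        while ((c = getc(in)) != EOF) {
            switch (class[c & 127]) {
            case DISCARD: break;                              /* thrown away */
            case PASS:    out[n++] = (unsigned char)c; break; /* unchanged   */
            case PRIME:   n += do_keyword(in, out + n); break;
            case CURLY:   n += do_string (in, out + n); break;
            }
        }
        out[n++] = TOK_EOF;  /* Algol has no terminator: trap EOF, add one */
        return n;
    }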
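The length-then-frequency keyword look-up might be organised as below. The keyword set shown is only a fragment and the token values are invented; items like 'LE' that stand for a compound symbol such as <= would enter two characters rather than one token, and are omitted here for brevity.

    #include <string.h>

    /* Keyword tables separated by length; within a length, ordered by
       frequency (Wichmann: begin & end, if & then, for & do,
       step & until, the rest nowhere).  Token values are illustrative. */
    struct kw { const char *name; int token; };

    static const struct kw len2[] = {{"if",130},{"do",131},{"go",132},{0,0}};
    static const struct kw len3[] = {{"end",129},{"for",133},{"and",134},{0,0}};
    static const struct kw len4[] = {{"then",135},{"else",136},{"step",137},
                                     {"goto",138},{"true",139},{0,0}};
    static const struct kw len5[] = {{"begin",128},{"until",140},{"while",141},
                                     {"false",142},{0,0}};
    static const struct kw *const bylen[] = {0, 0, len2, len3, len4, len5};

    /* Return the token for 'name' (of length 'len'), or -1 if unknown,
       in which case the caller pushes the characters back into the
       token array and lets the parser fault it. */
    static int lookup(const char *name, int len)
    {
        const struct kw *p;
        if (len < 2 || len > 5 || (p = bylen[len]) == 0) return -1;
        for (; p->name; p++)
            if (memcmp(p->name, name, len) == 0) return p->token;
        return -1;
    }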
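The expander is essentially the scanner run backwards: walk the token array, print plain ASCII as it stands, and spell out each compound token. A sketch under the same invented token values:

    #include <stdio.h>

    /* Spellings for the compound tokens 128-255.  Only a fragment is
       shown; the table must mirror the keyword tables exactly. */
    static const char *spelling[256];

    static void init_spellings(void)
    {
        spelling[128] = "'begin' ";
        spelling[129] = "'end' ";
        spelling[130] = "'if' ";
        spelling[255] = "";        /* the terminator prints as nothing */
    }

    /* Reproduce tokens[0..n-1] as readable text, e.g. the line
       containing an error, or the whole program when debugging the
       scanner itself. */
    static void expand(const unsigned char *tokens, int n, FILE *out)
    {
        int i;
        for (i = 0; i < n; i++) {
            if (tokens[i] < 128)
                putc(tokens[i], out);              /* plain ASCII: as is  */
            else if (spelling[tokens[i]])
                fputs(spelling[tokens[i]], out);   /* compound: spell out */
        }
    }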
The expander is also useful when debugging the scanner. Don't forget your keyword-to-token correspondence must be agreed with the writers of the parser and grammar-compressing routines.
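One hypothetical way of keeping that agreement honest, shown here as a C header, is to define the token values exactly once and have the scanner, the expander, the parser and the grammar compressor all include the same file:

    /* tokens.h - single agreed definition of the compound token values.
       The names are invented; the point is that every routine shares
       this one file, so the correspondence cannot drift. */
    #ifndef TOKENS_H
    #define TOKENS_H

    enum token {
        TOK_BEGIN = 128,
        TOK_END,          /* 129 */
        TOK_IF,           /* 130 */
        /* ... one entry per keyword and compound symbol ... */
        TOK_EOF = 255     /* terminator added by the scanner at end of file */
    };

    #endif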