$f12=16 s=3.0 $12 $a invert=0; cap=0; capsh=0; pgap=0; und=0 $a left=2; just=1 $a line=130 $b3 $l HD/CS4/5 February 1984 $b $l1bm Compiling Techniques -- Representation of Type Information $b2 $v6 $l1b Categories of identifier $p The broadest distinction that can be drawn among identifiers is one of $%category rather than type. In the case of Pascal, the following six categories are distinguished by the Zurich group: $l0 constant, variable, field, procedure, function, type $b If labels were identifiers in Pascal these would count as a distinct category too. $p The category distinctions are those which are often pre-supposed in definitions of the syntax of the language, with the implication that identifiers can be classified in these terms at the lexical processing stage. However, placing dictionary lookup in the lexical phase is problematic for many languages, since it usually implies some degree of feed-back from the syntactic phase to distinguish the possible contexts of occurrence of identifiers. Some of the cases which would require to be communicated for Pascal are: $l0 (a) declaration contexts:- top-level as field of record (b) semi-declaration contexts:- forward reference satisfying forward reference (c) use contexts:- entire as field of record (perhaps in presence of WITH) restricted scope (d) ambiguous contexts:- eg after CASE in RECORD format $p For these reasons, it is often preferable to stick to a single encoding for 'identifier' from the lexical phase and adjust the higher-level recognition routines to make a single test for 'identifier', invoke the appropriate kind of lookup, and then distinguish the cases. Depending on the language, this may involve doing greater or less violence to the syntax as it is normally presented. $b2 $l1b Data-structuring possibilities $p One level down from the category distinctions come those which may be called distinctions of structure, which classify on the basis of properties of different kinds of structure, or its absence. $b There are considered to be the following separate cases in Pascal (though 'subrange' is somewhat anomalous):- $a tab=10,20,30 $b0$t+ scalar $t+ base $b0$t+ subrange $t+ $t+ simple $b0$t+ pointer $b0$t+ set $b0$t+ array $t+ $t+ structured $b0$t+ record $b0$t+ file $p Note that these distinctions, like category distinctions, bear upon issues which must be regarded as syntactic in the broadest sense -- whether an identifier may be followed by an index expression or subfield, for example. $b2 $l1b Type distinctions $p At the lowest level of the classification of identifiers there are the basic distinctions of $%data-type, which in Pascal appear as the following sub-classification of the 'scalar' structure:- $l0 integer real char boolean $p For simplicity, it is conventional to refer to all the distinctions described in the preceding paragraphs as aspects of $%type in the broadest sense. $b3 $l1b Internal representation of type information $p All these distinctions must be represented in some form in the internal data structures which the compiler must maintain to keep track of type information and identifier declarations. As well as representing type distinctions for declared identifiers, it is also necessary that they can be associated with literals and with derived compound expression -- or at least the relevant subset which is applicable to these case. The design of these structures is an important part of the design of a compiler for a modern programming language. $p Early compilers, dealing with languages which had a few fixed types, could simply assign a small field in a fixed identifier record to contain a single code. With languages like Pascal, it is necessary to cope with open-ended types (like subranges and enumerations) and types of considerable structural complexity. In the context of a full programming environment, there is also the consideration that these data structures are not private to the compiler, since they must be exported in whole or part for other purposes such as diagnostics and external linkage. (Obviously, it is not essential that they are exported in precisely the same form as they take within the compiler, but if account is not taken of these requirements in designing them, unnecessary problems may be created). Finally, as if dealing with one language were not enough of a problem, there may be a requirement to fit within a multi-lingual framework, to permit cross-calling and common diagnostics. $b2 $l1b Problems of type representation $p In general terms, the representations chosen must satisfy the requirement of being $l2 faithful to the language convenient for implementation. Depending on how well-defined the language is, it may be quite difficult to determine just what the requirements are. An example is the question whether a variable declared via a type identifier has to be regarded as belonging to a different type than a variable declared directly in precisely the same form as the declaration of the type identifier. If they are required to be treated as being of different type, a representation which loses this distinction is ruled out. $p One guiding principle is to remain close to the source language declaration -- after all, the relevant information is certainly all there. But excessive adherence to this principle will lead to unacceptably inefficient processing. $p Another problem is that, while it is often the case that the majority of the typing possibilities in a language can be covered in a reasonably neat and elegant fashion, there are often a few anomalous cases which are very difficult to fit in. Examples are: 'packed' data structures strings, especially flexible length strings union or universal types indeterminate literal types $p In this connection, a multi-lingual objective can be a help as well as constraint, because it not infrequently happens that, by looking at a requirement more generally than the form of it found in an individual language strictly needs, we may find a more elegant solution than one that fits just the cases which occur. This is particularly likely where the language involved imposes arbitrary restrictions. Unfortunately, it sometimes happens that the arbitrary restrictions are built rather firmly into the definition of the language, so that it is problematic to implement certain features correctly using a more general approach. $p When it comes to convenience and efficiency of implementation, it may be found that the most straightforward structures to define are less than ideal in practice. It is necessary to look closely at the various forms of type compatibility checks which are required by the language and verify that they can be performed without excessive complexity and with acceptable efficiency. These are often in terms of such concepts as: $l0 type identity structural compatibility assignment compatibility operation compatibility $p It is also necessary to analyse the major type diferentiations which are required to guide the process of compilation with a view to ensuring that these also are straightforward and that the commonest cases are efficient. How many pointers have to be followed and tests made to find out that an operand is an integer? a string? of ordinal type? of arithmetic type? within a given range? $p Certainly, when it comes to code generation, the discriminations which have to be made tend to cut across the type distinctions and it is likely, at the expense of some redundancy, to have a separate representation of the machine-level properties of an operand, in particular its storage size and access mode.