$f12=16 s=3.0
$12
$a invert=0; cap=0; capsh=0; pgap=0; und=0
$a left=2; just=1
$a line=130
$b3 $l
HD/CS4/5   February 1984
$b $l1bm
Compiling Techniques -- Representation of Type Information
$b2 $v6 $l1b
Categories of identifier
$p The broadest distinction that can be drawn among identifiers
is one of $%category rather than type.
In the case of Pascal, the following six categories are distinguished
by the Zurich group:
$l0

       constant, variable, field, procedure, function, type
$b
If labels were identifiers in Pascal these would count as a distinct category
too.
$p The category distinctions are those which are often pre-supposed
in definitions of the syntax of the language, with the implication
that identifiers can be classified in these terms at the lexical
processing stage.
However, placing dictionary lookup in the lexical phase is problematic
for many languages,
since it usually implies some degree of feed-back from the syntactic phase to
distinguish the possible contexts of occurrence of identifiers.
Some of the cases which would require to be communicated for Pascal are:
$l0

    (a) declaration contexts:-
           top-level
           as field of record

    (b) semi-declaration contexts:-
           forward reference
           satisfying forward reference

    (c) use contexts:-
           entire
           as field of record (perhaps in presence of WITH)
           restricted scope

    (d) ambiguous contexts:-
           eg after CASE in RECORD format
$p
For these reasons, it is often preferable to stick to a single
encoding for 'identifier' from the lexical phase and adjust
the higher-level recognition routines to make a single
test for 'identifier', invoke the appropriate kind of lookup,
and then distinguish the cases. Depending on the language,
this may involve doing greater or less violence to the syntax
as it is normally presented.
$b2 $l1b
Data-structuring possibilities
$p One level down from the category distinctions come those
which may be called distinctions of structure, which classify
on the basis of properties of different kinds of structure,
or its absence.
$b
There are considered to be the following separate cases in Pascal (though
'subrange' is somewhat anomalous):-
$a tab=10,20,30
$b0$t+ scalar $t+ base  
$b0$t+ subrange $t+ $t+ simple
$b0$t+ pointer
$b0$t+ set
$b0$t+ array $t+ $t+ structured
$b0$t+ record
$b0$t+ file
$p Note that these distinctions, like category distinctions, bear upon issues
which must be regarded as syntactic in the broadest
sense -- whether an identifier may be followed by
an index expression or subfield, for example.
$b2 $l1b
Type distinctions
$p At the lowest level of the classification of identifiers
there are the basic distinctions of $%data-type,
which in Pascal appear as the following sub-classification
of the 'scalar' structure:-
$l0
       integer
       real
       char
       boolean
       <enumerated-types>
$p For simplicity, it is conventional to refer to all the distinctions
described in the preceding paragraphs as aspects of $%type in the
broadest sense.
$b3 $l1b
Internal representation of type information
$p All these distinctions must be represented
in some form in the internal data structures which the compiler
must maintain to keep track of type information and identifier declarations.
As well as representing type distinctions for declared identifiers,
it is also necessary that they can be associated with literals and
with derived compound expression -- or at least the relevant
subset which is applicable to these case.
The design of these structures is an important part of the design
of a compiler for a modern programming language.
$p Early compilers, dealing with languages
which had a few fixed types, could simply assign a small
field in a fixed identifier record to contain a single code.
With languages like Pascal, it is necessary to cope with open-ended
types (like subranges and enumerations) and types of considerable structural
complexity. In the context of a full programming environment,
there is also the consideration that these data structures are not
private to the compiler, since they must be exported in whole
or part for other purposes such as diagnostics and external linkage.
(Obviously, it is not essential
that they are exported in
precisely the same form as they take within the compiler, but
if account is not taken of these requirements in designing
them, unnecessary problems may be created). Finally,
as if dealing with one language were not enough of a problem,
there may be a requirement to fit within a multi-lingual framework,
to permit cross-calling and common diagnostics.
$b2 $l1b
Problems of type representation
$p In general terms, the representations chosen must satisfy the
requirement of being
$l2
           faithful to the language
           convenient for implementation.
Depending on how well-defined the language is, it may be quite
difficult to determine just what the requirements are. An example
is the question whether a variable declared via a type identifier
has to be regarded as belonging to a different type than a variable
declared directly in precisely the same form as the declaration
of the type identifier.
If they are required to be treated as being of different type, a
representation which loses this distinction is ruled out.
$p One guiding principle is to remain close to the source language
declaration -- after all, the relevant information is certainly all
there. But excessive adherence to this principle will lead to
unacceptably inefficient processing.
$p Another problem is that, while it is often the case that the
majority of the typing possibilities in a language can be
covered in a reasonably neat and elegant fashion, there are often
a few anomalous cases which are very difficult to fit in.
Examples are:
         'packed' data structures
          strings, especially flexible length strings
          union or universal types
          indeterminate literal types
$p In this connection, a multi-lingual objective can be a help as
well as constraint, because it not infrequently happens that,
by looking at a requirement more generally than the
form of it found in an individual language strictly needs, we may find a more
elegant solution than one that fits just the cases which occur. This is
particularly likely
where the language involved imposes arbitrary restrictions.
Unfortunately, it sometimes happens that the arbitrary
restrictions are built rather firmly into the definition
of the language, so that it is problematic to implement certain
features correctly using a more general approach.
$p When it comes to convenience and efficiency of implementation,
it may be found that the most straightforward structures to define
are less than ideal in practice. It is necessary to look closely
at the various forms of type compatibility checks which are
required by the language and verify that they can be performed
without excessive complexity and with acceptable efficiency.
These are often in terms of such concepts as:
$l0
            type identity
            structural compatibility
            assignment compatibility
            operation compatibility
$p It is also necessary to analyse the major type diferentiations
which are required to guide the process of compilation with a view to ensuring
that these also are straightforward and that the commonest cases
are efficient. How many pointers have to be followed and tests made to find out
that an operand is an integer? a string? of ordinal type? of arithmetic type?
within a given range?
$p Certainly, when it comes to code generation, the discriminations
which have to be made tend to cut across the type distinctions and
it is likely, at the expense of some redundancy, to have a separate
representation of the machine-level properties of an operand,
in particular its storage size and access mode.