The production of Optimised Machine-Code
for High-Level Languages using
Machine-Independent Intermediate Codes.
Peter Salkeld Robertson
Ph. D.
University of Edinburgh
1981
(Graduation date July 1981)
ABSTRACT
The aim of this work was to investigate the problems
associated with using machine-independent intermediate codes
in the translation from a high-level language into machine
code, with emphasis on minimising code size and providing
good run-time diagnostic capabilities.
The main result was a machine-independent intermediate
code, I-code, which has been used successfully to develop
optimising and diagnostic compilers for the IMP77 language
on a large number of different computer systems. In
addition, the work has been used to lay the foundations for
a project to develop an intermediate code for portable
SIMULA compilers.
The major conclusions of the research were that carefully
designed machine-independent intermediate codes can be used
to generate viable optimising and diagnostic compilers, and
that the commonality introduced into different code
generators processing the code for different machines
simplifies the tasks of creating new compilers and
maintaining old ones.
1. Introduction
Compilers for high-level languages form a significant
part of most computer systems, and with an ever increasing
number and variety of machine architectures on the market
the problems of compiler development, testing, and
maintenance consume more and more manpower and computer
time. Moreover, as computer technology is improving and
changing rapidly it is becoming evident that software costs
will increasingly dominate the total cost of a system.
Indeed, it may not be long before the lifetime of software
regularly exceeds that of the hardware on which it was
originally implemented, a state of affairs quite different
from that envisaged by Halpern when he concluded that "the
importance of the entire question of machine-independence is
diminishing .." [Halpern, 1965]. In addition, there is a
need to encourage the slowly-developing trend to write the
majority of software in high-level languages. Even though
the advantages of such an approach are many, a large number
of users still have a love of machine-code, usually fostered
by thoughts of "machine efficiency". Clearly, techniques
must be developed to simplify the production of usable
compilers which can "optimise" the match between the
executing program and the user's requirements, be they for
fast execution, small program size, reasonable execution
time but with good run-time diagnostics, or whatever.
One popular method for reducing the complexity of a
compiler is to partition it into two major phases: one
language-dependent and the other machine-dependent. The
idea is that the language-dependent phase inputs the source
program and deals with all the syntactic niceties of the
language, finally generating a new representation of the
program, an intermediate code. This is then input by a
second phase which uses it to generate machine-code for the
target computer. In this way it should be possible to
produce a compiler to generate code for a different machine
by taking the existing first phase and writing a new second
phase. This ability to move a large portion of the compiler
from machine to machine has led to such compilers being
referred to as "portable compilers" even though the term is
perhaps misleading, as only part of the complete compiler
can be moved without change. In practice many existing
compilers generate intermediate representations of the
program which are passed around within the compiler, for
example the "analysis records" produced by the syntactic
phase of compilation, but for the purposes of this work it
is only when these representations are machine-independent
and are made available outwith the compiler that they will
be termed intermediate codes.
Much of the emphasis in designing intermediate codes has
been on enabling a compiler to be bootstrapped quickly onto
a new machine - either by interpreting the intermediate
code, or by using a macro generator to expand it into
machine-code [Brown, 1977]. Once this has been done the
intention is that the quality of the code so produced can be
improved at leisure. While this approach has been very
successful and relatively error-free, it has been the
experience of several implementors that it is difficult to
adapt the scheme to produce highly optimised code [Russell,
1974]; apparently considerations of portability and
machine-independence have caused the problems of
optimisation to be overlooked. The aspect of
intermediate-code design which has received most debate
concerns the level of the code: low-level with a fairly
simple code-generator, or high-level with a more complex
code-generator [Brown, 1972].
This thesis attempts to put machine-independence and
optimisation on an equal footing, and describes the use of
an intermediate code which takes a novel view of the
process. Instead of the intermediate code describing the
computation to be performed, it describes the operation of a
code-generator which will produce a program to perform the
required computation. This effectively adds an extra level
of indirection into the compilation, weakening any linkage
between the form of the intermediate code and the object
code required for a particular implementation.
In essence I-code attempts to describe the results required
in a way which does not constrain the method of achieving
those results.
In particular it should be noted that the code described,
I-code, was designed specifically for the language IMP-77, a
systems implementation language which contains many of the
constructions which pose problems for optimisation
[Robertson, 1979]. It in no way attempts to be a
"universal" intermediate code. Notwithstanding, the code,
with a small number of minor extensions to cover non-IMP
features, has been used successfully in an ALGOL 60 compiler
and is currently proving viable in projects for writing
Pascal and Fortran 77 compilers.
The intermediate code as finally designed is completely
machine independent, except inasmuch as the source program
it describes is machine dependent, demonstrating that the
problems may not be as intractable as thought by Branquart
et al. who state that "clearly complete machine independency
is never reached" [Branquart, 1973].
In addition to the problems of machine independence there
is also the question of operating system independence, as
nowadays it is common for machines to have several systems
available. For this reason the task of producing a compiler
is far from finished when it can generate machine code
[Richards, 1977]. To simplify the generation of versions of
a compiler for different operating systems a third phase of
compilation was added, although it soon became clear that
the extra phase could be used for other purposes as well, as
will be shown later (section 4).
Throughout the text examples are given of the code
produced by compilers written to demonstrate the power of
the intermediate code. The examples of the intermediate
code are couched in terms of mnemonics for the various code
items, although the production compilers use a compacted
representation. The code and its representations are
described in Appendix A1 and Appendix A2.
In the examples of code generated for various
constructions it should be appreciated that the exact
instructions and machine features used will depend very much
on the context in which the code is produced, and so only
typical code sequences can be given.
The machines for which code is demonstrated are indicated
by the following abbreviations in parentheses:
(Nova) Data General NOVA
(PDP10) Digital Equipment Corporation PDP10
(PDP11) Digital Equipment Corporation PDP11
(VAX) Digital Equipment Corporation VAX 11/780
(GEC4080) General Electric Company 4080
(ICL2900) International Computers Limited 2900
(4/75) International Computers Limited 4/75
(7/16) Interdata 7/16
(7/32) Interdata 7/32
(PE3200) Perkin Elmer 3200
2 Intermediate codes
This section gives a brief account of the more important
intermediate codes which have been discussed and have had an
influence on the design of I-code.
2.1 Uncol
UNCOL, UNiversal Computer Orientated Language, [Mock,
1958], was an early attempt to specify a means for solving
the M*N problem of producing compilers for M languages to
run on N machines. It was proposed that an intermediate
language, UNCOL, be defined which would be able to express
the constructs from any language, and which could itself be
translated into code for any machine, resulting in the need
for only M+N compilers. Indeed it was even suggested that
programs would be written directly in UNCOL rather than in
machine code.
These ideas were very ambitious, but were presented without
any concrete examples of what UNCOL might look like.
Proposals were made for an UNCOL in [Steel, 1961] but the
work was abandoned before anything like a complete
specification had been produced.
An UNCOL-like technique which has been used extensively
is to compile for a known type of machine, such as the IBM
360, and then emulate that machine on the target machine.
Unfortunately, to give this any chance of being efficient,
microcode support will be necessary, and this is rarely
available to compiler writers.
2.2 Janus
The first attempt at generating an UNCOL which seems to
have been at least partially successful was JANUS [Coleman,
1974]. The approach was effectively to enumerate all the
mechanisms found in current programming languages and the
techniques used to implement them. From this large list was
defined a set of primitive data-types and operations upon
them. These primitives were then put together to model the
objects in the source language. Once JANUS code had been
produced the intention was that it would either be
interpreted or compiled into machine code by a macro
generator.
2.3 OCODE
Of all the languages which claim to be portable, perhaps
the most successful has been BCPL [Richards, 1971]. The
BCPL compiler generates the intermediate code OCODE which
can either be interpreted or translated into machine code
for direct execution. As BCPL is a fairly low-level
language with only one data type, the word, many of the
difficulties in designing intermediate codes do not arise.
This means that the code can be pitched at a low level and
be "semantically weak" without compromising the efficiency
of the compiled code to any great extent.
The OCODE machine works by manipulating single-word objects
held on a stack, into which there are several pointers.
e.g. R(1, 2, 3)
STACK 3 adjust the top of stack to leave two
cells free for linkage information.
LN 1 stack the constant 1.
LN 2 stack the constant 2.
LN 3 stack the constant 3.
LL L6 stack the address of label L6 (the entry
to the routine).
RTAP 5 enter the procedure adjusting the stack
frame pointer by 5 locations.
.....
.....
.....
ENTRY 1 L6 'R'
entry point for the routine R.
SAVE 5 set the top of stack pointer to be 5
locations from the stack frame pointer.
.....
RTRN return.
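The flavour of this stack machine can be conveyed by a few
lines of Python. The following is a minimal sketch, not a
definition of OCODE: LN follows the example above, while LP,
SP and PLUS are assumed mnemonics for local-load, local-store
and addition, and the linkage operations (STACK, LL, RTAP,
SAVE, RTRN) are elided.
    # A toy interpreter for a handful of OCODE-like operations.
    # All values are single words, as in BCPL; "frame" models
    # the cells of the current stack frame.
    def interpret(code, frame_size=16):
        stack, frame = [], [0] * frame_size
        for op, *args in code:
            if op == "LN":               # stack a constant
                stack.append(args[0])
            elif op == "LP":             # stack a local cell
                stack.append(frame[args[0]])
            elif op == "SP":             # pop into a local cell
                frame[args[0]] = stack.pop()
            elif op == "PLUS":           # pop two words, push the sum
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
        return frame
    # e.g. "LN 2; LN 3; PLUS; SP 4" leaves 5 in local cell 4:
    interpret([("LN", 2), ("LN", 3), ("PLUS",), ("SP", 4)])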
2.4 P-code
P-code is the intermediate code used by the PASCAL<P>
compiler [Nori, 1976; Jensen, 1976] and was designed with
the aim of porting PASCAL quickly by means of an
interpreter. In this respect it has been very successful,
especially on microprocessor-based systems. The code is
similar to OCODE but has a greater range of instructions to
handle objects of differing types.
procedure ERROR(VAL:INTEGER); begin
0: ENT 4
TOTAL := TOTAL+1;
1: LDO 138
2: LDCI 1
3: ADDI
4: SRO 138
if INDEX >= 9 then begin
5: LDO 139
6: LDCI 9
7: GEQI
8: FJP 17
LIST[10].NUM := 255
9: LAO 140
10: LDCI 10
11: DEC 1
12: IXA 2
13: INC 1
14: LDCI 255
15: STO
end else begin
16: UJP 28
INDEX := INDEX+1;
17: LDO 139
18: LDCI 1
19: ADDI
20: SRO 139
LIST[INDEX].NUM := VAL
21: LAO 140
22: LDO 139
23: DEC 1
24: IXA 2
25: INC 1
26: LOD 0, 4
27: STO
end;
end;
28: RETP
2.5 Z-code
Z-code [Bourne, 1975] is the intermediate code produced
by the ALGOL68C compiler, the main feature of which is the
ability for the user to parameterise the first phase to
modify the Z-code to suit the target machine, an idea
previously investigated in SLANG [Sibley, 1961]. A set of
eight registers is assumed by the code and others may be
specified explicitly for each transfer. The memory with
which the code works is assumed to be "a linear data store
that is addressed by consecutive integers", addresses taking
the form of base+displacement pairs. Intermingled with the
instructions are directives which control the translation of
the code into machine orders. Two of these directives are
used to divide the code into "basic blocks" or
"straight-line segments", and describe the usage of
registers on entry to and exit from the blocks, although
little use seems to be made of them at present.
As an example here is the Z-code generated by the PDP10
version of the compiler [Gardner, 1977]:
int X := 2, Y := 3, Z := 2
1: F000 10 0 +2 load 2
F040 10 6 +144 store in X
F000 10 0 +3 load 3
F040 10 6 +145 store in Y
5: F000 10 0 +2 load 2
F040 10 6 +146 store in Z
proc P = (int A, B) int: begin
7: S715 p*Z
T246 677 712
if A > B
9: R0
10: F020 10 5 +4 load A
11: F022 10 5 +5 subtract B
12: F113 10 0 P713 ->L713 if <=
then A
13: F020 10 5 +4 load A
else B
14: H116 0 P714 ->L714
15: L713
16: F020 10 5 +5 load B
fi
17: L714
18: R1 10 1
end
19: R1 10 1
20: T247 667 712 end of P
2.6 Summary and conclusions
2.6.1 Error checking and reporting
The UNCOL approach of having one code for all languages
and machines may well simplify the generation of some sort
of compiler, but has the major disadvantage that the
optimisation of error checking and the reporting of run-time errors
cannot be left to the code generator - many errors are
language-dependent and the code generator cannot know how to
handle all of them. Instead the checks must be programmed
into the intermediate representation explicitly. As will be
shown later (5.3) this can inhibit several very powerful and
effective optimisations. Sadly, this problem can result in
the absence of all but the most trivial of run-time checks
in the compiled code.
Even when checking is provided in the intermediate code,
as in the case of P-code with its CHK instruction for range
testing, it is rare for the code to contain enough
information to permit the error to be reported in source
program terms: line numbers, procedure names, variable names
and values, etc. As an example, many P-code interpreters
locate run-time errors in terms of 'P-code instruction
addresses' which are of negligible benefit to most users.
2.6.2 Efficiency
Commonly, little attention is paid to questions of
run-time efficiency in the generation of intermediate code.
An exception to this is Z-code which is parameterised in
order that the match between the code and the target machine
can be improved. In particular, the machine-independent
phase is intended to perform simple register optimisation,
although as the example in 2.5 shows, the insistence on
using the same register will minimise any gains from
remembering register contents. However, this is probably
just a failure on the part of the current compilers and
could be corrected at a later date. Unfortunately, the fact
that the compiler purports to optimise the intermediate code
inhibits the code generator from attempting any but the most
trivial peephole optimisations. This may be seen in the
example by considering instructions 10 - 12. On many
machines the subtract operation is not good for value
comparison as firstly it may fail with overflow, and
secondly it corrupts a register. A better implementation
would be to replace the subtract with a suitable COMPARE,
leaving the values untouched and available for later use.
This cannot be done by the code generator as it cannot know
that the intermediate code does not go on to use the result
of the subtraction later. Similarly, if Z-code had chosen
to use a COMPARE instruction in the first place, a machine
without a compare would have to work hard to make sure all
registers involved in the necessary subtract were restored.
Many optimisations have been attempted [Aho, 1974; Lowry,
1969; Wichmann, 1977]. However, most fall into one of the
following four groups: Universal, Local, Global and Source.
3.1.1 Universal Optimisations
These are those optimisations which are
independent of any particular program, but which
depend on the complete environment in which the
program is to be compiled or executed. They are the
policy decisions taken by the compiler writer during
the initial design of the compiler, and include such
things as the fixed use of registers (stack
pointers, code pointers, link registers etc), the
representations of language-defined objects (arrays,
records, strings etc), and the standards for
communication with external objects.
In addition, universal optimisation must take
into account such questions as:
i Compilation speed or execution speed?
If the compiler is to be used in an
environment where programs are compiled
roughly as often as they are executed, such
as in a teaching environment, execution
speed can be sacrificed for an increase in
compilation speed.
ii Diagnostics?
If the compiler is to produce code which
will provide extensive checking and will
give diagnostic information in the event of
program failure, allowance must be made for
the efficient checking of the program's
behaviour and the maintenance of the
recovery information used by the
diagnostics. If highly optimised code is
required these constraints may not apply.
In the current state of the art universal
optimisation is done by experience and guesswork;
attempts at producing compiler-compilers which can
approach the quality of hand-constructed compilers
have not met with great success [Brooker, 1967;
Feldman, 1966; Trout, 1967]. As will be shown later
(4.5), minor changes in the universal optimisation
can result in major changes in the performance of
the generated code, and so rules made at this stage
should be as flexible as possible to permit changes
to be made in the light of experience.
From the point of view of measurement universal
optimisation provides the base level from which
other optimisations are investigated. Roughly, the
better the universal optimisation the less effective
the other optimisations appear to be.
For example:
----------------
| X = Y |
| if X = 0 start |
-----------------
on the PDP11 would generate:
------------
| MOV Y,X |
| BNE $1 | remembering that the previous
| | line sets the condition code.
------------
The most powerful of the remembering
optimisations is that whereby the correspondence
between values in registers and values in store is
remembered and references to the store value are
replaced by references to the register's value,
register operations usually being smaller and faster
than their store equivalents. Unfortunately there
are several cases where this leads to worse code
than the "obvious" version. For example, on the
(PE3200) the code on the right is larger and slower
than that on the left:
------------
| X = 2 |
| P = P<<2 |
-----------
----------- --------------
| LIS 3,2 | LIS 3,2 | pick up 2
| ST 3,X | ST 3,X | store it in X
| L 1,P | L 1,P | pick up P
| SLLS 1,2 | SLL 1,0(3) | shift it by 2
| ST 1,P | ST 1,P | store it in P
----------- --------------
In addition to keeping track of the changes in
the state of the machine from the compilation of one
statement to another, remembering also includes
preserving this state or environment for later use
when a label is encountered, either by merging the
current environment with the environment saved at
the jump to the label, or simply by restoring that
latter environment when it is not possible for
control to "fall through" from the statements
immediately preceding the label.
In all forms of remembering it is vital to be
able to keep the information up-to-date,
invalidating knowledge when it becomes false, a
process which is exacerbated when it is possible for
an object to be known by two or more apparently
different descriptions as in the following code:
---------------------------
| integer J, K |
| integerarray A(1:12) |
| integername P |
| P == J |
| J = 1; K = 1 |
---------------------------
At this point P and J refer to the same location as
do A(J) and A(K).
Except in the most simple of cases all that can be
done is to assume the worst and forget anything
potentially dangerous after writing to unknown
addresses.
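One way of organising this bookkeeping is sketched below in
Python. This is a minimal sketch under stated assumptions:
the environment is a plain mapping from store locations to
registers, and any store through an address which cannot be
fully determined (the dangerous cases above) is handled by
forgetting the whole environment.
    # "Remembering": which register, if any, currently holds
    # the value of each store location.
    class Environment:
        def __init__(self):
            self.in_reg = {}             # location -> register
        def load(self, loc, alloc_reg):
            r = self.in_reg.get(loc)
            if r is not None:
                return r, []             # already held: emit no code
            r = alloc_reg()
            self.in_reg[loc] = r
            return r, ["L %s,%s" % (r, loc)]
        def store(self, reg, loc):
            self.in_reg[loc] = reg       # the value remains in reg
            return ["ST %s,%s" % (reg, loc)]
        def write_unknown(self):
            self.in_reg.clear()          # assume the worst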
By delaying the first store into T until after
the conditional statement, and delaying the second
store into T until after the label, both
instructions can be combined, resulting in the code:
-------------------
| L 3,X |
| L 0,0(3) |
| BGE $1 |
| SR 0,0 |
| $1:ST 0,T |
| LIS 2,1 |
| ST 2,0(3) |
| LR 1,0 |
| {return} |
-------------------
This store itself can now be delayed until the
return from the function, at which point, as T is
local to the function and will be destroyed, the
instruction can be deleted altogether.
Section 3.2 gives a description of one way in which
this sort of optimisation has been achieved.
3.1.2.3 Inaccessible code removal
In several cases compilers can generate code
which will never be executed.
The common causes of this are either user-specified
conditions whose truth is constant and are used to
achieve some sort of "conditional compilation", or
structural statements following unconditional jumps
as below:
----------------------
| if X = 0 start |
| ERROR(1) |
| return |
| else |
| V = V-X |
| finish |
----------------------
Here the branch instruction usually generated at the
end of the if clause to take control past the else
clause can never be executed.
Such inaccessible code can be eliminated to shorten
the program, though without directly affecting its
speed of execution.
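For a code generator which keeps its output as a list of
symbolic instructions the elimination is straightforward.
The sketch below is illustrative only; it assumes labels end
with a colon and that BR, JMP and RTS are the only
unconditional transfers of control.
    # Delete instructions that can never be reached: anything
    # after an unconditional transfer, up to the next label.
    def remove_inaccessible(code):
        out, dead = [], False
        for instr in code:
            if instr.endswith(":"):      # control may arrive at a label
                dead = False
            if not dead:
                out.append(instr)
            if instr.split()[0] in ("BR", "JMP", "RTS"):
                dead = True              # nothing below can be reached
        return out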
3.1.2.4 Peephole optimisations
Peephole optimisation [McKeeman, 1965] is the
technique of examining small sections of generated
code to make fairly obvious, but ad hoc,
improvements. Many of the gains from the
optimisation come from simplifying code sequences
created by the juxtaposition of code areas which
were produced separately.
For example (PE3200):
Before After
---------------- -------------------
| ST 4,X | ST 4,X |
| L 4,X | |
|- -|- -|
| AR 1,2 | |
| AHI 1,48 | AHI 1,48(2) |
---------------- -------------------
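Such a pass can be expressed very simply over symbolic code.
The sketch below implements only the first of the two
improvements shown (a load from a location just stored, into
the same register, is redundant); a practical optimiser would
examine a window of several instructions and repeat until no
rule applies.
    # A one-rule peephole optimiser: "ST r,X" immediately
    # followed by "L r,X" makes the load redundant.
    def peephole(code):
        out = []
        for instr in code:
            if out and out[-1].startswith("ST ") \
                   and instr == "L " + out[-1][3:]:
                continue                 # drop the redundant load
            out.append(instr)
        return out
    # peephole(["ST 4,X", "L 4,X"]) == ["ST 4,X"]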
3.1.2.5 Special cases
Special-case optimisations are those which make
use of the particular structure and features of the
target machine to control the way in which certain
statements are implemented.
For example:
Obvious Optimised
---------------- -----------
(PDP11) | MOV #0,X | CLR X | X = 0
|- -|- -|
(PDP11) | ADD #1,X | INC X | X = X+1
|- -|- -|
(PE3200) | LHI 1,NULL | SR 0,0 | S = ""
| LHI 2,S | STB 0,S |
| BAL 15,MOVE | |
----------------------------
These optimisations are very similar to peephole
optimisations but are distinguished because they
actively control the generation of code rather than
passively alter code which has already been
produced. In particular they avoid one of the
drawbacks of peephole optimisation, namely that even
though it can reduce fairly complex instruction
sequences to a simpler form, the side-effects of
generating the long form in the first place often
degrade the code. In the example of setting a
string variable to the null string above the
optimised form uses only one register, the value of
which can be remembered. In the non-optimised
version three registers are immediately altered and
the knowledge of the contents of all of the
registers may need to be forgotten unless the code
generator knows how the MOVE routine works and can
forget only those registers which it uses.
3.1.2.6 Algebraic manipulations
Algebraic optimisations are improvements brought
about by using the algebraic properties of operators
and operands, and include:
. Folding, or compile-time evaluation
1+2 is replaced by 3
. Removal of null operations
A+0 is replaced by A
. Using commutative properties
-B+A is replaced by A-B
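These manipulations can be sketched as a rewriting pass over
an expression tree. The representation below is an assumption
for illustration: a node is an integer constant, a name, or a
tuple (operator, left, right).
    # Fold constants and remove null operations.
    def simplify(node):
        if isinstance(node, tuple):
            op, a, b = node
            a, b = simplify(a), simplify(b)
            if isinstance(a, int) and isinstance(b, int):
                return {"+": a + b, "-": a - b,
                        "*": a * b}[op]          # folding
            if op in ("+", "-") and b == 0:
                return a                         # A+0 -> A
            if op == "+" and a == 0:
                return b                         # 0+A -> A
            if op == "*" and b == 1:
                return a                         # A*1 -> A
            node = (op, a, b)
        return node
    # simplify(("+", 1, 2)) == 3
    # simplify(("+", "A", 0)) == "A"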
could be improved by reordering as on the right
(PDP11):
original reordered
------------------ ----------------
| MOV X,R0 | MOV X,R0 |
| BEQ $1 | BEQ $1 |
| JMP $2 | {B} |
| $1: {A} | JMP $2 |
| BR $3 | $1: {A} |
| $2: {B} | $2: |
| $3: | |
------------------ ----------------
3.1.3.2 Merging
3.1.3.2.1 Forward merging
Forward merging, also somewhat confusingly
referred to as "cross jumping" [Wulf, 1975], is the
process whereby the point of convergence of two or
more code sequences is moved back over common
sub-sequences thus removing one of the
sub-sequences, as in the case below.
----------------------
| if X > Y start |
| TEST(X, Y) |
| else |
| TEST(Y, X) |
| finish |
----------------------
obvious code (VAX) after merging
--------------------- --------------------
| CMPL X,Y | CMPL X,Y |
| BLE $1 | BLE $1 |
| PUSHL X | PUSHL X |
| PUSHL Y | PUSHL Y |
| CALLS 2,TEST | |
| BRB $2 | BRB $2 |
| $1:PUSHL Y | $1:PUSHL Y |
| PUSHL X | PUSHL X |
| CALLS 2,TEST | $2:CALLS 2,TEST |
| $2: | |
--------------------- --------------------
The simplest way to perform this optimisation is to
take the code sequence about the point of a label
and a reference to that label, and set two pointers:
one, P1, to the unconditional jump and the other,
P2, to the label. If the instructions immediately
preceding the pointers are identical both pointers
are moved back over that instruction. The label is
redefined at the new position of P2 and the
instruction passed over by P1 is deleted. The
process is repeated until either another label is
found or two different instructions are encountered.
The redefinition of the label involves creating a
completely new label, leaving the old one untouched.
This both prevents trouble with multiple references
to the label and permits the optimisation to be
attempted on them.
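The two-pointer process can be sketched directly in Python.
Below, code is a list of symbolic instructions in which
labels end with a colon; code[p1] is the unconditional jump
and code[p2] is the label it references. The function deletes
the duplicated instructions passed over by P1 and returns the
position at which the new label should be defined;
retargeting the jump to that fresh label is left to the
caller, so that the old label remains untouched.
    # Forward merging ("cross jumping") over symbolic code.
    def cross_jump(code, p1, p2):
        while p1 > 0 and p2 - p1 > 1 \
                and not code[p1 - 1].endswith(":") \
                and not code[p2 - 1].endswith(":") \
                and code[p1 - 1] == code[p2 - 1]:
            del code[p1 - 1]     # delete the copy above the jump
            p1 -= 1              # P1 moves back over it
            p2 -= 2              # P2 moves back one instruction, and
                                 # the deletion shifted it back another
        return p2                # define the new label here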
As this optimisation simply causes the sharing of
execution paths there is no direct gain in execution
speed, but as the code size is reduced an indirect
improvement may be achieved if the shorter code
moves the label close enough to the reference to it
for a shorter and usually faster jump instruction to
be used.
The optimisation obviously must be performed while
labels and jumps are in a symbolic form, that is
before code addresses have been resolved. This
permits the merging of instructions which will
eventually have program-counter relative operands
and consequently be position dependent.
3.1.3.2.2 Backward merging
A second, but much more difficult form of merging
involves moving instructions back over the preceding
branch code which generates the two paths being
considered.
Original (PE3200) Optimised
------------------- -------------------
| | L 2,R |
| L 1,X | L 1,X |
| BNE $1 | BNE $1 |
P1 -> | L 2,R | |
| LIS 3,1 | LIS 3,1 |
| ST 3,A(2) | ST 3,A(2) |
| B $2 | B $2 |
P2 -> | $1:L 2,R | $1: |
| LIS 3,3 | LIS 3,3 |
| ST 3,B(2) | ST 3,B(2) |
| $2: | $2: |
------------------- -------------------
The difficulty with this optimisation is that it
requires the branch and the associated condition
testing code to be treated as a single unit, so that
merged instructions do not split the test and the
use of the result. Also the testing instructions
must be checked to ensure that they are not able to
modify the operands of the merged instructions.
This information is easily available to the
code-generator as in IMP77 only procedure calls and
string resolution can have such side-effects. In a
way similar to the other form of merging, the two
pointers P1 and P2 are set and adjusted: P1 being
moved forward over common code, carrying the branch
sequence with it (L & BNE), and P2 being advanced,
deleting the code it passes over.
3.1.3.4 Factoring
Factoring is the generalisation of merging and
involves the removal of common sections of code.
Included under this heading is the elimination of
common sub-expressions. At the source level this can
be seen in changes such
as:
-----------------------------
| D = SIN(X^2) + COS(X^2) |
-----------------------------
being replaced by
----------------------------
| real T |
| T = X^2 |
| D = SIN(T) + COS(T) |
----------------------------
At the machine level the optimisation is often
available as the result of address arithmetic in the
case of simple arrays:
-----------------
| A(J) = B(J) |
-----------------
Original (PE3200) Optimised
----------------- -----------------
| L 1,J | L 1,J |
| SLLS 1,2 | SLLS 1,2 |
| AR 1,LNB | AR 1,LNB |
| L 3,J | |
| SLLS 3,2 | |
| AR 3,LNB | |
| L 0,B(3) | L 0,B(1) |
| ST 0,A(1) | ST 0,A(1) |
-----------------------------------
In this case as the code-generator is in complete
control the optimisation can be very simple,
although rather specific.
The techniques for handling common sub-expressions
have been investigated at length by several authors,
but measurements indicate that in most programs
expressions are so trivial that the expense in finding
common sub-expressions is not repaid by the
resulting improvement in the generated code [Knuth,
1971].
The more general form of factoring can be seen in
the transformation of the following statements:
---------------------------------------
| if X = 0 then A = 1 else B = 2 |
| if X = 0 then C = 3 else D = 4 |
---------------------------------------
into:
-----------------------
| if X = 0 start |
| A = 1 |
| C = 3 |
| else |
| B = 2 |
| D = 4 |
| finish |
-----------------------
3.1.3.6 Expansion
Expansion is the process of rewriting compact
representations of parts of a program in a more
explicit form, usually resulting in faster execution
but at the expense of more code. The two main uses
of expansion are to reduce the overheads in loop
control by repeating (unrolling) the loop body and
hence reducing the number of iterations, and to
replace calls on procedures by the body of the
procedure, with the necessary substitution for
parameters. Extra gains can come from the
interaction of the expanded code with the enclosing
code as in the following example:
-----------------------------
| for J = 1,1,100 cycle |
| A(J) = 0 |
| repeat |
-----------------------------
This can be expanded into:
-------------------------------
| for J = 2, 2, 100 cycle |
| A(J-1) = 0 |
| A(J) = 0 |
| repeat |
-------------------------------
and can generate the following code (PDP11):
--------------------
| CLR J |
| $1:ADD #2,J |
| MOV J,R1 |
| ADD R1,R1 |
| ADD LNB,R1 |
| CLR A-2(R1) |
| CLR A(R1) |
| CMP J,#100. |
| BNE $1 |
--------------------
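In Python terms the transformation of this example reads as
below; a minimal sketch, assuming as in the example that the
trip count is a multiple of the unrolling factor.
    # Original: one element cleared per iteration.
    def clear(a):
        for j in range(1, 101):          # for J = 1,1,100
            a[j] = 0
    # Unrolled by two: half the loop-control overhead.
    def clear_unrolled(a):
        for j in range(2, 101, 2):       # for J = 2,2,100
            a[j - 1] = 0
            a[j] = 0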
3.1.4 Source optimisations
Source optimisations [Schneck, 1973] are those
optimisations which can be effected by changes in
the source program. They can be sub-divided into
three categories: machine-independent [Hecht, 1973;
Kildall, 1973], machine-dependent, and tentative.
Tentative optimisations are those which, while
unlikely to make the code worse, may improve it on
some machines. For example, most machines will
handle the comparison "X<1" better if it is
rewritten as "X<=0", where X is an integer variable.
With this knowledge the second statement can be
compiled to:
-------------------
| MOV# 1,1,SZR |
| JMP $1 |
| LDA 1,D |
| STA 1,A |
| $1: |
-------------------
Immediately before the label $1 it is known that
once again the value of A is in accumulator 1, and
that the STA above the label is marked as pending as
before. Following the definition of the label the
environment before the jump to that label can be
combined with the environment just before the label
to give the new environment following the label.
The information in this environment is that A is in
accumulator 1 and that the same store is pending
from both old environments. This allows the two
marked store instructions to be removed and one
store placed after the label (and once again marked
as being "pending"). This gives the following code:
---------------------
| LDA 0,B |
| LDA 1,C |
| AND 0,1 |
| MOV# 1,1,SZR |
| JMP $1 |
| LDA 1,D |
| $1:STA 1,A |
---------------------
A simple jump optimisation notices that the JMP
passes over just one instruction and can therefore
be removed by inverting the skip condition on the
previous MOVe, giving:
------------------
| LDA 0,B |
| LDA 1,C |
| AND 0,1 |
| MOVZL 1,1 |
| MOV# 1,1,SNR |
| LDA 1,D |
| STA 1,A |
------------------
Finally, peephole optimisation combines the AND with
the MOVZL giving ANDZL, and then combines this with
the following MOV# to give the complete code
sequence as:
------------------
| LDA 0,B |
| LDA 1,C |
| ANDZL 0,1,SNR |
| LDA 1,D |
| STA 1,A |
------------------
The most interesting thing to notice about this
particular sequence of optimisations is that with
the possible exception of the removal of the marked
STA instructions, the final code can be generated
very simply with local optimisations.
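The pending-store mechanism used in this walkthrough can be
sketched as follows; a minimal sketch under stated
assumptions: one pending store per location, flushed whenever
the location might be observed, and the environments of
converging paths merged at a label so that a store pending
identically on both paths is emitted once after the label.
    # Delayed ("pending") stores.
    class Pending:
        def __init__(self):
            self.pending = {}        # location -> register to store
            self.code = []
        def store(self, reg, loc):
            self.pending[loc] = reg  # mark as pending; emit nothing
        def flush(self):
            for loc, reg in self.pending.items():
                self.code.append("STA %s,%s" % (reg, loc))
            self.pending.clear()
        def merge_at_label(self, saved):
            # keep only stores pending identically on both paths
            self.pending = {loc: reg
                            for loc, reg in self.pending.items()
                            if saved.pending.get(loc) == reg}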
4 The design of the compiler
This section describes the features of the compiler which
have had an influence on the form of the intermediate code.
4.1 General structure
One of the aims of this type of compilation strategy is
to simplify the production of compilers, and a successful
technique for simplifying programs is to divide them into
several communicating modules, each largely independent of
the others but with well-defined interfaces between them.
At the highest level a compiler can be split up into three
major parts:
1 A language processor, which deals with the
language-dependent parts such as parsing,
semantic checking, and error reporting.
2 A code generator, which takes the decomposed
form of the program as generated by 1 above,
and constructs the appropriate code sequences
to perform the required functions.
3 An object-file generator, which builds an
object-file from the code sequences produced
by 2, in the form required by the system which
is to execute the program.
Commonly, the first two parts of this scheme are combined
into one program which generates as its output an
assembly-language source file corresponding to the original
program. The third part then becomes the standard system
assembler. This approach clearly simplifies the production
of the compiler, as one part, the assembler, is provided
already and can ease the problems of checking the compiler
because the code it generates is presented in a well-known
form. Despite these advantages it was rejected for the
following reasons:
1 In order that assembly language can be
generated, the compiler must have an internal
form of the instructions which is changed into
text, processed by the assembler, and finally
converted into the machine representation.
These transformations can be eliminated if the
compiler works directly with the machine
representations.
2 In general, the system-provided assembler will
be expecting to receive a much more powerful
language than the rather stereotyped text
produced by compilers. This will certainly
degrade the performance of the assembler. A
solution to this is to produce a cut-down
version of the assembler which only recognises
those constructs generated by the compiler.
However, producing a new assembler removes one
of the reasons for choosing this route,
namely, not requiring extra work in writing
the object-file generator.
3 As will be seen later (section 4.7), even
after the code sequences have been produced
there remain several optimisations which can
be performed using knowledge gained during the
production of those sequences, for example,
generating short forms of jump instructions
when the distance between the jump and its
destination is small enough. While in certain
cases these optimisations can be performed by
a standard assembler it is unlikely that the
structure of the code-generator would be as
simple as if a special-purpose object-file
generator were available.
The main interface in such a system is clearly that
between the language and machine dependencies, as most
languages are largely machine-independent. It is this
interface between the language-dependent and
machine-dependent parts of the compiler which is termed the
INTERMEDIATE CODE. In the following discussion it is
assumed that the reader has a reasonable understanding of
the structure of the final form of I-code, a definition of
which may be found in Appendix A2.
Where the code is to be generated by several different language
processors, a simple intermediate code which is easy to
produce may well be more attractive.
As I-code was intended for optimisation a high-level code
was chosen. In addition, as it was hoped that the code
could eventually be used in different language processors,
it was decided to keep the structure simple.
The complete compilation process may be thought of as a
sequence of transformations working from the source program
to the final object program via a number of intermediate
representations. As the transformations are applied the
representations become less dependent on the source language
and more dependent on the target machine. In order to
simplify the code-generator as much as possible the
intermediate code must lie as far from the source language
as is possible without straying from the objectives set out
below.
4.2.1 Objectives
One of the dangers in designing an intermediate code is
that of building into it old techniques and standard
expansions of source constructions, which while they may be
tried and tested cannot in any way be said to be "the only
solutions" or even "the best solutions".
One of the intentions behind the design of I-code was to
permit the use of varied implementation strategies. In the
same way that the only practical definition of a "good"
programming language is that it fits the style of the
particular programmer using it, so the measure of the power
of an intermediate code must include the ease with which it
can adapt to an existing style of code-generator writing.
Inevitably, practical constraints prevent total generality;
the most general form of a program is a canonical form of
itself, but this is little help in compiling it.
It follows that the intermediate code, while remaining true
to the original program and distant from "real" machines,
must provide enough simplification to make the task of
code-generation as easy as possible without inhibiting
optimisation.
From the start it was appreciated that an intermediate
code suitable for use in optimising compilers would
necessarily require more processing than a code such as
O-code which was aimed at a quick implementation. The
original hope was that although each machine-dependent code
generator would not be small, typically about 3000-4000 IMP
statements, large portions of one could be taken as a basis
for a new implementation. This has proved to be the case,
and provision of an existing code-generator as a template
greatly simplifies the task of creating a new one
(section 6.4.1).
For example, the following two program fragments are
semantically identical:
A B
----------------------------------------------------
| | P = 0 |
| | cycle |
| TEST for P = 1, 1, 10 | P = P+1 |
| | TEST |
| | repeat until P = 10 |
| | |
----------------------------------------------------
However, in "B" the information that the fragment contains a
simple for construction, while not completely lost, has been
scattered through the code, and this dilution of information
will increase the complexity of any code-generator wishing
to handle for loops specially.
To leave open all avenues for optimisation it is
necessary therefore, that all of the semantic information in
the source program is preserved in a compact form in the
I-code. One sure way of achieving this property is to
design the code in such a way as to allow the regeneration
of the source program, or at least a canonical form of it
which is not significantly different from the original. In
this context insignificant differences are the removal of
comments and the standardisation of variant forms of
statements, such as:
NEWLINE if COUNT = 0
and: if COUNT = 0 then NEWLINE
Many intermediate codes frequently pun on addresses and integer values. Several
later versions of the codes have attempted to solve these
problems by parameterising the intermediate code generator
so that the characteristics of the target machine may be
used to modify the code which is produced. However, they
still have built in to them assumptions about how the
objects can be addressed.
There are so many constraints which can be imposed on the
code to be generated, such as operating system requirements
and conventions for communicating with the run-time
environment, that a parameterised first phase could not be
expected to generate code which was well-suited to every
installation.
The authors of JANUS [Coleman, 1974] write that they believe
that the approach of using a parameterised intermediate code
"... is a dead end, and that the adaptability must come in
the translation from the intermediate language to machine
code".
4.2.1.4 Simplification
For the complexity of the machine-dependent phases of
compilation to be kept as low as possible the
machine-independent phase must do as much work as possible
while keeping within the constraints imposed by the previous
objectives. One way of simplifying the intermediate code is
for certain high-level constructions to be expanded into
lower-level constructions, but only when there is just one
expansion possible under the rules of the language, and that
expansion does not scatter information which may be of later
use.
The most obvious case of such expansion is in dealing with
complex conditional clauses such as:
-----------------------------------------------------
| if (A=B and C#D) or (E<F and G>H) then X else Y |
-----------------------------------------------------
IMP-77 specifies that the condition will only be evaluated
as far as is necessary to determine the inevitable truth or
falsity of the condition, and so, bearing in mind the
modifications to be discussed in section 4.6.1, the
statement can be represented more simply as:
---------------------------------
| if A # B then ->L1 |
| if C # D then ->L2 |
| L1: if E >= F then ->L3 |
| if G <= H then ->L3 |
| L2: X |
| ->L4 |
| L3: Y |
| L4: |
---------------------------------
This expansion is tricky and notoriously error prone, and
therefore is best done once and for all in the common phase.
Similarly it is possible to expand all simple control
structures into their equivalent labels and jumps, providing
that the structural information is not lost thereby.
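The expansion itself can be sketched as a pair of mutually
recursive routines, one arranging a jump when the condition
is false and one when it is true; the tree representation and
mnemonic output below are assumptions for illustration.
    # Expand a condition tree into tests and jumps.  A condition
    # is ("cmp", op, a, b), ("and", c1, c2) or ("or", c1, c2).
    INVERT = {"=": "#", "#": "=", "<": ">=",
              ">": "<=", "<=": ">", ">=": "<"}
    def make_labeller():
        n = [0]
        def newlab():
            n[0] += 1
            return "L%d" % n[0]
        return newlab
    def jump_if_false(c, flab, code, newlab):
        if c[0] == "cmp":
            _, op, a, b = c
            code.append("if %s %s %s then ->%s"
                        % (a, INVERT[op], b, flab))
        elif c[0] == "and":          # false if either part is false
            jump_if_false(c[1], flab, code, newlab)
            jump_if_false(c[2], flab, code, newlab)
        else:                        # "or": a true first part skips on
            tlab = newlab()
            jump_if_true(c[1], tlab, code, newlab)
            jump_if_false(c[2], flab, code, newlab)
            code.append(tlab + ":")
    def jump_if_true(c, tlab, code, newlab):
        if c[0] == "cmp":
            _, op, a, b = c
            code.append("if %s %s %s then ->%s" % (a, op, b, tlab))
        elif c[0] == "or":           # true if either part is true
            jump_if_true(c[1], tlab, code, newlab)
            jump_if_true(c[2], tlab, code, newlab)
        else:                        # "and": a false first part skips on
            flab = newlab()
            jump_if_false(c[1], flab, code, newlab)
            jump_if_true(c[2], tlab, code, newlab)
            code.append(flab + ":")
Applied to the condition above with the label of the else
clause, this reproduces the expansion shown.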
By suitable structuring of a program significant gains in paging
behaviour can be obtained, and so this option should not be
pre-empted by the intermediate code (as it is by Z-code, which
automatically reorders the definitions of procedures).
The possibility of automatic improvement of paging
behaviour was investigated by Pavelin who showed that the
paging characteristics of a program can be improved by an
automatic reordering of the code [Pavelin, 1970].
Pavelin's thesis describes the breaking-up of a program into
"chunks", defined by branches and the destinations of
branches. At each chunk boundary extra instructions are
planted to cause the updating of a "similarity array" which
records the dynamic characteristics of the program. After
several runs the similarity arrays are merged and the result
is used to specify a reordering of the chunks which should
improve the paging performance. In test cases the
working-set size of the code was reduced by as much as 40%.
The thesis also went on to say that the various compilation
problems associated with this "... can be alleviated by
operating on an intermediate code which is machine
independent with symbolic code addresses".
For example, on the INTERDATA 7/16 a procedure at the
outermost textual level uses register 15 to access its local
stack frame, giving the exit sequence:
-------------------
| LM 7, 4(15) |
| BFCR 0, 8 |
-------------------
but a procedure nested within this would use register 14
thus:
-------------------
| LM 7, 4(14) |
| BFCR 0, 8 |
-------------------
It follows that the signal routine must be told which
base register to use at each stage of the recovery. This can
be done either by planting code in the entry and exit
sequences of each procedure, or by keeping a static table
associating procedure start and finish addresses with the
appropriate base register.
The first method is poor as it imposes a run-time overhead
on all procedures, whether they trap events or not. The
second method is better but can be complicated if procedures
are nested as the start-finish addresses alone no longer
uniquely define the procedure. One solution is to cause all
procedures which use the same exit sequence to be loaded
into distinct areas, and to associate the recovery
information for the signal routine with each area. This
reduces the static overhead to a few words per area, rather
than a few words per procedure.
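The second method then amounts to a table search at recovery
time. A minimal sketch, with purely hypothetical addresses:
    # One entry per loaded area; procedures sharing an exit
    # sequence are grouped, so the cost is a few words per area.
    AREAS = [
        (0x1000, 0x17FF, 15),   # outermost procedures: register 15
        (0x1800, 0x23FF, 14),   # nested procedures: register 14
    ]
    def base_register_for(pc):
        for start, finish, reg in AREAS:
            if start <= pc <= finish:
                return reg
        raise LookupError("address not in a known procedure area")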
ii Array references with constant subscripts need no
address calculations at run-time. For example
using V as declared above, the element V(2) is
immediately addressable as the halfword with
displacement "a+2 + 2*2" from LNB.
iii In certain more general cases when the subscript is
a variable, access can be simplified by remembering
previous calculations. For example, the address of
the array element V(X) is
-------------------------------------
| addr(V(0)) + X*size of each element |
-------------------------------------
In the example above this becomes:
-------------------------------------
| LNB+a+2 + X*size of each element |
-------------------------------------
which can be rearranged to:
-------------------------------------------
| a+2 + (X*size of each element+LNB) |
-------------------------------------------
Hence the following code could be produced (7/16):
--------------------
| V(X) = 0 |
| |
| LH 1,X(LNB) | pick up X
| AHR 1,1 | *2 (2 bytes per integer)
| AHR 1,LNB | add in LNB
| SHR 0,0 | get zero
| STH 0,a+2(1) | store in V(X)
--------------------
Noting that the value now in register 1
(X*size+LNB) only depends on the size of each
element, X, and the local name base, it is clear
that register 1 can be used to address the X'th
element of any integer array of one dimension and
constant bounds declared at the current level.
Hence if the array W(1:12) were declared
immediately after Y in the example above, while
register 1 is not changed W(X) can be addressed as
a+2002(1).
On the other hand, a machine with limited store cover,
such as the Data General NOVA which only has an eight-bit
displacement, will almost certainly force the array to be
implemented as an immediately addressable pointer which is
initialised at run-time to the address of storage claimed
explicitly.
--------------------------------------------------------
| LNB |
| | |
| | |
| v .---.---.---. |
| .- -| X | V | Y | - - - |
| .---.---.---. |
| | |
| | |
| | .------.------. .--------. |
| +->| V(0) | V(1) | - - - - | V(999) | - - - |
| .------.------. .--------. |
--------------------------------------------------------
With this organisation the address of V(X) will be:
----------------------------
| V + X*size of each element |
----------------------------
and there is little that can be done by rearranging the
expression to improve on the "obvious" code (7/16):
----------------
| V(X) = 0 |
| |
| LH 1,X | pick up X
| AHR 1,1 | double it
| AH 1,V | add in addr(V(0))
| SHR 0,0 |
| STH 0,0(1) |
----------------
Not only is this second code sequence longer than the first
by two bytes, but it will execute more slowly as the second
addition involves a store reference whereas the equivalent
instruction in the first sequence uses a register.
In both cases, however, some simplification can be done if
the subscript is an expression of the form:
------------------------------
| X plus or minus CONSTANT |
------------------------------
in which case the constant can be removed from the subscript
expression evaluation and added into the final displacement.
For example (7/16):
------------------------
| V(X-7) = 0 |
| |
| LH 1,X | pick up X
| AHR 1,1 | double it
| AHR 1,LNB | add in LNB
| SHR 0,0 | get zero
| STH 0,a+2+(-7)*2(1) |
------------------------
Unfortunately even this optimisation may not be
available. For example, the ICL 2900 series performs array
accesses through a DESCRIPTOR REGISTER, and the extra
displacement cannot be added into the instruction. Also
some machines, such as the IBM 360, only permit positive
displacements in instructions.
The examples above pose the following problem: If the
intermediate-code is to know nothing of the target machine
it cannot know the best way to declare the array, nor the
best way to access it. Therefore the code must always
produce the same sequences for array declarations and array
accesses. It follows that these sequences must remain quite
close to the original source and not include any explicit
address calculations.
As another example, the DEC PDP11 range has a hardware
stack which grows with decreasing store addresses. Because
of this it could be convenient to allocate storage for
variables in that order, from large addresses to small
addresses. However, in several cases it may be necessary to
force objects to be created in increasing addresses, such as
when program structures are to be mapped onto
hardware-defined structures in memory, resulting in an
implementation which must be able to create similar
objects in different ways depending on the context.
Finally, some machines provide instructions in which the
displacement of the operand is scaled before use, depending
on the size of that operand. The GEC 4080 is such a
machine, with instructions such as:
LDB 1 load byte <1>
LD 1 load halfword, bytes <2> & <3>
LDW 1 load fullword, bytes <4>,<5>,<6> & <7>
When producing code for such machines it is convenient to
allocate all the local objects of the same size in
particular areas, and then order the areas in increasing
order of the size of the objects they contain. This permits
fuller use of the available displacement field in the
instructions.
The solution to these problems which was chosen in I-code
was to define a DESCRIPTOR for each object to be
manipulated. On input to the code-generator descriptors are
converted from their machine-independent form to a new form
appropriate to the target machine. As all subsequent
reference to the object will be through descriptors the code
produced will automatically reflect the decisions made at
the time the descriptors were created.
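The effect can be sketched as below; the fields and the
allocation policy are assumptions for illustration, the point
being only that the choice between a directly addressed array
and a run-time pointer is made once, when the descriptor is
converted, and every later reference inherits that decision.
    # Convert machine-independent descriptors into a
    # target-specific addressing form.
    from dataclasses import dataclass
    @dataclass
    class Descriptor:
        name: str
        is_array: bool = False
        bounds: tuple = None     # (lower, upper) if constant
    def convert(d, frame, displacement_bits):
        # frame: one-element list holding the next free
        # displacement from LNB
        if d.is_array and displacement_bits < 16:
            # limited store cover (e.g. the NOVA): allocate one
            # pointer cell; storage is claimed at run time
            disp, frame[0] = frame[0], frame[0] + 1
            return ("pointer", disp)
        size = d.bounds[1] - d.bounds[0] + 1 if d.is_array else 1
        disp, frame[0] = frame[0], frame[0] + size
        return ("direct", disp)  # addressed directly from LNB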
As has been discussed previously, it may be possible to
remove the overhead in setting up addressability for local
variables and parameters if the parameters can be held in
registers and the local variables are never referenced.
After examining many procedures which do use local variables
it is clear that a large number of them do not need the
complete overhead in setting up a local frame base as they
could use the workspace pointer (stack pointer) instead.
The criterion is that the position of the locals relative to
the workspace pointer must be known at compile time. This
reduces to the procedure not having any objects with
computed sizes (arrays with computed bounds, for example)
and no calls on procedures which access the locals as their
global variables.
Consider the compilation of the following procedure on the
PDP11:
----------------------------------------
| routine MARK(record(cellfm)name CHAIN) |
| integer N |
| N = 0 |
| while not CHAIN == NULL cycle |
| N = N+1 |
| CHAIN_INDEX = N |
| CHAIN == CHAIN_LINK |
| repeat |
| end |
----------------------------------------
The code normally produced for this routine would be:
--------------------------
| MOV LNB,-(SP) | remember old LNB
| MOV DS,-(SP) | remember DS
| MOV R0,(DS)+ | save the parameter
| MOV DS,LNB | set up local addressing
| ADD #20,DS | reserve local space
| CLR 10(LNB) | N = 0
| $1: MOV -2(LNB),R1 | test CHAIN
| BEQ $2 | branch if NULL
| INC 10(LNB) | N = N+1
| MOV 10(LNB),2(R1) | CHAIN_INDEX = N
| MOV (R1),-2(LNB) | CHAIN == CHAIN_LINK
| BR $1 | repeat
| $2: MOV (SP)+,DS | restore DS
| MOV (SP)+,LNB | restore LNB
| RTS PC | return
--------------------------
however, by using workspace pointer (DS) relative addressing
this reduces to:
--------------------------
| MOV R0,(DS)+ |
| TST (DS)+ | reserve local space
| CLR -2(DS) | N = 0
| $1: MOV -4(DS),R1 | test CHAIN
| BEQ $2 |
| INC -2(DS) | N = N+1
| MOV -2(DS),2(R1) | CHAIN_INDEX = N
| MOV (R1),-4(DS) | CHAIN == CHAIN_LINK
| BR $1 |
| $2: SUB #4,DS | restore DS
| RTS PC | return
--------------------------
This optimisation can be performed quite simply by the
third phase of compilation.
In the interface between the second and third phases the
code sequences generated by the second phase are made up of
items of the form:
<type> <VALUE>
where <type> describes where <VALUE> is to be put, for
example in the code area or in the private data area. To
achieve the workspace-pointer-relative addressing extra
types are introduced which specify that the associated value
is the displacement of a local variable from LNB. Other
codes are needed to be able to modify the operation part of
the instruction which uses the displacements but these will
be ignored here as they cause no difficulty and would just
obscure the discussion. In addition, an extra <modify DS>
item is output whenever DS is explicitly altered (as when
parameters are stacked using MOV ??,(DS)+). By default the
third phase will treat these extra types as being exactly
equivalent to <code area> types, and will generate the first
sequence of code. However, if, when the end of the procedure
is processed, the second phase discovers that no dynamic
objects or dangerous procedure calls were generated, it
marks the end of the procedure accordingly (in the same way
as described in section 4.7.2). This mark instructs the
third phase to relocate all values with the appropriate type
so as to make them relative to DS. The <modify DS> types
are used to keep the third phase's idea of the current
position of DS in step with reality.
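A minimal sketch of that relocation, assuming the items are
(type, value) pairs, that "local" values are displacements
from LNB, and that each "modify DS" item carries the amount
by which DS moved:
    # Third-phase relocation of local displacements onto DS.
    def relocate(items, simple):
        out, ds = [], 0          # ds: DS's offset from notional LNB
        for kind, value in items:
            if kind == "modify_ds":
                ds += value      # e.g. +2 for each MOV ??,(DS)+
            elif kind == "local":
                # by default a local is treated as a plain code
                # value; in a marked ("simple") procedure it
                # becomes DS-relative instead
                out.append(("code", value - ds if simple else value))
            else:
                out.append((kind, value))
        return out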
4.5 Procedure entry and exit.
IMP is heavily based on the use of procedures, indeed the
only method of communicating with the controlling
environment is by means of procedure calls. Also the
techniques of structured programming result in the extensive
use of procedures. Clearly when writing a compiler for such
languages much thought must be given to making procedure
entry and exit (and the associated passing of parameters) as
efficient as possible.
4.5.1 User-defined procedures
The usual technique for procedure entry and exit is to
have standard preludes and postludes which cover all the
different types of procedure. For example the EMAS IMP code
sequences [Stephens, 1974] are (ICL4/75):
-------------------------
| STM 4,14,16(WSP) | save the current environment
| BAL 15, PROC | enter the procedure
| . |
| PROC ST 15,60(WSP) | save the return address
| LR LNB,WSP | set up the local stack frame
| LA WSP,***(WSP) | claim local stack space
| BALR 10,0 | set up code addressability
| . |
| . |
| LM 4,15,16(LNB) | restore calling environment
| BCR 15,15 | return
-------------------------
While this has proved to be convenient to generate and
efficient to execute it has one major problem: part of the
housekeeping of the procedure entry is performed at the call
itself. This seems undesirable for two reasons:
i Procedures are generally called more often than
they are defined. If part of the housekeeping of
procedure entry is done at the call that code will
be duplicated at each call, thus increasing the
size of the program. Putting that code within the
procedure reduces the size overhead.
ii If the knowledge of what housekeeping needs to be
done for procedure entry is needed outside the
procedure it becomes impossible to alter the entry
and exit sequences to suit the actual procedure.
In particular on certain machines it is possible to
remove the entry and exit sequences altogether when
the procedures are simple enough.
If the 4/75 compiler moved the environment-saving STM
instruction into the body of the procedure the storing of
the return address would be performed automatically:
-------------------------
| BAL 15,PROC |
|- - |
| PROC STM 4,15,16(WSP) |
| LR 8,WSP |
| .. .. |
-------------------------
This not only saves four bytes per call, very important on a
machine with a very severely limited immediate addressing
range, but also reduces the overhead in entering the
procedure by one instruction.
A further modification would be to pass one or more of the
parameters in the registers, leaving the way open for
remembering that fact inside the procedure.
Hence a call could be reduced from:
------------------------
| L 1,X |
| ST 1,64(WSP) |
| L 2,Y |
| ST 2,68(WSP) |
| BAL 15,PROC | PROC(X, Y)
|- -|
| PROC STM 4,15,16(WSP) |
| .. .. |
------------------------
to:
------------------------
| L 0,X |
| L 1,Y |
| BAL 15,PROC |
|- -|
| PROC STM 4,1,16(WSP) |
| .. .. |
------------------------
The ability to determine exactly how parameters are to be
passed can be of crucial importance in the efficiency of the
procedure mechanism.
When compiling for the PDP11 the obvious calling sequence
for a procedure with two integer value parameters would be:
-------------------
| MOV X,-(SP) |
| MOV Y,-(SP) |
| JSR PC,PROC |
-------------------
Unfortunately this produces problems inside the procedure as
the return address, stacked by JSR, is too far down the
stack to permit the use of the RTS instruction to return,
for this would leave on the stack the space used by the
parameters. Nor can the stack be adjusted before the
return, for the return would then have to be made indirect
through a location beyond the stack pointer, and space
there must be considered volatile, being used by interrupt
handling. Extra
instructions are needed either at the call or inside the
procedure to adjust the stack; the JSR instruction may well
not be "a beauty" as claimed by some implementors [Bron,
1976]. A MARK instruction has been introduced in an attempt
to overcome this problem, but it is far from helpful as it
imposes an arbitrary register convention and puts all of the
overhead on the call rather than on the procedure itself.
On the other hand if all of the parameters can be passed in
registers, the JSR will put the return address on a clear
stack permitting the use of RTS for the return. As in
practice most procedures have few parameters, usually only
one or two, this can give a large saving.
As an example of the power of being able to alter entry and
exit sequences consider a recursive implementation of the
IMP routine SPACES:
-----------------------------
| routine SPACES(integer N) |
| return if N <= 0 |
| SPACES(N-1) |
| SPACE |
| end |
-----------------------------
On the PDP10 the straightforward coding for this would be:
--------------------------
| MOVE 0, X | pick up X
| MOVEM 0, 3(SP) | assign the parameter
| PUSHJ SP, SPACES | call SPACES
|- -|
| SPACES: MOVEM LNB,1(SP) | save old frame base
| MOVE LNB,SP | pick up new frame base
| ADDI SP,3 | reserve new stack frame
| SKIPLE 1,2(LNB) | load, test & skip if X<=0
| JRST LAB1 | jump to LAB1
| MOVE SP,LNB | restore stack pointer
| POPJ SP | return
| LAB1: SOJ 1, 0 | X-1 -> ACC1
| MOVEM 1,3(SP) | assign parameter
| PUSHJ SP,SPACES | call SPACES
| PUSHJ SP,SPACE | call SPACE
| MOVE SP,LNB | restore stack pointer
| MOVE LNB,1(SP) | restore old frame base
| POPJ SP |
--------------------------
By applying the optimisations of passing the parameter in an
accumulator (called ARG) and remembering that the parameter
is in this accumulator on entry to the procedure, the code
reduces to:
---------------------------
| MOVE ARG,X | pick up X
| PUSHJ SP, SPACES | call SPACES
|- -|
| SPACES: MOVEM LNB,1(SP) |
| MOVEM ARG,2(SP) | assign the parameter
| ADDI SP,3 |
| JUMPG ARG, LAB1 | ->LAB1 if ARG > 0
| MOVE SP,LNB | restore stack pointer
| MOVE LNB, 1(SP) |
| POPJ SP | return
| LAB1: SOJ ARG, 0 | parameter = ARG-1
| PUSHJ SP,SPACES |
| PUSHJ SP,SPACE |
| MOVE SP,LNB |
| MOVE LNB,1(SP) |
| POPJ SP |
---------------------------
On inspection it is clear that the local stack frame
(pointed at by LNB) is never used within the procedure
except by the entry and exit sequences. Hence by reducing
those sequences to the absolute minimum the code becomes
---------------------------
| MOVE ARG,X |
| PUSHJ SP, SPACES |
|- -|
| SPACES: JUMPG ARG, LAB1 |
| POPJ SP |
| LAB1: SOJ ARG, 0 |
| PUSHJ SP,SPACES |
| PUSHJ SP,SPACE |
| POPJ SP |
---------------------------
Finally, an opportunistic optimisation may be performed
[Knuth, 1974; Spier, 1976] by noticing that the final two
instructions may be combined so that the procedure SPACE
uses the return address pushed onto the stack for the return
from SPACES. This results in the tightest form of the code:
----------------------------
| MOVE ARG,X |
| PUSHJ SP, SPACES |
|- -|
| SPACES: JUMPG ARG, LAB1 |
| POPJ SP |
| LAB1: SOJ ARG, 0 |
| PUSHJ SP,SPACES |
| JRST SPACE |
---------------------------
The final steps in this optimisation can only be performed
once the body of the procedure has been compiled. In order
that the correct (in this case non-existent) entry sequence
can be used an extra pass over the object code is necessary.
This pass can be combined with the process of adjusting
labels and jumps which is carried out in the third phase of
compilation described in section 4.7. The code generator
can mark the position where an extra sequence is required
and at the end of the procedure can inform the third phase
of any salient features found in the body. The third phase
can then decide on the best entry and exit sequences to use.
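In outline the mechanism might be implemented as follows.
The sketch is in C and every name in it (proc_mark,
choose_entry, the feature flags) is invented for
illustration; it is not the code of the IMP77 compilers.
----------------------------------------------------------------------
#include <stdio.h>

#define USES_FRAME  1      /* body addresses a local stack frame      */
#define CALLS_PROCS 2      /* body contains "dangerous" calls         */

struct proc_mark {
    long where;            /* position of the entry-sequence marker   */
    int  features;         /* salient features noted by phase two     */
};

/* Phase three: substitute the cheapest correct entry sequence. */
static void choose_entry(const struct proc_mark *p)
{
    if (p->features == 0)
        printf("%ld: no entry sequence at all\n", p->where);
    else if (p->features & USES_FRAME)
        printf("%ld: full prelude claiming a stack frame\n", p->where);
    else
        printf("%ld: save the return address only\n", p->where);
}

int main(void)
{
    struct proc_mark spaces = { 64, 0 };   /* a leaf like SPACES */
    choose_entry(&spaces);
    return 0;
}
----------------------------------------------------------------------
A procedure such as SPACES above accumulates no features and
so, as before, receives no entry sequence at all.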
This ability to tailor the "housekeeping" parts of
procedures can be used in many circumstances to limit the
inclusion of code which is needed to handle rare
constructions to those procedures which use the feature.
As an example of this consider the ICL 2900 series.
The machines of the series are designed around a hardware
stack, which resides in one, and only one, segment of the
user's virtual memory, and thus limits this data space to
255K bytes. In order to be able to handle programs using
very large arrays space must be available off-stack in
another segment or set of consecutive segments. The
maintenance of this extra data space will require
instructions to be executed on entry to and on exit from
procedures which claim space from it, but not from those
which only use space from the stack. These extra
instructions can be added to the procedure in a simple
manner by the third phase as it now controls the form of the
procedure when all the necessary information is available.
For these optimisations to be performed the intermediate
code must not lay down rules for procedure entry and exit,
rather it should simply mark the points at which suitable
code is required.
An additional consideration in the design of the I-code
for procedure entry and exit is the requirement of some
machines for a "pre-call" to be made to prepare a hardware
stack for parameters prior to their evaluation and
assignment.
For example (ICL2900):
-------------------
| PROC(1, 2, 3) |
|- -|
| PRCL 4 | pre-call
| LSS 1 | load 1
| SLSS 2 | stack it and load 2
| SLSS 3 | stack it and load 3
| ST TOS | store it on Top Of Stack
| RALN 8 | raise the Local Name Base
| | to point to the new frame
| CALL PROC |
-------------------
Following these considerations the form of procedure call
chosen for I-code was:
-----------------
| PROC P | stack procedure descriptor
| | \
| {stack param} | : repeated for each parameter
| ASSPAR | /
| ENTER | enter the procedure
-----------------
ASSPAR causes the value described on the top of the stack to
be assigned to the next parameter, identified by the
procedure descriptor second on the stack, using either
ASSVAL or ASSREF as appropriate.
In order to pass some of the parameters in registers all
that need be done is for the initial processing of the
descriptors for those parameters to define them as the
appropriate registers. PROC can then "claim" those
registers, the parameter assignment will load them, and
finally ENTER can release them for subsequent re-use on
return from the procedure.
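The claiming and releasing of parameter registers can be
pictured as follows; this is an illustrative C sketch with
invented names (claim, release, call_proc), not the actual
implementation.
----------------------------------------------------------------------
#include <stdio.h>

static unsigned claimed;                 /* bitmask of busy registers */

static void claim(int r)   { claimed |=  (1u << r); }
static void release(int r) { claimed &= ~(1u << r); }

static void call_proc(int nparams)
{
    int r;
    for (r = 0; r < nparams; r++)        /* PROC: claim the registers */
        claim(r);
    for (r = 0; r < nparams; r++)        /* one ASSPAR per parameter  */
        printf("load parameter %d into register %d\n", r + 1, r);
    printf("enter the procedure\n");     /* ENTER                     */
    for (r = 0; r < nparams; r++)        /* free again on return      */
        release(r);
    printf("busy mask now %#x\n", claimed);
}

int main(void) { call_proc(2); return 0; }
----------------------------------------------------------------------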
which were chosen because they typify the main problems in
this area.
ITOS is a fairly complicated string function which
returns as its result the decimal character string
representation of the integer value passed to it as a
parameter. Because of its complexity this procedure is
almost always best implemented as an external procedure
which is linked into the program along with any other
external entities required.
REM is an integer function which returns the remainder of
dividing the first integer parameter by the second, and on
many machines can be efficiently compiled in-line, as most
integer divide instructions provide both the quotient and
the remainder. However, when compiling for machines such as
the DATA GENERAL NOVA or the DEC PDP11 which do not have
the optional divide instructions, division has to be
performed by a complicated subroutine, suggesting that REM
itself should be an external procedure like ITOS.
READSYMBOL falls somewhere between the two, mainly
because it is defined to have a general name parameter, that
is, the parameter may be a reference to any type of entity:
integer, real, byteinteger, etc. To implement READSYMBOL as
an external procedure it would have to be passed the general
name parameter (comprising both the address of the actual
parameter and information about its type and precision), and
would have to interpret that parameter in order to be able
to store the character, suitably converted, in the
appropriate way.
A much more efficient implementation is to convert the
statement:
---------------
| READSYMBOL(S) |
---------------
into the equivalent form:
------------------
| S = F$READSYMBOL |
------------------
where F$READSYMBOL is a function which returns as its result
the character value that READSYMBOL would have placed into
its parameter. Once this is done conversions and the choice
of store operation can be left to the usual assignment part
of the compiler.
A further complication can arise if, as in the case of
the INTERDATA 7/16 operating system, ISYS [Dewar, 1975],
several permanent procedures map directly onto
system-provided facilities: the function F$READSYMBOL can be
replaced by the supervisor call "SVC 8,0", SELECT INPUT by
"SVC 6" etc.
The difficulty caused by READ is mainly one of space. As
READ can input an integer value, a real value, or a string
value depending on the type of its (general name type)
argument, it is going to be fairly large, especially if the
hardware on which it runs does not provide floating-point
instructions, forcing those functions to be performed by
subroutine. It follows that on small systems it may be
convenient to replace calls on READ by calls on smaller
procedures, chosen at compile-time by examining the type of
the parameter given to READ, which input solely integer,
real, or string values.
Finally it should be noted that the substitutions and
modifications discussed above may only be generated as
replacements for direct calls on the procedure; if the
procedure is passed as a parameter to another procedure no
alterations are possible and a "pure" version must be
available. As passing a procedure as a parameter is totally
distinct from calling the procedure this case does not
prevent the improvements being carried out where possible.
It should now be clear that the efficient implementation
of permanent procedures will differ greatly from the
implementation of user-defined procedures, and that the
implementation of permanent procedures will itself differ
from machine to machine. Hence the intermediate code must
make no assumptions about either which permanent procedures
are available or how they are to be implemented.
As a side-effect of removing any built-in properties from
permanent procedures it becomes possible for a simple
code-generator to ignore any possibility of producing
special code and compile them all as externals.
These transformations of procedures can only be applied
when the procedures are invoked (called) directly. In the
case of procedures passed as parameters all calls will of
necessity be the same and hence either it will not be
possible to pass some permanent procedures as parameters, an
unfortunate limitation imposed by several languages, or
there must be a "pure" form of the procedures available.
This latter can be done very simply using I-code. The
primitive procedure descriptors are defined exactly as if
the procedures were truly external, but with an extra marker
showing them to be "permanent". The only time that this
marker is used is in the procedure call generating section
of the compiler. If the procedure is being passed as a
parameter this section of the compiler is not entered and so
the procedure will be passed as an external. All that is
now necessary is for there to be an external manifestation
available when the program executes. This method has the
added advantage that there is no compile-time overhead,
especially important considering that passing procedures as
parameters is one of the least-used features of IMP77.
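The effect of the marker on the two paths can be sketched
in C as follows (the names and descriptor layout are
invented for illustration):
----------------------------------------------------------------------
#include <stdio.h>

struct proc_desc {
    const char *name;
    int permanent;       /* the extra marker on permanent descriptors */
};

/* Direct calls reach this section, where the marker is examined. */
static void generate_call(const struct proc_desc *p)
{
    if (p->permanent)
        printf("expand %s specially (in-line, SVC, ...)\n", p->name);
    else
        printf("generate an external call on %s\n", p->name);
}

/* Passing the procedure as a parameter never reaches generate_call,
   so the external ("pure") form is used at no compile-time cost. */
static void pass_as_parameter(const struct proc_desc *p)
{
    printf("stack the external address of %s\n", p->name);
}

int main(void)
{
    struct proc_desc rem = { "REM", 1 };
    generate_call(&rem);
    pass_as_parameter(&rem);
    return 0;
}
----------------------------------------------------------------------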
4.5.4 Primitive Procedures
It is rare for machines to provide simple instructions
which can deal directly with all of the requirements of
high-level languages and so several constructions will have
to be handled by subroutines. The code generator may then
refer to these "primitive procedures" as though they were
machine instructions. The cases in which such procedures
are required commonly include exponentiation, string
manipulation, and array declaration and access.
Given these procedures the code-generator has a choice
between calling them as closed subroutines or expanding them
in-line. The former produces dense code but will execute
more slowly than the latter (and possibly suffer from not
knowing what is corrupted by the routine and therefore
having to forget everything it knows). On the other hand,
while the expansion of primitive procedures in-line will
improve the execution speed of the program, the
code-generator must then be able to create the appropriate
code sequences, and thereby becomes more bulky.
Once again the choice must be left to the code-generator as
the benefits of a particular decision will depend on both
the target machine and the use to which the compiler is to
be put. If the compiler is to be used for large
mathematical problems it is likely that the gains made by
putting exponentiation in-line will outweigh the
disadvantage of the extra code size, whereas in
operating-system work, where exponentiation is probably
never needed, the extra complexity in the code generator to
expand the routine would not be desirable.
Given that some of the primitive procedures will be
referenced often (checked array access, for example) it is
important that entry to them is made as efficient as
possible and in this area the ability to reorder code can be
used to great effect.
In the original Interdata 7/32 IMP77 compiler the
primitive routines were gathered together at the end of the
user's code, as it was only then that it was known which
procedures were required.
---------
| | <- CODE BASE (register 15)
| USER |
| CODE |
| |
---------
| |
| PRIM |
| PROCS |
| |
---------
With this scheme programs of 16Kbytes or less can reference
the primitive procedures with 32-bit instructions
(program-counter relative addressing). Unfortunately once
the program grew beyond this limit the larger and slower
48-bit form of the instructions had to be used in order to
achieve addressability. In the IMP77 code generator there
were 352 such large instructions.
In the new compiler the object code is reordered to place
the primitive procedures at the head of the user's code
where they can be addressed relative to CODE BASE.
---------
| | <- CODE BASE (register 15)
| PRIM |
| PROCS |
| |
---------
| |
| USER |
| CODE |
| |
---------
The immediate disadvantage of this is that it will push the
user's procedures further away from CODE BASE and hence
increase the chances of a user procedure reference requiring
a long (48-bit) instruction. However in practice this is
not a problem as the total size of the primitive procedures
is usually quite small, typically less than 800 bytes on the
7/32. The IMP77 code generator mentioned above now needs no
long references at all, saving 724 bytes of code, out of
about 40Kbytes. The compression of the code so achieved can
be enhanced slightly by bringing the destinations of more
jumps into the short-jump range, giving an extra saving of
20 bytes in the case above. In addition, now that a register
(CODE BASE) is pointing to the first primitive procedure the
list of procedures required can be reordered to place the
most frequently referenced one first and thereby reduce
references to it to 16-bit instructions
(BALR LINK,CODEBASE).
When compiling with checks on, by far the most commonly
referenced primitive procedure is the routine which checks
for the use of an unassigned variable (over 2000 references
to it in the code generator), and this trivial optimisation
results in a saving of more than 4000 bytes.
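The reordering itself is trivial; the following C sketch
conveys the idea (the procedure names and reference counts
are invented):
----------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>

struct prim { const char *name; int refs; };

static int by_refs(const void *a, const void *b)
{
    return ((const struct prim *)b)->refs
         - ((const struct prim *)a)->refs;
}

int main(void)
{
    struct prim prims[] = {
        { "exponentiation",     40 },
        { "unassigned check", 2000 },
        { "array access",      600 },
    };
    size_t i, n = sizeof prims / sizeof prims[0];

    qsort(prims, n, sizeof prims[0], by_refs);
    for (i = 0; i < n; i++)    /* the first needs only BALR to reach */
        printf("%8d  %s\n", prims[i].refs, prims[i].name);
    return 0;
}
----------------------------------------------------------------------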
However, if it is known that the label $1 will only ever be
used once, the code-generator may remember that the current
value of the variable X will still be in register 1
following the label, and thus remove the need for it to be
loaded again if it is required before register 1 gets
altered. In the case of user-defined labels no statement
can be made about the number of uses of each label without a
complete analysis of the parts of the program where the
label is in scope.
This suggests that I-code should maintain a clear
distinction between user-defined and compiler-generated
labels. Also, by making the rule that compiler-generated
labels may only be used once, the internal representations
of labels may be reused by the code-generator, removing the
necessity for large tables of label definitions in this
phase of compilation.
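One possible recycling scheme is sketched below in C; the
names and the fixed table size are illustrative assumptions
only.
----------------------------------------------------------------------
#include <stdio.h>

#define NSLOTS 32                  /* small and fixed: slots recycle */

static long pending[NSLOTS];       /* address of the one reference   */
static int  free_list[NSLOTS], nfree;

static int new_label(long ref_addr)
{
    int slot = free_list[--nfree]; /* assumes nfree > 0              */
    pending[slot] = ref_addr;
    return slot;
}

static void define_label(int slot, long def_addr)
{
    printf("patch reference at %ld to %ld\n", pending[slot], def_addr);
    free_list[nfree++] = slot;     /* single use: recycle at once    */
}

int main(void)
{
    int i, l;
    for (i = 0; i < NSLOTS; i++) free_list[nfree++] = i;
    l = new_label(100);            /* conditional jump emitted       */
    define_label(l, 124);          /* label reached: slot reusable   */
    return 0;
}
----------------------------------------------------------------------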
This now leaves the question of how to represent
conditional jumps in the intermediate code. The first
observation is that user-specified jumps need never be
conditional as they can always be surrounded by appropriate
compiler-generated conditional jumps. This can be used to
restrict the processing of conditions and tests to the
compiler-generated jumps. The second observation is that in
IMP77 conditionals are always associated with the comparison
of two values or the testing of an implied boolean variable
(predicates and string resolution).
There are currently three main ways in which processors
handle this:
1 "compare" instructions are used to set flags or
condition-codes which represent the relationship
between two values (one of which is frequently an
implied value of zero). These condition-codes are
later used to control the execution of conditional
branch instructions. This method is used in the
PDP11 (CMP, BNE etc.).
2 Instructions are provided which compare two values
as above but instead of setting condition-codes
they skip one or more subsequent instructions
depending on a specified relationship. By skipping
unconditional branches in this way conditional
branch sequences may be generated. This method is
used in the PDP10: SKIPE etc.
3 Instructions are provided which compare two values
and branch to a specified label if a given
relationship holds. This method is used in the
PDP10: JUMPNE etc.
P-code uses compare instructions to set the boolean value
TRUE or FALSE on the stack and then uses this value either
as an operand in an expression or to condition a branch (a
variant of technique 1 above).
Z-code tests the value in a register against zero and
branches accordingly (technique 3 above).
These three techniques have fairly obvious possible
representations in I-code:
if X = Y start
1) PUSH X
PUSH Y
COMP {set condition code}
BNE 1 {branch not equal}
2) PUSH X
PUSH Y
SKIPE {compare and skip if equal}
GOTO 1
3) PUSH X
PUSH Y
JUMP # 1 {compare and branch if not equal}
All three of these representations have been tried in
different versions of I-code.
Technique 2) was rejected as it proved cumbersome to
implement effectively, especially on machines which did not
use skips; either the code-generator had to "look ahead" to
be able to locate the destination of the skip (which is
dependent on the instruction being skipped) or to check
before each instruction whether or not a skip had been
processed earlier whose destination had not yet been
resolved.
Technique 1) was perfect for machines with condition-codes
but required look-ahead over subsequent jumps on machines
which used skips.
Both 1) and 2) had the additional problem that to generate
conditional branches two separate I-code instructions had to
be given. In the case of 1) condition-codes are usually
altered by many instructions not directly involved in
comparison and hence the compare and its associated branch
must be made adjacent. With 2) there is the possibility of
generating meaningless constructions such as skipping a
line-number definition instruction. These difficulties add
complexity to the definition of the intermediate code and
require extra checks in the code generator.
Thus the third form was chosen as the most convenient,
even though all three forms can be suitably defined to be
totally equivalent. In particular the third technique
provides all the relevant information to the code-generator
in one instruction, and has proved to be simple and
effective as a basis for generating code for both
condition-code and skip sequences.
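Because the chosen form delivers the comparison and the
destination in one item, a single routine in a code
generator can expand it for either style of machine, as the
following C sketch suggests (the helper names are invented;
the mnemonics are the familiar PDP11 and PDP10 ones):
----------------------------------------------------------------------
#include <stdio.h>

enum cond { EQ, NE, LT, GE };

static const char *cc_branch[]       = { "BEQ", "BNE", "BLT", "BGE" };
static const char *skip_when_false[] = { "CAMN", "CAME", "CAMGE", "CAML" };

static void jump_cc(enum cond c, int lab)   /* condition-code target */
{
    printf("CMP R1,R2\n%s $%d\n", cc_branch[c], lab);
}

static void jump_skip(enum cond c, int lab) /* skip-style target     */
{
    /* skip the unconditional jump when the condition fails */
    printf("%s 1,2\nJRST $%d\n", skip_when_false[c], lab);
}

int main(void)
{
    jump_cc(NE, 1);     /* if X = Y start:  branch to 1 if not equal */
    jump_skip(NE, 1);
    return 0;
}
----------------------------------------------------------------------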
Temporary variables may also be required to implement
certain high-level constructions, such as the IMP for
statement:
-----------------------
| for V = A, B, C cycle |
-----------------------
which is defined so that the initial values of B and C, and
the initial address of the control variable, V, are to be
used to control the loop regardless of any assignments to V,
B and C. While it is possible for a machine-independent
optimiser to discover whether these variables are modified
in the loop or not, in the simple case where little
optimisation is required the code generator must use
temporaries.
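The shape of such code can be sketched in C, assuming the
IMP77 reading of the three expressions (A the initial
value, B the increment, C the final value); the temporaries
make the loop control immune to assignments in the body:
----------------------------------------------------------------------
#include <stdio.h>

static int V, A = 1, B = 2, C = 9;

int main(void)
{
    int *t_v = &V;              /* initial address of the control var */
    int  t_b = B, t_c = C;      /* initial increment and final value  */

    *t_v = A;
    for (;;) {
        printf("body sees V = %d\n", *t_v);
        B = 0; C = 0;           /* cannot disturb the loop control    */
        if (*t_v == t_c)        /* definite loop: exact final value   */
            break;
        *t_v += t_b;
    }
    return 0;
}
----------------------------------------------------------------------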
In the case of expression evaluation, however, the machine
independent phase cannot know how many temporaries will be
required. Even giving the first phase knowledge of the
number of registers available is not adequate for several
reasons. Firstly, the use of registers is commonly tied to
the operations being performed, as in the case of integer
multiply on several machines which requires a pair of
registers. For a machine-independent first phase to be able
to cope with this sort of limitation would require great
flexibility of parameterisation.
Secondly, the first phase would have to be given details of
the problems encountered in statements such as:
----------------------------
| LEFT = REM(A,5) + REM(B,7) |
----------------------------
On a PDP11 equipped with the EIS option a divide instruction
is available which provides both the quotient and the
remainder. Hence the statement could be compiled into:
-----------------
| MOV A,R1 |
| SXT R0 | propagate the sign of A
| DIV #5,R0 | remainder to R1
| MOV B,R3 |
| SXT R2 |
| DIV #7,R2 | remainder to R3
| ADD R3,R1 | add the remainders
| MOV R1,LEFT |
-----------------
In this case no temporary store locations are required.
However, if the EIS option is not present no DIV instruction
is present and so a subroutine must be used instead. The
code becomes:
-----------------
| MOV A,R1 |
| MOV #5,R2 |
| JSR PC,DIV | result back in R1
| MOV R1,T1 | preserve remainder
| MOV B,R1 |
| MOV #7,R2 |
| JSR PC,DIV | result in R1
| ADD T1,R1 |
| MOV R1,LEFT |
-----------------
As the division subroutine uses R1 (for one of its
arguments and to return its result) the result of the first
call on REM must be saved in a temporary, T1. Of course,
the subroutine could be written so as to preserve the value
in, say, R2 and this could be used instead of T1, but this
would increase the cost of REM when it is likely that the
value in R2 will not be of use as most expressions are
trivial [Knuth, 1971].
Unless the machine-independent phase is given intimate
knowledge of the target machine (something of a
contradiction) it cannot know how many temporaries to use
nor when to use them.
The solution adopted by most intermediate codes is to base
the code around a stack, thus providing an unlimited number
of temporaries which are handled automatically. While this
in itself does not hinder the compilation for a machine
without a hardware stack, as the code-generator can always
simulate the stack internally, its presence invariably
results in other parts of the code using it, for example to
pass parameters to procedures where the receiving procedure
contains built-in knowledge of the layout of the stack.
As a stack does not require the explicit mention of
temporaries it has been adopted by I-code, but purely as a
descriptive mechanism. Because I-code does not specify the
computation but the compilation process needed to produce a
program which will perform the computation, this internal
stack need have no existence when the final program
executes.
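The point can be made concrete with a small C sketch
(invented names): PUSH merely stacks a description of where
a value is, and machine code is emitted only when an
operation consumes the descriptions.
----------------------------------------------------------------------
#include <stdio.h>

enum kind { IN_MEMORY, IN_REGISTER, CONSTANT };

struct desc { enum kind k; long value; };  /* address, reg or value */

static struct desc stack[64];
static int sp;

static void push_const(long v)
{
    struct desc d = { CONSTANT, v };
    stack[sp++] = d;
}

static void push_mem(long addr)
{
    struct desc d = { IN_MEMORY, addr };
    stack[sp++] = d;
}

static void add(void)        /* combine the two topmost descriptors */
{
    struct desc rhs = stack[--sp];
    struct desc lhs = stack[--sp];
    /* only now is real code emitted, using whatever registers suit */
    printf("emit: load/add using (%d,%ld) and (%d,%ld)\n",
           lhs.k, lhs.value, rhs.k, rhs.value);
    stack[sp].k = IN_REGISTER; stack[sp].value = 1; sp++;
}

int main(void) { push_mem(0x100); push_const(5); add(); return 0; }
----------------------------------------------------------------------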
The implementors of SIMPL-T describe an intermediate code
with some properties similar to I-code, but based on
"quadruples" of operators and operands rather than an
internal stack [Basili, 1975]. The stack approach was
rejected by them because "quads allow more flexibility in
the design of the code generator since, for example, no
stack is required". The exact meaning of this is not clear
but it suggests the misconception that a stack-based
intermediate code forces a stack-based object code
representation. Regardless of the exact structure of the
code generator or the input it takes, some form of internal
stack is invariably required for operations such as
protecting intermediate values in registers which are needed
for other purposes, and it seems reasonable to make this
stack more explicit if so doing will simplify the
intermediate code and its processing.
The third phase takes as its input two data streams
generated by the second phase. These streams are:
i The object stream, a sequence of items of the form:
<type> <value>* defining the code sequences
required in the object file.
ii The directive stream, a sequence of items defining
the logical structure of the object stream, that is
a specification of label definitions and label
references, and details of various code groupings
(blocks, procedures etc.).
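The two streams might be declared along the following
lines; the layouts are invented for illustration and are
not the actual IMP77 formats.
----------------------------------------------------------------------
#include <stdio.h>

struct object_item {                 /* from the object stream       */
    int  type;                       /* kind of code sequence        */
    int  nvalues;                    /* number of following values   */
    long values[4];                  /* <type> <value>*              */
};

struct directive_item {              /* from the directive stream    */
    enum { DEF_LABEL, REF_LABEL, BEGIN_BLOCK, END_BLOCK } op;
    int  label;                      /* label number, if relevant    */
    long offset;                     /* place in the object stream   */
};

int main(void)
{
    printf("%zu %zu\n", sizeof(struct object_item),
                        sizeof(struct directive_item));
    return 0;
}
----------------------------------------------------------------------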
The third phase starts by taking in the directive stream and
constructing a linkage map describing the whole program.
This linkage map is processed and then used to control the
generation of the final object file from the object stream.
The operations performed using the map are:
4.7.1 Reordering
As discussed previously, there are several gains to be
made by having the ability to output instructions in an
order different from that in which they were implied by the
linear structure of the source program. This reordering is
performed on the linkage map in a manner controlled by items
in the directive stream. In the most simple case of
exbedding procedures (section 4.3.1) this only entails
allocating code addresses to the items in the map each time
an "end-of-block" control item is input, resulting in the
procedures being laid out in "end" order.
To facilitate evaluating references to the reordered areas,
all references in the object stream are made relative to the
start of the appropriate area.
As this process does not cause the physical moving of the
various areas there is an implicit assumption that either
the subsequent processing of the object stream can do the
reordering (for example by writing its output to specific
sections of a direct-access file), or that the object file
format can instruct the loader or linker to do the
shuffling.
With the linkage map available it becomes possible to
make a preliminary pass over the object stream performing
structural modifications which require knowledge of the
generated code and which alter its size and general
appearance. These modifications may be made by passing the
object stream through a buffer which is scanned and modified
under the control of the linkage table. In this way merging
common code sequences and reordering the arms of conditional
sequences may be achieved quite simply.
In typical programs the frequency of occurrence
of such jumps is:
             PDP11    PE3200
         --------------------
  2 byte |   88%      28%   |
  4 byte |    8%      71%   |
  6 byte |    2%      <1%   |
         --------------------
It has been suggested [Brown, 1977] that the
problem of deciding which form of jump to use can
be eased on certain machines by specifying a
"distance" parameter with the intermediate code,
e.g. "GOTO LAB,80" informing the code generator
that the label LAB is 80 instructions ahead.
It is difficult to think of any case in which this
could be of any use as it requires the code
generator to be able to predict the amount of
target machine-code which will be generated for
each intermediate code instruction.
The solution adopted by the IMP compilers has
been for the code generator to assume that all
jumps are the minimum size, and to let the third
phase stretch them where necessary.
The Perkin-Elmer CAL assembler [Interdata, 1974]
makes the opposite assumption, namely that jumps
are long until proven short. This was rejected as
the size of one jump is often dependent on another,
so that one of them will be short if and only if
both of them are short. By assuming them long
either they will never be found to be short, or the
process will have to examine all the jumps
repeatedly trying each jump in turn to see if it
can be "squeezed". Commonly enabling the "SQUEZ"
option in the CAL assembler can double or treble
the time to assemble programs.
With the assumption that all jumps start short and
then grow, all truly short jumps will be found with
no possibility of infinite loops, as the process
must terminate, in the worst case when all the
jumps have been made long.
Several methods for achieving this optimisation
have been described [Szymanski, 1978; Williams,
1978].
The technique used by the third phase of the IMP77
compilers for stretching jumps is as follows.
Once the linkage map has been constructed and
addresses provisionally allocated, all labels and
references to them are grouped according to the
block in which they occurred. This is to take
advantage of the fact that most references will be
local. A procedure STRETCH is now defined which
repeatedly examines each reference within a
particular group. If a reference is found which
must be stretched, the entry in the linkage map is
updated and all subsequent entries are suitably
modified to take account of the increased size of
the code. The process is repeated until no
alterations have been made.
STRETCH is first called once for each group of
references in the program. This "local stretch"
commonly resolves up to 80% of the references. A
final call on STRETCH is then made with all the
references lumped together as one group in order to
resolve references between blocks, and any local
references which, although processed by the local
stretch, have become invalidated by changes made by
the "global stretch".
The use of a local and a global stretch has a
considerable effect on the performance of the
compiler: if the calls on "local stretch" are taken
out, "global stretch" has to do all the work in
ignorance of the block-structure of the labels.
This involves repeated searching of the complete
label and reference lists in order that changes in
the position of these items may be recorded. On
the Interdata 7/32 this increases the stretching
time for 1968 branches from 2.3 seconds out of a
total compilation time of 146 seconds up to 35
seconds!
The time taken to perform the stretching using both
local and global stretch is on average just over 1%
of the total compilation time excluding the time
for input and output.
Wulf et al. describe an optimisation on the PDP11
which attempts to shorten otherwise long
conditional jumps by making them jump to suitable
jumps to the same destination as this is smaller
and faster than the six byte instructions which
would be generated by default [Wulf, 1975]. This
was tried but eventually removed from the PDP11
compiler as finding suitable jumps was a tedious
task and, of the average 2% of jumps which were long
in compiling many programs, only one case was found
where the optimisation could be applied. That case
was in a program specially constructed to test the
optimisation.
At the same time that jumps and labels are being
processed, certain operations which depend on the
flow of control may be inserted into the code.
The GEC 4080 provides a good example of this
problem which can be handled elegantly by the third
phase.
The machine provides arithmetic instructions which
take either fixed point or floating point operands
depending on the state of a processor status bit.
This bit must be altered by the instructions SIM
(Set Integer Mode) and SFM (Set Floating Mode).
During code generation when a label is encountered
the state of the status bit will not in general be
known, and so a suitable mode switching instruction
will need to be planted; frequently this
instruction will be redundant. Given the presence
of the third phase, the second phase merely needs
to mark jumps with the current state of the bit,
and to mark labels with the required state (and the
previous state of the bit if control can "fall
through" past the label). During the process of
expanding jumps these mark bits can be checked. If
all references to a label have the same mode, no
action needs to be taken, but if the bits differ
the appropriate instruction must be added. As an
extra improvement if only one jump to a label is
from the wrong mode, the mode switching instruction
can be planted before that jump rather than after
its destination label, so shortening the execution
paths when no change is required.
ii Conflating jumps to jumps.
Nested conditional structures in high-level
languages often generate jumps which take control
directly to another jump. If the second jump can
be shown always to be taken whenever the first is,
the latter can be redefined as jumping directly to
the destination of the former.
e.g. -----------------------------------
| while N > 0 cycle |
| N = N-1 |
| if N > 5 then TEST1 else TEST2 |
| repeat |
-----------------------------------
In this program following the call on TEST1 the
else causes a jump to be taken to the next
statement (repeat). This statement is simply a
jump back to the previous cycle.
Hence the following code can be generated (3220):
-------------------
| $1: L 1,N |
| BLE $2 | exit the loop if N<=0
| SIS 1,1 |
| ST 1,N |
| CHI 1,5 |
| BLE $3 |
| BAL 15,TEST1 |
| B $1 |
| $3: BAL 15,TEST2 |
| B $1 |
| $2: |
-------------------
The danger with this optimisation is that an
otherwise short jump can be expanded to a long jump
as the following program demonstrates:
----------------------
| if X = 1 start |
| if Y = 1 start |
| {A} |
| else |
| {B} |
| finish |
| else |
| {C} |
| finish |
----------------------
The else following the sequence {A} causes a jump
to the next else which jumps past the finish. In
that form the first jump only has to skip (B) and
is likely to be a short jump. If it is made to
jump directly to the second finish it has to cover
{B} and {C}, so reducing the chances of its being
short.
Equally, the position can be reversed resulting in
the optimised jump being short when the original
was long. If this problem is considered serious
the third phase can check the sort of jump which
would be generated and act accordingly.
iii Removal of jumps round jumps.
Statements such as:
------------------
| ->LABEL if X = Y |
------------------
are common, either in the explicit form as given
above or in some higher-level representation such
as:
---------------
| exit if X = Y |
---------------
The simple code sequence generated for this would
be similar to (3220):
--------------
| L 1,X | pick up X
| C 1,Y | compare with Y
| BNE $1 | branch not equal
| B LABEL | jump to LABEL
| $1: |
--------------
By combining the two branches the code can be
reduced to:
--------------
| L 1,X |
| C 1,Y |
| BE LABEL |
--------------
While it is possible for the code generator to do
this immediately it was decided to leave the
optimisation to the third phase for four reasons:
1 The third phase can perform this optimisation
simply, almost as a side-effect of
constructing the linkage map.
2 There are several cases where the optimisation
can be extended in ways which would be awkward
for the second phase to deal with. In
particular, it would have either to look ahead
or to be able to modify code sequences already
generated. With a third phase, however, the
optimisation reduces to a straightforward
inspection of the linkage map. For example:
being executed and so will need to have a jump round them or
will have to be planted in a "hole" in the code, that is
between an unconditional jump and the next label. As holes
occur frequently in high-level languages (for example
following every else or repeat) and do not require extra
code to be planted round the constants, they must be the
preferred position for the constants. In order to minimise
the number of constants planted it is necessary to delay the
dumping of them until the last possible moment, placing them
as near as possible to the forward limit of the
addressability of the first outstanding reference. This
increases the chance of a subsequent reference to the
constant being able to address the previously planted copy.
This poses problems if the second phase is to handle the
constants as it cannot know which is the optimum position
for the constants in advance of producing the code
(especially if the code is to be reordered).
A convenient solution is to utilise the linkage table in the
third phase and include in it references to constants and
the locations of holes and "forced" holes, that is places
where an extra jump is required.
Following the initial resolution of jumps (4.7.?) the list
of constants can be examined and holes allocated. The
labels are processed again to take account of the extra code
and any alignment limitations. During the processing of the
object stream the constants are infiltrated into the object
file.
levels of code production and machine-specific optimisation
were appreciably different.
This gave rise to three convenient properties with regard to
testing and development:
i An error in one compiler will frequently give
notice of similar faults in others. Clearly, any
faults in the common first phase will be present in
all the compilers and only one correction will be
required.
ii An improvement in the performance of one compiler,
or the code it generates, can suggest similar
improvements in others.
iii The third effect on reflection seems obvious yet
was noted with some surprise. The systems on which
most of the investigation was done are run with
very different operating systems and used by
different types of user. These two factors
together caused a great spread in the demands
placed upon the compiler, resulting in more parts
of the compiler being thoroughly tested than would
happen when running on one particular system, where
users tend to be more stereotyped. Questions of
"proper practice" aside, it is a fact of life that
all software gets a better testing in the field
than at the hands of its creator.
5.3 Diagnostics
As mentioned previously optimisation is not just a
process of improving the storage requirements and speed of a
program but also involves fitting a program into the overall
framework of the run-time environment. In many applications
the provision of extensive run-time checks and post-mortem
traces can be of great importance. The ability to generate
such diagnostic code has certain implications for the
features in the intermediate code.
5.3.1 Line numbers
When producing information about the state of a
computation, whether it be an error report following a
run-time fault or an execution trace [Satterthwaite, 1972],
the data must be presented in a form which is meaningful to
the user in terms of the source program. The
commonly-provided dump of the machine state, registers, code
addresses etc., is a complete failure in this respect, as
the correspondence between this and the program state
depends on the workings of the compiler and other factors of
which the user should not need to be aware.
The simplest way of specifying the point of interest in a
program is to give its line number. There are two common
techniques for providing line number information at
run-time, the choice of which depends on the uses to which
the compiler is to be put.
The first is to plant instructions which dynamically update
a variable with the current line number whenever it changes.
This has the significant advantages that it is extremely
cheap to implement and the line number is always immediately
available. Its obvious disadvantages are that it increases
the execution time for the program, and more significantly
it increases the size of the program, typically by about 6K
bytes on the Interdata 7/32 for a 1000 line program,
approximately a 50% increase.
The second technique is to build a table giving the
correspondence between line numbers and the addresses of the
associated code sequences. While this imposes a greater
burden on the compiler and takes more time to extract the
line number it has the advantage that it does not increase
the code size of the program, nor does it alter its
execution speed. Indeed it may even be possible to keep the
table out of main memory until it is required.
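Extracting a line number under the second technique is then
a simple search of the table; the following C sketch uses
an invented layout:
----------------------------------------------------------------------
#include <stdio.h>

struct entry { long addr; int line; };

static const struct entry table[] = {
    { 0x100, 15 }, { 0x10c, 16 }, { 0x124, 17 }, { 0x15a, 18 },
};

static int line_of(long fault)
{
    int lo = 0, hi = (int)(sizeof table / sizeof table[0]) - 1, mid;
    while (lo < hi) {               /* last entry with addr <= fault */
        mid = (lo + hi + 1) / 2;
        if (table[mid].addr <= fault) lo = mid; else hi = mid - 1;
    }
    return table[lo].line;
}

int main(void) { printf("line %d\n", line_of(0x130)); return 0; }
----------------------------------------------------------------------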
The choice of technique will have implications on the
compiled code. If the line number table approach is used
error procedures must have available the address of the
point of the error. The effects of this can be seen in the
following example of the sort of code generated for
unassigned variable checking on the 7/32 using both methods:
-------------
| 17 X = Y |
-------------
-------------      -------------
| LHI 0,17  |      |           |  update line no
| ST 0,LINE |      |           |
| L 1,Y     |      | L 1,Y     |
| C 1,UV    |      | C 1,UV    |  check value
| BE ERROR  |      |           |  give the error
|           |      | BAL 8,TU  |  test for error
| ST 1,X    |      | ST 1,X    |
|           |      | .. ..     |
|           |      | .. ..     |
|           |      |TU: BNER 8 |  return if OK
|           |      | B ERROR   |  give the error
-------------      -------------
As the generated code depends on the method in use it cannot
be specified in the intermediate code and so the latter must
simply indicate the points in the program at which the line
number changes.
those which on detection cause the normal flow of control to
be interrupted, and those which simply make the knowledge of
the occurrence of the error available to the program, for
example by setting a condition-code bit. For the purposes
of this discussion the second form of hardware-detected
error may be considered an error which is not detected
automatically as it still requires explicit instructions to
test for the error and to transfer control accordingly.
Clearly the more errors that fall into the automatic
category the better, as they do not cause the user's program
to grow with sequences of instructions which, in a correct
program, will always be testing for conditions which never
arise.
These differences complicate the design of intermediate
codes as the classification differs from machine to machine:
on the VAX all forms of overflow can be made to generate
automatic interrupts, but the PDP11 only sets a
condition-code bit on some overflows.
There are two basic ways of handling this in the
intermediate code: firstly the code can contain explicit
requests for the checks to be performed, and secondly the
code can be designed in such a way as to give the
code-generator enough information to be able to decide where
checks are necessary.
Two specific examples can indicate which of these ways
should be adopted.
Testing for arithmetic overflow is currently handled by
machines in three main ways:
1. An interrupt is generated whenever overflow occurs.
This is by far the best method as it requires no
overheads in the checked code.
2. A bit is set on overflow and is only cleared when
it is tested. This requires explicit checks in the
code but several tests may be conflated into a
single test at an appropriate point, for example
before the final result is stored.
3. A bit is set on overflow, but is cleared by the
next arithmetic operation. This again requires
explicit checking code but the tests must be
inserted after every operation.
For the intermediate code to indicate where overflow
testing is to be performed it would have to choose the
worst case from the three above, namely case 3. This
would result in a test being requested after every
arithmetic instruction, a test which may just as well be
included in the definition of the instructions
themselves.
The other area of low-level testing is in implied type
conversions such as storing from a 32-bit integer into a
16-bit integer. The VAX provides an instruction which
combines the test for truncation with the store (CVTLW).
The 7/32 has an instruction (CVHR) which can test the
value before assignment, and the 4/75 can most
efficiently test following the assignment (CH).
If the request for the check is a separate intermediate
code item, the 7/32 case is simple but the other
machines will require much more work to be able to
generate the efficient check. The problem can be
simplified by introducing new assignment instructions
which also perform the test, but this adds many new
instructions to the code as one instruction will be
required for every valid combination of types and every
sort of assignment.
The high-level checks such as array bound checking are
usually so complicated that the most efficient
implementations depend greatly on the particular
hardware, so much so that it would be foolish to attempt
to express them in the intermediate code.
The simplest solution is to ensure that the intermediate
code provides enough information to let the code
generator decide where and what checks are necessary.
The inclusion of checks against the use of unassigned
variables provides a good example of the power of
leaving the checking to the code-generator. In a
simple-minded approach the code-generator tests every
suitable value loaded from store. A minor improvement
to this is to mark the descriptor for every local
variable in a block when it is first assigned,
inhibiting the marking after the first jump.
Subsequently, marked objects need not be checked.
A much better improvement may be obtained by making a
trivial extension to the register remembering mechanism.
If an object is 'known' it must have been used
previously, and hence it will have been checked if
necessary. Even after the register which held the value
of the object has been altered, and hence the
association between the register and the object lost, if
the compiler remembers that the value was known it can
suppress any unassigned checks on future references.
At this point a useful property of IMP77 may be used to
great effect: once a variable has been assigned it
cannot become unassigned. This is not true in many
languages, as for example, in ALGOL60 the control
variable of a for loop is undefined (unassigned) at the
end of the loop. This means that in IMP77 the 'was
known' property of variables may be preserved across
procedure calls, even though all the register content
information must be forgotten.
This technique when applied on the 7/32 compiler results
in a reduction of 33% in the code required for checking.
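The refinement can be sketched in C (invented names):
losing the register association forgets where the value is
but, as IMP77 variables cannot become unassigned again, not
that it has been checked.
----------------------------------------------------------------------
#include <stdio.h>

struct var {
    int reg;            /* register holding the value, or -1         */
    int was_known;      /* value has been seen: check not needed     */
};

static void load(struct var *v, const char *name)
{
    if (v->reg < 0) {
        printf("L 1,%s\n", name);
        if (!v->was_known) printf("check unassigned\n");
        v->reg = 1; v->was_known = 1;
    }                   /* else: value already in register v->reg    */
}

static void procedure_call(struct var *v)
{
    v->reg = -1;        /* register contents forgotten across call,  */
                        /* but was_known survives: no re-check       */
}

int main(void)
{
    struct var x = { -1, 0 };
    load(&x, "X");      /* first use: checked                        */
    procedure_call(&x);
    load(&x, "X");      /* reloaded but NOT re-checked               */
    return 0;
}
----------------------------------------------------------------------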
While it is possible for the unassigned checks to be
placed in the intermediate code and for the first phase
to remove redundant checks, this suppression would
require a duplication of the remembering logic which
must, in any case, reside in the machine-dependent
phase.
phases, but it removes the optimisations from the low-level
details of code production and provides a means for
separating the machine-independent and machine-dependent
optimisations. In particular, in the same way as much of the
code generator can be built from a standard "kit" with a few
special machine-specific parts, so the global optimiser can
utilise code from other optimisers.
The way in which the optimiser can influence the
operation of the code generator is by making use of the fact
that the intermediate code does not describe a computation
but a compilation process. This compilation is driven by
the descriptors which are normally translated by the code
generator from the machine-independent form in the I-code
into the appropriate machine-dependent representation,
reflecting the target machine architecture: registers,
stacks, memory etc. By short-circuiting this translation a
global optimiser can force the use of specific machine
features.
For example consider the following fragment of an integer
function:
-------------------
| integer X |
| X = A(J) |
| X = 0 if X < 0 |
| result = X |
-------------------
The standard I-code produced for this fragment would have
the form:
------------------------
| DEF 12 "X" INTEGER |
| SIMPLE DEFAULT NONE |
| NONE |
| PUSH 12 | - X
| PUSH 6 | - A
| PUSH 7 | - J
| ACCESS |
| ASSVAL |
| PUSH 12 | - X
| PUSHI 0 |
| COMP >= 1 |
| BGE 1 |
| PUSH 12 | - X
| PUSHI 0 |
| ASSVAL |
| LOC 1 |
| PUSH 12 | - X
| RESULT |
------------------------
On the PDP11 the code generated for this could be:
-------------------
| MOV J,R2 |
| ADD R2,R2 | Scale the index
| ADD A,R2 | Add in ADDR(A(0))
| MOV (R2),X | X = A(J)
| BGE $1 | ->$1 if X >= 0
| CLR X | X = 0
| $1: MOV X,R1 | assign result register
| {return} |
-------------------
Here the obvious optimisation is to note that the local
variable, X, is eventually to be used as the result of the
function and so needs to end up in register 1.
By changing the definition of X in the I-code into:
------------------------------------------------
| DEF 12 "X" INTEGER SIMPLE DEFAULT NONE SPECIAL R1 |
------------------------------------------------
and making no other changes, the code generator will produce
code of the form:
--------------------
| MOV J,R2 |
| ADD R2,R2 |
| ADD A,R2 |
| MOV (R2),R1 |
| BGE $1 |
| CLR R1 |
| $1: {return} |
--------------------
As this process necessitates the I-code becoming more and
more intimately involved with the structure of the target
machine, in that it starts referring directly to registers
and the like, it is necessary that a new control item be
added so that the code generator may be prevented from
pre-empting resources which the optimiser is manipulating.
The new item is RELEASE and it is used in conjunction with
the definition of machine-dependent descriptors. When such
a descriptor is introduced (using DEF) the associated target
machine component is considered to have been claimed and may
only be used in response to explicit direction from the
I-code. On receipt of the corresponding RELEASE the
component is once again made available for implicit use by
the code generator (for temporaries etc.). This mechanism
is an exact parallel to the way in which memory locations
are claimed by the definition of descriptors and released by
the END of the enclosing block.
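The parallel between the two mechanisms might be rendered
as follows (a C sketch with invented names; a bitmask
stands for the pool of components the code generator may
use for temporaries):
----------------------------------------------------------------------
#include <stdio.h>

static unsigned free_regs = 0xFF;        /* R0..R7 available         */

static void def_special(int reg)         /* DEF ... SPECIAL Rn       */
{
    free_regs &= ~(1u << reg);           /* claimed: off limits for  */
    printf("R%d reserved for the optimiser\n", reg); /* implicit use */
}

static void release(int reg)             /* RELEASE                  */
{
    free_regs |= (1u << reg);            /* usable as a temporary    */
    printf("R%d back in the pool\n", reg);
}

int main(void) { def_special(1); release(1); return 0; }
----------------------------------------------------------------------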
The main assumption about this style of optimisation is
that the code generator has the ability to generate any
required instruction, provided that the pertinent
information is available at the required time.
As an example, the VAX 11/780 provides addressing modes in
which the value in a register may be scaled and added into
the effective operand address before the operand is used,
hence the following code:
------------------------
| integerarray A(1:9) |
|- -|
| A(J) = 0 |
| |
| MOVL J,R5 | pick up J
| CLRL 12(R3)[R5] | A(J) = 0
------------------------
The operand address generated by the CLRL instruction is:
------------------
| 12+R3 + R5*4 |
------------------
as there are 4 bytes (address units) to a longword.
This instruction can be generated naturally during the
non-optimised evaluation of array subscripts, and so the
optimiser can assume that the index mode of operand will be
used whenever a register operand is specified as an array
index.
The approach has the added advantage that in the worst
case, when the code generator does not produce the
instructions the optimiser hoped for, then as long as the
optimised I-code still describes the required compilation
the code generator will simply produce a more long-winded,
but equally valid, version of the program. In other words,
as long as some
choice is available and some temporary objects are left at
the disposal of the code generator, the optimiser cannot
force it into a state where working code cannot be produced.
In the example above even if the code generator does not
produce index mode operands, it can still generate sequences
of the form:
----------------------
| MULL3 R5,#4,R1 | R5*4 -> R1
| ADDL2 R3,R1 | R3+R1 -> R1
| CLRL 12(R1) | 0 -> (12(R1))
----------------------
6.2 Performance
The figures in appendix A3 are the results of measuring
the effect of various optimisations on the Interdata 7/32
and the DEC PDP 11/45.
One problem in choosing programs to be measured is that
heavy use of particular language features will increase the
overall effect of certain optimisations.
As a trivial example of this consider the following
"program":
--------------------------------
| begin |
| integerarray A(1:1000) |
| A(1) = 0 |
| endofprogram |
--------------------------------
With all array optimisations enabled, on the 7/32 this
generates 30 bytes of code, whereas without the optimisation
it results in 170 bytes of code, largely due to the
procedure for declaring the array.
Clearly a reduction of 82% is not to be expected on more
typical programs.
Similarly the absence of features will bias the results.
In particular the smaller programs will not demonstrate the
power of the optimisations which only take effect when
various size limits have been exceeded: the most obvious
such limits being addressing restrictions caused by the size
of address fields in instructions.
The major difficulty in producing results which are of
any real value is that the effects of the optimisations
depend on the individual style in which the programs under
consideration were written. Inevitably users get a "feel"
for the sort of statement for which the compiler generates
good code and they modify their style of programming
accordingly. If at some stage in its development a compiler
produces poor code for a particular construction users will
tend to avoid that construction, even long after the
compiler has been improved and can compile it effectively.
This well-known phenomenon [Whitfield, 1973] argues strongly
that users should never see the object code generated by the
compilers they are using.
The effects of many optimisations are difficult if not
impossible to measure with any degree of accuracy as they
interact with other optimisations to a great extent. The most
obvious interaction is that between the size of jump
instruction required and most of the other optimisations.
The size of jump is determined by the amount of code
generated between the jump and the label it references. If
any other optimisation is inhibited this volume of code is
likely to increase, decreasing the chances of being able to
use the shorter forms of the jump.
Some optimisations depend almost totally on others; it is
unlikely that the optimisation of reducing or removing the
entry and exit code sequences associated with procedures
(section 4.5.1) would have much effect if the parameters
were not passed in registers and references to them in the
procedures were replaced by references to those registers.
In particular, it must be noted that it is always possible
to generate programs which will benefit greatly from those
optimisations which do not appear to be of much use from the
figures given. However, the test programs used to derive
the figures are typical of the programs processed by the
compiler, and it is hoped that they give a more realistic
and balanced view of the improvements which may be achieved
in 'real' cases.
Under some circumstances it may be advantageous to apply
all optimisations even though some may appear to give little
benefit, since this 'squeezing the pips' frequently removes
one or two instructions from critical loops in a program.
Yet again this shows the difficulty in quantifying the
usefulness of optimisations as they are so dependent on the
particular circumstances.
One area of measurement has been deliberately omitted
from the figures, namely the effect on execution time of the
optimisations. This was for several reasons:
1. On the systems used it was impossible to get
reliable timing measurements with any accuracy
greater than about plus or minus 5%.
2. For the reasons given previously, many programs
could benefit from fortuitous optimisations which
could not be expected every time.
3. Programs which would run long enough to improve the
accuracy of the measurements invariably lost this
accuracy through spending much time in the
system-provided procedures, mainly for input and
output. This point in particular suggests that as
the overhead is beyond the control of the general
user, the savings in code space may be much more
important. Even with ever-growing store sizes,
virtual memory systems will continue to treat
smaller programs better than larger ones.
4. Some of the optimisations, particularly passing
parameters in registers, prevent the compiled
program from running, unless the controlling
environment is modified in a parallel way. This
would invalidate the timings as the environment is
not usually under the control of the compiler.
From the crude measures which were obtained there is a
suggestion that the decrease in execution time roughly
parallels the decrease in code size.
6.3 Cost of optimisation
The cost of an optimisation is, in general, very
difficult to measure, as may be seen by considering the
three relevant areas: compile time, space requirement, and
logical complexity.
6.3.1 Compile time
In order to generate good code, the compiler must spend
time looking for the cases which are amenable to
improvement. If no optimisation is performed this time is
not used and so the compilation should take less time.
However, the non-optimised version commonly requires the
production of more code than the optimised version,
frequently over fifty percent more when comparing fully
diagnostic code with fully optimised code. On all the
compilers written so far, the time saved by not having to
generate these extra instructions more than outweighs the
time spent in deciding not to generate them.
Maintaining multiple sets of information about each
register and searching for particular values will
probably rule out such extended remembering
optimisations.
Perhaps a surprising result is that the PDP11 on
average gains about as much from this optimisation
as the 7/32.
This is the result of two interacting effects.
Firstly, the 7/32 dedicates up to five registers to
address local variables in the last five levels of
procedure nesting and locks three for other fixed
purposes, leaving about ten for intermediate
calculations. The PDP11, however, uses a display
in store to access intermediate levels and has to
load the address of a particular level each time it
is required. In addition the PDP11 implementation
fixes the use of four registers, leaving only four
for intermediate calculations.
Secondly, the 7/32 needs to use at least one
register to move values around while the PDP11
often requires none.
These two effects give a fairly large number of
transient values in the registers of the 7/32, and
a smaller number of more frequently used values
(addresses) in the registers of the PDP11. On
average it appears that the number of times
necessary values are found is roughly equal in the
two cases.
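To make the mechanism concrete, the following sketch (my own
illustration, not the IMP77 code generator; the mnemonics and
the eviction policy are assumed) shows the essential
bookkeeping: before emitting a load, the registers are searched
for one already remembered as holding the wanted value, and a
store invalidates any register remembered as holding the old
contents of the stored variable.

    class Remember:
        def __init__(self, nregs):
            self.held = {}                  # register -> value held
            self.free = list(range(nregs))

        def load(self, value, emit):
            # Search the registers before emitting a load.
            for reg, val in self.held.items():
                if val == value:
                    return reg              # found: no code at all
            reg = self.free.pop(0) if self.free else self.evict()
            emit(f"L {reg},{value}")
            self.held[reg] = value
            return reg

        def evict(self):
            reg = next(iter(self.held))     # crude choice of victim
            del self.held[reg]
            return reg

        def store(self, value, reg, emit):
            # Forget registers remembered as holding the old value
            # of the stored variable.
            for r in [r for r, v in self.held.items() if v == value]:
                del self.held[r]
            emit(f"ST {reg},{value}")
            self.held[reg] = value

    code = []
    regs = Remember(4)
    regs.load('X', code.append)     # emits 'L 0,X'
    regs.load('X', code.append)     # second use: no instruction emitted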
6.4.7 Merging
The large difference between the effect of
forward merging on the 7/32 and the PDP11 is mainly
due to the addressing modes available on the
machines.
On the PDP11 statements of the form "A=B" can be
compiled into a single instruction "MOV B,A",
ignoring any extra instructions which may be needed
to make A and B addressable. However, on the 7/32
all values must be moved via the registers,
resulting in two instructions for the same
statement:
-------------
| L 1,B |
| ST 1,A |
-------------
Hence the following code:
-------------------------------
| if X=0 then Y=1 else Y=12 |
-------------------------------
7/32 PDP11
------------------- -------------------
| L 1,X | TST X |
| BNE $1 | BNE $1 |
| LIS 2,1 | |
| ST 2,Y | MOV #1,Y |
| B $2 | BR $2 |
| $1:LIS 2,12 | $1: |
| ST 2,Y | MOV #12.,Y |
| $2: | $2: |
------------------- -------------------
With the 7/32 code, merging can reduce the sequence
by one instruction, a "STore", while with the PDP11
no such improvement is possible.
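The transformation itself is easy to state: when the two arms of
a conditional end with the same instructions, that common tail
can be emitted once after the join point. The following sketch
is purely illustrative (it assumes the arms are available as
lists of finished instructions, which a real code generator
would arrange differently):

    def merge_arms(then_code, else_code):
        # Peel identical trailing instructions off both arms; they
        # can be emitted once after the join label instead.
        merged = []
        while then_code and else_code and then_code[-1] == else_code[-1]:
            merged.insert(0, then_code.pop())
            else_code.pop()
        return then_code, else_code, merged

    t, e, tail = merge_arms(["LIS 2,1", "ST 2,Y"],
                            ["LIS 2,12", "ST 2,Y"])
    # t = ['LIS 2,1'], e = ['LIS 2,12'], tail = ['ST 2,Y']: the STore
    # is emitted once after the join, the one-instruction saving
    # noted above for the 7/32.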
As the techniques for merging and delaying are quite
expensive, but not complicated, and have a major influence
on the design of the code-generator, the small gains achieved
are probably not worth the trouble, unless the last drop of
efficiency is required at all costs.
6.5.2 I/O overhead
One of the disadvantages of dividing a compiler into
several distinct phases is that it results in an additional
cost in communicating between consecutive phases. As
discussed in section 5.1 this cost depends on the system
running the compiler. Even in the worst case, where
communication is achieved using conventional files, the
overhead need not be too serious.
The time spent doing input and output on the Interdata 7/32
compiler is about 27% of the total compilation time, and for
the PDP11 is about 22%, breaking down as follows:
-------------------------------------------------------
| |
| ---- ---- ---- |
| | | | |---->| | |
| Source ----> | P1 |---->| P2 | | P3 |----> Object |
| | | | |---->| | |
| ---- ---- ---- |
| |
| 7/32: 7% 7% 10% 3% |
| (4%) (4%) (5%) (3%) |
| |
| PDP11: 9.4% 11% 0.6% 0.5% |
-------------------------------------------------------
The figures in parentheses give the percentage of time taken
when the input and output requests are made directly to the
file manager rather than via the standard subsystem
procedures, thus reducing the internal I/O overhead to about
10% of the total compilation time.
One important gain in using such intermediate codes is
that they can ease the difficulties associated with
maintaining a number of compilers for different machines,
when those compilers are self-compiling.
For several reasons it may not be desirable to permit sites
to have the source of the machine-independent phase:
commonly to give freedom of choice for the form of the
language in which that phase is written and to prevent local
"improvements" which rapidly lead to non-standard language
definitions. In such cases the intermediate-code generator
can be maintained at one site and updated versions can be
distributed in the intermediate code form without fear of
compromising the quality of the object code generated from
it.
Such a technique is currently being used in the production
of portable SIMULA compilers [Krogdahl, 1980].
6.5.4 Flexibility
At some stage in producing a compiler the needs of the
end user must be considered. The flexibility afforded by
the high-level nature of the intermediate code allows the
compiler to be adapted to fit its environment. If the
compiler is to be used for teaching, the quality of the code
it produces can be sacrificed for compilation speed and
high-quality diagnostics, particularly as compilation time
may well be an order of magnitude greater than the execution
time. Indeed, many of the programs will fail to compile and
"add byte" instruction. There are "add immediate
short" and "subtract immediate short" instructions
but multiply, divide, and, or etc. can not have
short immediate operands.
ii The instructions should be consistent, that is,
logically similar instructions should behave in
similar fashions.
Again, on the Perkin-Elmer 3200:
Load fullword (L) and load halfword (LH) set the
condition code but load byte (LB) does not.
Most register-store instructions can be replaced by
a load of the appropriate type followed by a
register-register instruction: e.g.
-------------- --------------
| CH 1,X | LH 0,X |
| | CR 1,0 |
-------------- --------------
both result in the same setting of the condition
code, but
-------------- --------------
| CLB 1,B | LB 0,B |
| | CLR 1,0 |
-------------- --------------
could result in different settings of the condition
code, as CLR compares two unsigned 32-bit quantities
whereas CLB compares a zero-extended byte from
store with the zero-extended least significant byte
of register 1. For consistency either compare
halfword (CH) should use the sign-extended least
significant half of the register, or, better, CLB
should not tamper with the value in the register.
iii Complex instructions should be avoided. There are
two reasons for this. Firstly, it is easier for a
compiler to break down statements into simple
operations than it is to build them up into complex
ones [Stockton-Gaines, 1965]. Secondly, if the
complex instructions do not perform the exact
function required by the language more instructions
will be needed to "prepare" for the complex
instruction and to "undo" its unwanted effects. As
an example, the DEC VAX11/780 is full of complex
instructions which at first glance seem well-suited
to high-level languages but which on closer
inspection are not so useful. A CASE
instruction is provided which indexes into a table
of displacements and adds the selected value to the
program counter. This would seem ideal for
compiling SWITCH jumps. Unfortunately, as the
table of displacements follows the CASE instruction
it would be very expensive to use it each time a
jump occurred using a particular switch. Instead
all references to the switch must jump to a common
CASE instruction. Even this does not help, as in
the event of an attempted jump to a non-existent
switch label the diagnostics or the event mechanism
will see the error as having occurred at the wrong
place in the program. Although this problem can be
"programmed around" it turns out that it is faster
to implement switches using sequences of simpler
instructions.
iv Machine designers should investigate carefully the
full consequences of building-in special fixed uses
of machine features. One of the best examples of a
clear oversight which causes grief to compiler
writers is found in the DATA GENERAL NOVA
multiplication instruction. This instruction
multiplies the value in register 1 by register 2
and places the double-length result in registers 0
and 1. As only registers 2 and 3 may be used for
addressing, and as register 3 is always used for
subroutine linkage, it follows that register 2 must
be used for addressing the local stack frame, but
this is exactly the register which must be
corrupted in order to use the multiply instruction!
Although specific machines have been used in the examples
similar problems abound in all machines. Indeed it is clear
that machines are most commonly designed for programmers
writing in assembler or FORTRAN, and furthermore writing
their programs in a particular style.
While it is clear that the problems could be called "mere
details" and that they are not difficult to surmount, it
remains that they complicate otherwise simple
code-generation algorithms, making compilers larger, slower,
and correspondingly more difficult to write, debug, and
maintain.
In conclusion it appears that the machine most suited to
supporting high-level languages should have a small but
complete set of very simple instructions, their simplicity
permitting rapid execution and great flexibility.
area, or created specially. Items on the stack
are modified by intermediate code control items
to reflect operations specified in the source
program. Such modifications may or may not
result in code being generated. From the point
of view of this definition, stack elements are
considered to have at least three components:
i Type
ii Value
iii Access rule
The "Access rule" defines how the "Type" and
"Value" attributes are to interpreted in order to
locate the described object.
For example, the access rule for a constant could
be "Value contains the constant" while for a
variable it could be "Value contains the address
of the variable". Clearly, the access rules are
target-machine dependent. Descriptors may be
combined to give fairly complex access rules, as
in the case of applying "PLUS" to the stack when
the top two descriptors are for the variable X
and the constant 1 resulting in one descriptor
with the access rule "take the value in X and add
1 to it". The complexity of these access rules
may be restricted by a code-generator. In the
example above code could be generated to evaluate
X+1 resulting in an access rule "the value is in
register 1", say.
The importance of the code not describing the actual
computation which the source program specified but the
compilation process required is seen when attempting to use
the code for statements of the form:
A := if B = C then D else E;
This could not be encoded as:
PUSH A
PUSH B
PUSH C
JUMP # L1
PUSH D
BR L2
LOC L1
PUSH E
LOC L2
ASSVAL
The reason is that the items on the stack at the time of the
ASSVAL would be (from top to bottom) [E], [D], [A], because
no items were given which would remove them from the stack.
Hence the ASSVAL would assign the value of E to D and then
leave A dangling on the stack.
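Tracing the rejected encoding with such a compile-time
descriptor stack shows the imbalance directly (a sketch; the
code generator reads the items linearly, so both PUSH D and
PUSH E take effect):

    stack = []
    for item in ["PUSH A", "PUSH B", "PUSH C", "JUMP # L1",
                 "PUSH D", "BR L2", "LOC L1", "PUSH E", "LOC L2"]:
        parts = item.split()
        if parts[0] == "PUSH":
            stack.append(parts[1])
        elif parts[0] == "JUMP":
            stack.pop(); stack.pop()   # the comparison consumes B and C
    print(stack)    # ['A', 'D', 'E'] (bottom to top), i.e. TOS is E:
                    # ASSVAL would pair E with D and leave A stranded.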
Unless otherwise stated, all constants in the
intermediate code are represented in octal.
Descriptors
DEF TAG TEXT TYPE FORM SIZE SPEC PREFIX
This item causes a new descriptor to be generated
and placed in the descriptor area. On creation,
the various fields of the DEF are used to
construct the machine-dependent representation
required for the object.
TAG is an identification which will
be used subsequently to refer to
the descriptor.
TEXT is the source-language identifier
given to the object (a null
string if no identifier was
specified).
TYPE is the type of the object:
GENERAL, INTEGER, REAL, STRING,
RECORD, LABEL, SWITCH, FORMAT.
FORM is one of: SIMPLE, NAME, ROUTINE,
FN, MAP, PRED, ARRAY, NARRAY,
ARRAYN, NARRAYN.
SIZE is either the TAG of the
appropriate record format
descriptor for records, the
maximum length of a string
variable, or the precision of
numerical variables: DEFAULT,
BYTE, SHORT, LONG.
SPEC has the value SPEC or NONE
depending on whether or not the
item is a specification.
PREFIX is one of: NONE, OWN, CONST,
EXTERNAL, SYSTEM, DYNAMIC, PRIM,
PERM or SPECIAL. If SPECIAL is
given there will follow an
implementation-dependent
specification of the properties
of the object (such as that it is
to be a register, for example).
Parameters and Formats
The parameters for procedures and the elements of record
formats are defined by a list immediately following the
procedure or format descriptor definition:
START Start of definition list
FINISH End of definition list
ALTBEG Start of alternative sequence
ALT Alternative separator
ALTEND End of alternative sequence.
Blocks
BEGIN Start of BEGIN block
END End of BEGIN block or procedure
PUSH <tag> Push a copy of the descriptor <tag> onto
the stack.
PROC <tag> This is the same as PUSH except that the
descriptor being stacked represents a
procedure which is about to be called
(using ENTER).
PUSHI <n> Push a descriptor for the integer constant
<n> onto the stack.
PUSHR <r> Push a descriptor for the real
(floating-point) constant <r> onto the
stack.
PUSHS <s> Push a descriptor for the string constant
<s> onto the stack.
SELECT <tag> TOS will be a descriptor for a record.
Replace this descriptor with one describing
the sub-element <tag> of this record.
Assignment
ASSVAL Assign the value described by TOS to the
variable described by SOS. Both TOS and
SOS are popped from the stack.
ASSREF Assign a reference to (the address of) the
variable described by TOS to the pointer
variable described by SOS. Both TOS and
SOS are popped from the stack.
JAM This is the same as ASSVAL except that the
value being assigned will be truncated if
necessary.
ASSPAR Assign the actual parameter described by
TOS to the formal parameter described by
SOS. This is equivalent to either ASSVAL
(for value parameters) or ASSREF (for
reference parameters).
RESULT TOS describes the result of the enclosing
function. Following the processing of the
result code must be generated to return
from the function.
MAP Similar to RESULT except that TOS describes
the result of a MAP. Again a return must
be generated.
DEFAULT <n>
INIT <n> Create <n> data items, corresponding to the
last descriptor defined, and give them all
an initial (constant) value. The constant
is popped from the stack in the case of
INIT but DEFAULT causes the
machine-dependent default value to be used
(normally the UNASSIGNED pattern).
Binary operators
ADD Addition
SUB Subtraction
MUL Multiplication
QUOT Integer division
DIVIDE Real division
IEXP Integer exponentiation
REXP Real exponentiation
AND Logical AND
OR Logical inclusive OR
XOR Logical exclusive OR
LSH Logical left shift
RSH Logical right shift
CONC String concatenate
ADDA ++
SUBA --
The given operation is performed on TOS and SOS, both of
which are removed from the stack, and the result
(SOS op TOS) is pushed onto the stack.
e.g. A = B-C
PUSH A
PUSH B
PUSH C
SUB
ASSVAL
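For concreteness, a toy code generator processing that sequence
for a 7/32-like machine might behave as below (the register
choice and mnemonics are assumptions; the point is only that an
operator pops TOS and SOS and pushes a descriptor for the
result):

    stack, code, reg = [], [], 1

    def push(name):
        stack.append(('var', name))

    def sub():
        # Pop TOS and SOS, emit code, push a result descriptor.
        global reg
        tos, sos = stack.pop(), stack.pop()
        code.append(f"L  {reg},{sos[1]}")   # load SOS into a register
        code.append(f"S  {reg},{tos[1]}")   # subtract TOS from it
        stack.append(('reg', reg))
        reg += 1

    def assval():
        tos, sos = stack.pop(), stack.pop()
        code.append(f"ST {tos[1]},{sos[1]}")

    push('A'); push('B'); push('C'); sub(); assval()
    print(code)     # ['L  1,B', 'S  1,C', 'ST 1,A']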
Unary Operators
NEG Negate (unary minus)
NOT Logical NOT (complement)
MOD Modulus (absolute value)
The given operation is performed on TOS.
Arrays
DIM <d> <n> The stack will contain <d> pairs of
descriptors corresponding to the lower and
upper bounds for an array. This
information is used to construct <n> arrays
and any necessary accessing information for
use through the last <n> descriptors to
have been defined. All of these
descriptors will be for similar arrays.
INDEX SOS will be the descriptor for a
multi-dimensional array and TOS will be the
next non-terminal subscript. The stack is
popped.
ACCESS SOS will be the descriptor of an array and
TOS will be the final/only subscript. Both
descriptors are replaced by a descriptor
for the appropriate element of the array.
E.g. given arrays A(1:5) and B(1:4, 2:6),
and integers J,K:
A(J) = 0 K = B(J, K)
PUSH A PUSH K
PUSH J PUSH B
ACCESS PUSH J
PUSHI 0 INDEX
ASSVAL PUSH K
ACCESS
ASSVAL
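The arithmetic implied by INDEX and ACCESS for B(1:4, 2:6) is
sketched below, assuming a row-major layout with one word per
element (both assumptions): INDEX folds each non-terminal
subscript into a partial offset and ACCESS completes it with
the final subscript.

    bounds = [(1, 4), (2, 6)]     # (lower, upper) bound per dimension

    def element_offset(subscripts, bounds):
        offset = 0
        for sub, (low, high) in zip(subscripts, bounds):
            assert low <= sub <= high   # the diagnostic bound check
            offset = offset * (high - low + 1) + (sub - low)
        return offset               # in elements, from B(1,2)

    print(element_offset([1, 2], bounds))   # 0: the first element
    print(element_offset([2, 3], bounds))   # 6: row 2, column 3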
Internal labels
Internal labels are those labels in the
intermediate code which have been created
by the process of translating from the
source program, and so do not appear
explicitly in the source program. The main
property of these labels is that they will
only be referred to once. This fact can be
used to re-use these labels, as, for
example, a forward reference to a
currently-defined label must cause its
redefinition (see the sketch below).
LOCATE <l> define internal label <l>
GOTO <l> forward jump to internal label <l>
REPEAT <l> backward jump to internal label <l>
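A sketch of how the single-reference property might be exploited
(my own illustration): a GOTO naming a label which is still
defined from an earlier use must be a fresh forward reference,
so the stale definition is discarded and the label number
re-used.

    class InternalLabels:
        def __init__(self):
            self.defined = {}   # label -> address of its LOCATE
            self.pending = {}   # label -> address of the waiting jump

        def goto(self, lab, here):
            # Forward jump: any existing definition is stale, so the
            # label number is simply re-used.
            self.defined.pop(lab, None)
            self.pending[lab] = here

        def repeat(self, lab):
            # Backward jump: the target address is already known.
            return self.defined[lab]

        def locate(self, lab, here):
            # Define (or redefine) the label and back-patch the
            # single outstanding forward reference, if there is one.
            self.defined[lab] = here
            jump = self.pending.pop(lab, None)
            if jump is not None:
                print(f"patch jump at {jump} to target {here}")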
Conditional branches
These branches are always forward.
JUMPIF <cond> <label>
JUMPIFD <cond> <label>
JUMPIFA <cond> <label>
Where: <cond> ::= =, #,
<, <=,
>, >=,
TRUE, FALSE
The two items on the top of the stack are compared and a
jump is taken to <label> if the condition specified by
<cond> is true. In the case of <cond> being TRUE or FALSE
only one item is taken from the stack, and this represents a
boolean value to be tested.
User Labels
LABEL <d> locate label descriptor <d>
JUMP <d> Jump to the label described by <d>
CALL <d> Call the procedure described by <d>
Sundry Items
ON <e> <l> Start of event trap for events <e>.
Internal label <l> defines the end of the
event block.
EVENT <e> Signal event <e>
STOP stop
MONITOR monitor
RESOLVE <m> Perform a string resolution
FOR Start of a for loop
SLABEL <sd> Define switch label
SJUMP <sd> Select and jump to switch label
LINE <l> Set the current line number to <l>
Remembering values in registers
Code Total Incremental
Size Reduction Reduction
---- --------- ---------
P732.1 0 uses 9504 - -
1 use 8194 13.8% 13.8%
2 uses 8192 13.8% 0.0%
P732.2 0 uses 6500 - -
1 use 6126 5.8% 5.8%
2 uses 6126 5.8% 0.0%
P732.3 0 uses 10960 - -
1 use 9968 9.0% 9.0%
2 uses 9956 9.2% 0.2%
P732.4 0 uses 5288 - -
1 use 4970 6.0% 6.0%
2 uses 4958 6.2% 0.2%
P732.5 0 uses 5468 - -
1 use 4990 8.7% 8.7%
2 uses 4986 8.8% 0.1%
P732.6 0 uses 3424 - -
1 use 3208 6.3% 6.3%
2 uses 3208 6.3% 0.0%
P732.7 0 uses 10736 - -
1 use 9880 8.0% 8.0%
2 uses 9874 8.0% 0.0%
P732.8 0 uses 824 - -
1 use 770 6.6% 6.6%
2 uses 770 6.6% 0.0%
P732.9 0 uses 6448 - -
1 use 6148 4.6% 4.6%
2 uses 6148 4.6% 0.0%
P732.10 0 uses 22968 - -
1 use 20656 10.1% 10.1%
2 uses 20650 10.1% 0.0%
P732.11 0 uses 13996 - -
1 use 12470 10.9% 10.9%
2 uses 12442 11.1% 0.2%
P732.12 0 uses 32600 - -
1 use 28532 12.5% 12.5%
2 uses 28392 12.9% 0.4%
Code Total Incremental
Size Reduction Reduction
---- --------- ---------
P11.1 0 uses 9060 - -
1 use 7712 14.9% 14.9%
2 uses 7660 15.4% 0.5%
P11.2 0 uses 6276 - -
1 use 6000 4.4% 4.4%
2 uses 6000 4.4% 0.0%
P11.3 0 uses 9992 - -
1 use 9480 5.1% 5.1%
2 uses 9444 5.5% 0.4%
P11.4 0 uses 5020 - -
1 use 4772 5.4% 5.4%
2 uses 4768 5.6% 0.2%
P11.5 0 uses 5096 - -
1 use 4460 12.5% 12.5%
2 uses 4452 12.6% 0.1%
P11.6 0 uses 3692 - -
1 use 3064 17.0% 17.0%
2 uses 3064 17.0% 0.0%
P11.7 0 uses 7976 - -
1 use 7060 11.5% 11.5%
2 uses 7030 11.8% 0.3%
P11.8 0 uses 668 - -
1 use 652 2.4% 2.4%
2 uses 624 6.6% 4.2%
P11.9 0 uses 4888 - -
1 use 4492 8.1% 8.1%
2 uses 4484 8.3% 0.2%
P11.10 0 uses 20318 - -
1 use 19120 5.9% 5.9%
2 uses 19120 5.9% 0.0%
P11.11 0 uses 12938 - -
1 use 12162 6.0% 6.0%
2 uses 12148 6.1% 0.1%
P11.12 0 uses 12068 - -
1 use 10594 12.2% 12.2%
2 uses 10584 12.3% 0.0%
Remembering sets of registers (environments)
Code Total Incremental
Size Reduction Reduction
---- --------- ---------
P732.1 0 environments 8556 - -
1 environment 8316 2.8% 2.8%
2 environments 8238 3.7% 0.9%
3 environments 8232 3.8% 0.1%
4 environments 8222 3.9% 0.1%
5 environments 8218 4.0% 0.1%
6 environments 8192 4.2% 0.2%
P732.2 0 environments 6202 - -
1 environment 6128 1.2% 1.2%
2 environments 6130 1.2% 0.0%
3 environments 6126 1.2% 0.0%
4 environments 6126 1.2% 0.0%
5 environments 6126 1.2% 0.0%
6 environments 6126 1.2% 0.0%
P732.3 0 environments 10174 - -
1 environment 10062 1.1% 1.1%
2 environments 9968 2.0% 0.9%
3 environments 9966 2.0% 0.0%
4 environments 9964 2.1% 0.1%
5 environments 9956 2.1% 0.1%
6 environments 9956 2.1% 0.1%
P732.4 0 environments 5068 - -
1 environment 4978 1.8% 1.8%
2 environments 4958 2.2% 0.4%
3 environments 4958 2.2% 0.0%
4 environments 4958 2.2% 0.0%
5 environments 4958 2.2% 0.0%
6 environments 4958 2.2% 0.0%
P732.6 0 environments 3262 - -
1 environment 3250 0.4% 0.4%
2 environments 3216 1.4% 1.0%
3 environments 3208 1.7% 0.3%
4 environments 3208 1.7% 0.0%
5 environments 3208 1.7% 0.0%
6 environments 3208 1.7% 0.0%
P732.7 0 environments 10062 - -
1 environment 9970 0.9% 0.9%
2 environments 9894 1.7% 0.8%
3 environments 9880 1.8% 0.1%
4 environments 9874 1.9% 0.1%
5 environments 9874 1.9% 0.0%
6 environments 9874 1.9% 0.0%
P732.8 0 environments 806 - -
1 environment 782 3.0% 3.0%
2 environments 782 3.0% 0.0%
3 environments 770 4.5% 1.5%
4 environments 770 4.5% 0.0%
5 environments 770 4.5% 0.0%
6 environments 770 4.5% 0.0%
P732.9 0 environments 6244 - -
1 environment 6202 0.7% 0.7%
2 environments 6156 1.4% 0.7%
3 environments 6158 1.4% 0.0%
4 environments 6148 1.5% 0.1%
5 environments 6148 1.5% 0.0%
6 environments 6148 1.5% 0.0%
P732.10 0 environments 21214 - -
1 environment 20928 1.3% 1.3%
2 environments 20748 2.2% 0.9%
3 environments 20678 2.5% 0.3%
4 environments 20678 2.5% 0.0%
5 environments 20668 2.6% 0.1%
6 environments 20650 2.6% 0.0%
P732.11 0 environments 12772 - -
1 environment 12592 1.4% 1.4%
2 environments 12486 2.2% 0.8%
3 environments 12472 2.3% 0.1%
4 environments 12460 2.4% 0.1%
5 environments 12452 2.5% 0.1%
6 environments 12442 2.6% 0.1%
P732.12 0 environments 11522 - -
1 environment 11418 0.9% 0.9%
2 environments 11342 1.6% 0.7%
3 environments 11314 1.8% 0.2%
4 environments 11314 1.8% 0.0%
5 environments 11296 2.0% 0.2%
6 environments 11296 2.0% 0.0%
P11.1 0 environments 7686 - -
1 environment 7670 0.2% 0.2%
2 environments 7660 0.3% 0.1%
3 environments 7660 0.3% 0.0%
4 environments 7660 0.3% 0.0%
5 environments 7660 0.3% 0.0%
6 environments 7660 0.3% 0.0%
P11.2 0 environments 6012 - -
1 environment 6000 0.2% 0.2%
2 environments 6000 0.2% 0.0%
3 environments 6000 0.2% 0.0%
4 environments 6000 0.2% 0.0%
5 environments 6000 0.2% 0.0%
6 environments 6000 0.2% 0.0%
P11.3 0 environments 9472 - -
1 environment 9440 0.3% 0.3%
2 environments 9444 0.3% -0.0%
3 environments 9444 0.3% 0.0%
4 environments 9444 0.3% 0.0%
5 environments 9444 0.3% 0.0%
6 environments 9444 0.3% 0.0%
P11.4 0 environments 4784 - -
1 environment 4776 0.2% 0.2%
2 environments 4776 0.2% 0.0%
3 environments 4776 0.2% 0.0%
4 environments 4776 0.2% 0.0%
5 environments 4772 0.2% 0.0%
6 environments 4768 0.3% 0.1%
P11.5 0 environments 4512 - -
1 environment 4464 1.1% 1.1%
2 environments 4456 1.2% 0.1%
3 environments 4452 1.3% 0.1%
4 environments 4452 1.3% 0.0%
5 environments 4452 1.3% 0.0%
6 environments 4452 1.3% 0.0%
P11.6 0 environments 3076 - -
1 environment 3070 0.2% 0.2%
2 environments 3064 0.4% 0.2%
3 environments 3064 0.4% 0.0%
4 environments 3064 0.4% 0.0%
5 environments 3064 0.4% 0.0%
6 environments 3064 0.4% 0.0%
P11.7 0 environments 7104 - -
1 environment 7048 0.8% 0.8%
2 environments 7048 0.8% 0.0%
3 environments 7048 0.8% 0.0%
4 environments 7048 0.8% 0.0%
5 environments 7048 0.8% 0.0%
6 environments 7032 1.0% 1.0%
P11.8 0 environments 640 - -
1 environment 624 2.5% 2.5%
2 environments 624 2.5% 0.0%
3 environments 624 2.5% 0.0%
4 environments 624 2.5% 0.0%
5 environments 624 2.5% 0.0%
6 environments 624 2.5% 0.0%
P11.9 0 environments 4492 - -
1 environment 4484 0.2% 0.2%
2 environments 4484 0.2% 0.0%
3 environments 4484 0.2% 0.0%
4 environments 4484 0.2% 0.0%
5 environments 4484 0.2% 0.0%
6 environments 4484 0.2% 0.0%
P11.10 0 environments 19332 - -
1 environment 19196 0.7% 0.7%
2 environments 19158 0.9% 0.2%
3 environments 19138 1.0% 0.1%
4 environments 19138 1.0% 0.0%
5 environments 19120 1.1% 0.1%
6 environments 19120 1.1% 0.0%
P11.11 0 environments 12280 - -
1 environment 12200 0.6% 0.6%
2 environments 12168 0.9% 0.3%
3 environments 12160 1.0% 0.1%
4 environments 12156 1.0% 0.0%
5 environments 12148 1.1% 0.1%
6 environments 12148 1.1% 0.0%
P11.12 0 environments 10690 - -
1 environment 10616 0.7% 0.7%
2 environments 10604 0.8% 0.1%
3 environments 10604 0.8% 0.0%
4 environments 10594 0.9% 0.1%
5 environments 10584 1.0% 0.1%
6 environments 10584 1.0% 0.0%
Simple allocation of arrays and remembering subscripts
Allocation Remembering
Neither Simple (gain) Subscripts (gain)
------- ---- ------ ---------- ------
P732.1 8596 8476 (1.4%) 8312 (3.3%)
P732.2 6126 6126 (0.0%) 6126 (0.0%)
P732.3 10450 1014 (3.2%) 10426 (0.2%)
P732.4 5056 4958 (1.9%) 5056 (0.0%)
P732.5 5306 5054 (4.7%) 5308 -(0.0%)
P732.6 3384 3254 (3.8%) 3386 -(0.0%)
P732.7 10346 10112 (2.3%) 10344 (0.0%)
P732.8 806 806 (0.0%) 770 (4.5%)
P732.9 6138 6138 (0.0%) 6148 -(0.2%)
P732.10 20806 20684 (0.6%) 20776 (0.1%)
P732.11 12442 12442 (0.0%) 12442 (0.0%)
P732.12 11976 11946 (0.2%) 11326 (5.4%)
Both optimisations Total gain
------------------ ----------
P732.1 8192 4.7%
P732.2 6126 0.0%
P732.3 9956 4.7%
P732.4 4958 1.9%
P732.5 4986 6.0%
P732.6 3208 5.2%
P732.7 9874 4.6%
P732.8 770 4.5%
P732.9 6148 -0.2%
P732.10 20650 0.7%
P732.11 12442 0.0%
P732.12 11296 5.8%
Allocation Remembering
Neither Simple (gain) Subscripts (gain)
------- ---- ------ ---------- ------
P11.1 8572 8188 (4.5%) 7704 (10.1%)
P11.2 6000 6000 (0.0%) 6000 (0.0%)
P11.3 9764 9556 (2.1%) 9644 (1.2%)
P11.4 4848 4776 (1.5%) 4848 (0.0%)
P11.5 4656 4568 (1.9%) 4452 (4.4%)
P11.6 3356 3202 (4.6%) 3218 (4.1%)
P11.7 7844 7728 (1.4%) 7204 (8.2%)
P11.8 644 624 (3.1%) 644 (0.0%)
P11.9 4796 4796 (0.0%) 4484 (6.5%)
P11.10 19236 19140 (0.5%) 19216 (0.1%)
P11.11 12148 12148 (0.0%) 12148 (0.0%)
P11.12 11094 11060 (0.3%) 10616 (4.3%)
Both optimisations Total gain
------------------ ----------
P11.1 7660 10.6%
P11.2 6000 0.0%
P11.3 9444 3.3%
P11.4 4768 1.6%
P11.5 4452 4.4%
P11.6 3064 8.7%
P11.7 7032 10.4%
P11.8 624 3.1%
P11.9 4484 6.5%
P11.10 19120 0.6%
P11.11 12148 0.0%
P11.12 10584 4.6%
Simplifying: X = X op Y
Code Code
Without With Gain
------- ---- ----
P732.1 8292 8192 1.2%
P732.2 6156 6126 0.5%
P732.3 10068 9956 1.1%
P732.4 5088 4958 2.6%
P732.5 5180 4986 3.7%
P732.6 3368 3208 4.8%
P732.7 11438 11296 1.2%
P732.8 772 770 0.2%
P732.9 6214 6148 1.1%
P732.10 21086 20650 2.1%
P732.11 12590 12442 1.2%
P732.12 11438 11296 1.2%
P11.1 8284 7660 7.5%
P11.2 6220 6000 3.5%
P11.3 10040 9444 5.9%
P11.4 5136 4768 7.2%
P11.5 4800 4452 7.2%
P11.6 3342 3064 8.3%
P11.7 7596 7032 7.4%
P11.8 668 624 6.6%
P11.9 4724 4484 5.1%
P11.10 20634 19128 7.3%
P11.11 12894 12148 5.8%
P11.12 11492 10584 7.9%
Passing some parameters in registers
Code Total Incremental
Size Reduction Reduction
---- --------- ---------
P732.1 0 registers 8862 - -
1 register 8360 5.7% 5.7%
2 registers 8192 7.6% 1.9%
P732.2 0 registers 7196 - -
1 register 6544 9.1% 9.1%
2 registers 6126 14.9% 5.8%
P732.3 0 registers 10586 - -
1 register 9976 5.8% 5.8%
2 registers 9956 6.0% 0.2%
P732.4 0 registers 5126 - -
1 register 4958 3.3% 3.3%
2 registers 4958 3.3% 0.0%
P732.5 0 registers 5198 - -
1 register 5022 3.4% 3.4%
2 registers 4986 4.1% 0.7%
P732.6 0 registers 3402 - -
1 register 3222 5.3% 5.3%
2 registers 3208 5.7% 0.4%
P732.7 0 registers 10400 - -
1 register 10048 3.4% 3.4%
2 registers 9874 5.0% 1.6%
P732.8 0 registers 840 - -
1 register 810 3.6% 3.6%
2 registers 770 8.3% 4.7%
P732.9 0 registers 6404 - -
1 register 6172 3.6% 3.6%
2 registers 6148 4.0% 0.4%
P732.10 0 registers 21650 - -
1 register 20826 3.8% 3.8%
2 registers 20650 4.6% 0.8%
P732.11 0 registers 13476 - -
1 register 12442 7.7% 7.7%
2 registers 12442 7.7% 0.0%
P732.12 0 registers 11916 - -
1 register 11452 3.9% 3.9%
2 registers 11296 5.2% 1.3%
Code Total Incremental
Size Reduction Reduction
---- --------- ---------
P11.1 0 registers 7796 - -
1 register 7756 0.5% 0.5%
2 registers 7660 1.7% 1.2%
P11.2 0 registers 6192 - -
1 register 6072 1.9% 1.9%
2 registers 6000 3.1% 1.2%
P11.3 0 registers 9564 - -
1 register 9448 1.2% 1.2%
2 registers 9444 1.2% 0.0%
P11.4 0 registers 4776 - -
1 register 4768 0.2% 0.2%
2 registers 4768 0.2% 0.0%
P11.5 0 registers 4508 - -
1 register 4452 1.2% 1.2%
2 registers 4452 1.2% 0.0%
P11.6 0 registers 3098 - -
1 register 3064 1.1% 1.1%
2 registers 3064 1.1% 0.0%
P11.7 0 registers 7124 - -
1 register 7096 0.4% 0.4%
2 registers 7032 1.3% 0.9%
P11.8 0 registers 624 - -
1 register 624 0.0% 0.0%
2 registers 624 0.0% 0.0%
P11.9 0 registers 4520 - -
1 register 4488 0.7% 0.7%
2 registers 4484 0.8% 0.1%
P11.10 0 registers 19302 - -
1 register 19166 0.7% 0.7%
2 registers 19128 0.9% 0.2%
P11.11 0 registers 12364 - -
1 register 12152 1.7% 1.7%
2 registers 12148 1.7% 0.0%
P11.12 0 registers 10734 - -
1 register 10648 0.8% 0.8%
2 registers 10584 1.4% 0.6%
Remembering condition-codes
Unknown Remembered Gain
------- ---------- ----
P732.1 8218 8192 0.3%
P732.2 6134 6126 0.1%
P732.3 9976 9956 0.2%
P732.4 4968 4958 0.2%
P732.5 4988 4986 0.0%
P732.6 3212 3208 0.1%
P732.7 9880 9874 0.1%
P732.8 770 770 0.0%
P732.9 6150 6148 0.0%
P732.10 20684 20650 0.2%
P732.11 12474 12442 0.2%
P732.12 11318 11296 0.2%
P11.1 7732 7660 0.9%
P11.2 6012 6000 0.3%
P11.3 9516 9444 0.8%
P11.4 4792 4768 0.5%
P11.5 4452 4452 0.0%
P11.6 3076 3064 0.4%
P11.7 7064 7032 0.4%
P11.8 624 624 0.0%
P11.9 4496 4484 0.3%
P11.10 19204 19128 0.4%
P11.11 12192 12148 0.4%
P11.12 10626 10584 0.4%
Forward merging and delaying
Forward
Neither Merge Delaying Merge & Delay
------- ----- -------- -------------
P732.1 8192 8172 (0.2%) 8160 (0.4%) 8136 (0.7%)
P732.2 6126 6110 (0.3%) 6054 (1.2%) 6044 (1.3%)
P732.3 9956 9948 (0.2%) 9872 (0.8%) 9864 (0.9%)
P732.4 4958 4950 (0.2%) 4942 (0.3%) 4942 (0.3%)
P732.5 4986 4970 (0.3%) 4982 (0.1%) 4966 (0.4%)
P732.6 3208 3194 (0.4%) 3140 (2.1%) 3122 (2.7%)
P732.7 9874 9752 (1.2%) 9834 (0.4%) 9812 (0.6%)
P732.8 770 764 (0.5%) 738 (4.2%) 728 (5.4%)
P732.9 6148 6136 (0.2%) 6132 (0.3%) 6120 (0.4%)
P732.10 20650 20586 (0.3%) 20558 (0.4%) 20490 (0.8%)
P732.11 12442 12406 (0.3%) 12342 (0.8%) 12306 (1.1%)
P732.12 11296 11280 (0.1%) 11272 (0.2%) 11256 (0.4%)
P11.1 7660 7660 (0.0%) 7660 (0.0%) 7660 (0.0%)
P11.2 6000 6000 (0.0%) 5988 (0.2%) 5988 (0.2%)
P11.3 9444 9444 (0.0%) 9434 (0.1%) 9434 (0.1%)
P11.4 4768 4768 (0.0%) 4768 (0.0%) 4768 (0.0%)
P11.5 4452 4448 (0.1%) 4452 (0.0%) 4448 (0.1%)
P11.6 3064 3064 (0.0%) 3060 (0.1%) 3060 (0.1%)
P11.7 7032 7004 (0.4%) 7032 (0.0%) 7004 (0.4%)
P11.8 624 620 (0.6%) 620 (0.6%) 616 (1.3%)
P11.9 4484 4484 (0.0%) 4484 (0.0%) 4484 (0.0%)
P11.10 19128 19128 (0.0%) 19128 (0.0%) 19128 (0.0%)
P11.11 12148 12136 (0.1%) 12136 (0.1%) 12124 (0.2%)
P11.12 10584 10584 (0.0%) 10584 (0.0%) 10584 (0.0%)
All optimisations
None All Gain
---- --- ----
P732.1 11300 8136 28.0%
P732.2 7520 6044 19.6%
P732.3 12286 9864 19.7%
P732.4 5782 4942 14.5%
P732.5 6204 4966 19.9%
P732.6 4004 3122 22.0%
P732.7 11750 9812 16.5%
P732.8 988 728 26.3%
P732.9 6848 6120 10.6%
P732.10 24722 20490 17.1%
P732.11 14618 12306 15.8%
P732.12 14064 11256 20.0%
P11.1 9664 7660 20.7%
P11.2 6588 5988 9.1%
P11.3 11092 9434 14.9%
P11.4 5540 4768 13.9%
P11.5 5572 4448 20.2%
P11.6 3666 3060 16.5%
P11.7 8632 7004 18.7%
P11.8 752 616 18.1%
P11.9 5256 4484 14.7%
P11.10 23940 19128 20.1%
P11.11 14328 12124 15.4%
P11.12 12816 10584 17.8%
Total CPU time and CPU time in input/output
Internal I/O: communication from phase 1 to phase 3.
External I/O: input of the source and the output of the object files.
Phase 1 Phase 2 Phase 3 Overall
----------- ---------- ---------- ------------------
Total % Total % Total % Internal External
CPU I/O CPU I/O CPU I/O I/O I/O
All optimisations:
P732.1: 53% 8% 32% 6% 14% 5% 13% 7%
P732.2: 53% 8% 35% 8% 12% 5% 16% 7%
P732.3: 51% 7% 34% 7% 15% 5% 13% 6%
P732.4: 54% 8% 32% 6% 14% 5% 13% 7%
P732.5: 49% 12% 36% 5% 14% 5% 12% 6%
P732.6: 50% 7% 37% 7% 12% 6% 14% 7%
P732.7: 48% 7% 39% 7% 13% 7% 15% 6%
P732.8: 54% 7% 36% 7% 10% 5% 16% 4%
P732.9: 50% 8% 36% 8% 14% 6% 15% 7%
P732.10: 54% 6% 29% 5% 17% 4% 11% 6%
P732.11: 52% 7% 32% 6% 16% 5% 11% 6%
P732.12: 52% 8% 32% 8% 17% 7% 15% 7%
No optimisation:
P732.1: 50% 7% 30% 7% 19% 8% 16% 7%
P732.2: 55% 8% 31% 9% 14% 8% 16% 8%
P732.3: 54% 8% 28% 8% 18% 6% 15% 7%
P732.4: 57% 8% 26% 7% 16% 6% 14% 7%
P732.5: 52% 8% 30% 8% 17% 7% 16% 7%
P732.6: 53% 8% 32% 9% 15% 8% 17% 8%
P732.7: 49% 7% 31% 8% 19% 8% 17% 6%
P732.8: 56% 7% 32% 9% 12% 6% 18% 5%
P732.9: 55% 9% 29% 8% 16% 7% 16% 8%
P732.10: 55% 6% 23% 6% 21% 5% 13% 6%
P732.11: 55% 8% 26% 6% 20% 6% 12% 6%
P732.12: 54% 8% 27% 8% 19% 8% 16% 8%
References
Aho, A.V. & Ullman, J.D. (1972)
"Optimisation of straight line programs"
SIAM J. March 1972.
Allen, F.E. & Cocke, J.A. (1971)
"A catalogue of optimising techniques"
Design and optimisation of compilers
Prentice-Hall, R. Rustin (ed.), 1971.
Basili, V.R. & Turner, A.J. (1975)
"A transportable extendable compiler".
Software - Practice and Experience Vol 5,
1975.
Bell, J.R. (1973)
"Threaded Code".
CACM Vol 16, No 6.
Berry, R.E. (1978)
"Experience with the PASCAL P-compiler"
Software - Practice and experience
Vol 8, No 5, 1978.
Berthaud, M. & Griffiths, M. (1973)
"Incremental compilation and conversational
interpretation"
Annual review in automatic programming,
Vol 7, Part 2, 1973.
Bourne, S.R., Birrell, A.D. & Walker, I.W. (1975)
"Z code, an intermediate object code for
ALGOL68".
The Computer Laboratory, Cambridge.
Branquart, P., Cardinael, I. & Lewi, J. (1973)
"Optimised translation process, application to
ALGOL68".
Int. Comp. Symp. 1973.
Bron, C. & De Vries, W. (1976)
"A PASCAL compiler for PDP11 minicomputers".
Software - Practice and Experience Vol 6,
1976.
Brooker, R.A. et al. (1967)
"Experience with the compiler-compiler"
Computer Journal Vol 9, 1967.
Brown, P. (1972)
"Levels of language for portable software"
CACM Vol 15, 1972.
Brown, P. (1977)
"Macro processors".
Software Portability (ed. P. Brown)
Cambridge University Press, 1977.
Buckle, J.K. (1978)
"The ICL 2900 Series".
Macmillan Press Ltd., 1978.
Cocke, J. (1970)
"Global common subexpression elimination"
ACM SIGPLAN notices, Vol 5, No 7. 1970.
Cocke, J. & Schwartz, J.T. (1970)
"Programming languages and their compilers"
Courant Institute of Mathematical Sciences,
New York University, 1970.
Coleman, S.S., Poole, P.C. & Waite, W.M. (1974)
"The Mobile programming system, JANUS".
Software: Practice and Experience. Vol 4,
1974.
Dewar, H. (1975)
"The ISYS system".
Internal memo, Department of Computer Science,
University of Edinburgh, 1975.
Feldman, J.A. (1966)
""A formal semantics for computer languages
and its application in a compiler-compiler"
CACM 9, 1966.
Feldman, J.A. & Gries, D. (1968)
"Translator writing systems"
CACM 11, 1968.
Gardner, P. (1977)
"An ALGOL68C bootstrap".
Memo CSM-20, University of Essex, 1977.
Gilmore, B.A.C. (1980)
"DEIMOS User's Manual"
Edinburgh Regional Computing Centre
Grosse-Lindemann, C.O. & Nageli, H.H. (1976)
"Postlude to a PASCAL compiler bootstrap on
the DEC system 10".
Software - Practice and Experience, Vol 6,
1976.
Haddon, B.K. & Waite, W.M. (1978)
"Experience with the Universal Intermediate
language JANUS"
Software - Practice and Experience
Vol 8, No 5, 1978.
Halpern, M.I. (1965)
"Machine Independence: its technology and
economics".
CACM Vol 8 No 12, 1965.
Hawkins, E.N. & Huxtable, D.H.R. (1963)
"A multi-pass translation scheme for Algol 60"
Annual review in automatic programming, No 3,
1963.
Hecht, M.S. & Ullman, J.D. (1975)
"A simple algorithm for global data flow
analysis problems"
SIAM J. Vol 4, 1975.
Huxtable, D.H.R. (1964)
"On writing an optimising translator for Algol
60"
Introduction to System Programming,
P. Wegner (ed.),
Academic Press, 1964.
Interdata (1974)
"Common Assembler Language (CAL) User's
Manual".
Interdata publication number 29-375R04. 1974.
Knuth, D.E. (1974)
"Structured programming with GOTO statements".
ACM Computing Surveys 6, No 4, 1974.
Krogdahl, S., Myhre, O., Robertson, P.S., & Syrrist, G.
(1980)
"The SCALA project: S-PORT".
Norwegian Computing Centre working paper.
Lecarme, O. & Peyrolle-Thomas, M. (1978)
"Self-compiling compilers: An appraisal of
their implementation and portability"
Software - Practice and Experience
Vol 8, No 2, 1978.
Lowry, E.S. & Medlock, C.W. (1969)
"Object code optimisation".
CACM Vol 12, 1969.
McKeeman, W.M. (1965)
"Peephole Optimisation"
CACM Vol 8, 1965.
Mock, O., Olsztyn, J. et al (1958)
"The problem of programming communication with
changing machines".
CACM Vol 1, No 8, 1958.
Nelson, P.A. (1979)
"A comparison of PASCAL intermediate
languages"
ACM SIGPLAN Notices, Vol 14, No 8, 1979.
Nori, K.V., Ammann, U. et al (1974 & 1976)
"The PASCAL <P>-compiler: Implementation
notes".
Eidgenossische Technische Hochschule, Zurich,
1976.
Pavelin, C.J. (1970)
"The improvement of program behaviour in paged
computer systems".
Ph.D. Thesis, University of Edinburgh, 1970.
Poole, P. (1974)
"Portable and adaptable compilers"
Advanced course on compiler construction,
Springer-Verlag 1974.
Richards, M. (1971)
"The portability of the BCPL compiler".
Software - Practice and Experience Vol 1,
1971.
Richards, M. (1977)
"Portable Compilers".
Software Portability (ed. P. Brown),
Cambridge University Press, 1977.
Robertson, P.S. (1977)
"The IMP-77 Language".
Internal Report CSR-19-77, Department of
Computer Science, University of Edinburgh,
1977.
Russell, W. (1974)
"A translator for PASCAL P-code".
Final year project report, Department of
Computer Science, University of Edinburgh,
1974.
Satterthwaite, E. (1972)
"Debugging tools for high-level languages".
Software - Practice and Experience, Vol 2,
1972.
Schneck, P.B. & Angel, E. (1973)
"A FORTRAN to FORTRAN optimising compiler".
Computer Journal Vol 16, 1973.
Sibley, R.A. (1961)
"The SLANG System"
CACM Vol 4, 1961.
Spier, M.J. (1976)
"Software Malpractice - A distasteful
experience".
Software - Practice and Experience Vol 6,
1976.
Steel, T.B. (1961)
"A first version of UNCOL
Proc. AFIPS WJCC 19, 1961.
Stephens, P.D. (1974)
"The IMP language and compiler"
Computer Journal Vol 17, 1974.
Stockton-Gaines, R. (1965)
"On the translation of machine language
programs".
CACM Vol 8, No 12, 1965.
Szymanski, T.G. (1978)
"Assembling code for machines with
span-dependent instructions".
CACM 21, 1978.
Tanenbaum, A.S. (1976)
"Structured Computer Organisation"
Prentice/Hall International, 1976.
Thompson, K. & Ritchie, D.M. (1974)
"The UNIX timesharing system"
CACM Vol 17, No 7, 1974.
Trout, R.G. (1967)
"A compiler-compiler system"
Proc. ACM, 1967.
Waite, W.M. (1970)
"The Mobile Programming System: STAGE 2"
CACM Vol 13, 1970.
Welsh, J. & Quinn, C. (1972)
"A PASCAL compiler for the ICL 1900 series".
Software - Practice and Experience, Vol 2,
1972.
Whitfield, H. & Wight, A.S. (1973)
"EMAS - The Edinburgh Multi-Access System"
Computer Journal Vol 16, 1973.
Wichmann, B.A. (1977)
"Optimisation"
Software Portability (ed. P. Brown),
Cambridge University Press, 1977.
Wichmann, B.A. (1977)
"How to call procedures, or second thoughts on
Ackermann's function"
Software - Practice and Experience
Vol 7, No 3, 1977.
Williams, M.H. (1978)
"Long/short address optimisation in
assemblers".
Software - Practice and Experience,
Vol 9, No 3, 1978.
Wulf, W., Johnsson, R.K., Weinstock, C.B., Hobbs, S.O. &
Geschke, C.M. (1975)
"The design of an optimising compiler"
Elsevier Computer Science Library, 1975.