Date: Tue, 01 Aug 89 15:16:17 +0100
From: dcw@ukc.ac.uk
To: rde@ukc.ac.uk
Subject: Superoptimizer


	I just came across a copy of the news item about the "superoptimizer"
that we were talking about recently.

		David

From eagle!ukc!mcvax!uunet!husc6!mailrus!ames!ncar!oddjob!mimsy!chris Mon Jul  4 09:45:25 BST 1988
Article 4992 of comp.arch:
Path: eagle!ukc!mcvax!uunet!husc6!mailrus!ames!ncar!oddjob!mimsy!chris
>From: chris@mimsy.UUCP (Chris Torek)
Newsgroups: comp.arch
Subject: ASPLOS: `Superoptimiser'
Message-ID: <12296@mimsy.UUCP>
Date: 3 Jul 88 05:14:51 GMT
Reply-To: chris@mimsy.umd.edu (Chris Torek)
Organization: University of Maryland, Dept. of Computer Sci.
Lines: 200

Now that I have it in front of me:

	Second International Converence on Architectural Support
	for Programming Languages and Operating Systems (ASPLOS II)

	October 5-8, 1987
	Palo Alto, California

	Computer Society Order Number 805
	Library of Congress Number 87-80438
	IEEE Catalog Number 87CH2440-6
	ISBN 0-8186-0805-6
	SAN 264-620X
	ACM Order Number 556870
	ISBN 0-89791-238-1

(no, I have no idea why there are two ISBNs)

The paper itself is (pp. 122--126):

	Superoptimizer -- A Look at the Smallest Program

	Henry Massalin

	Department of Computer Science
	Columbia University
	New York, NY 10027

	ABSTRACT

	Given an instruction set, the superoptimizer finds the shortest
	program to compute a function.  Startling programs have been
	generated, many of them engaging in convoluted bit-fiddling
	bearing little resemblance to the source programs which defined
	the functions.  The key idea in the superoptimizer is a
	probabilistic test that makes exhaustive searches practical for
	programs of useful size.  The search space is defined by the
	processor's instruction set, which may include the whole set,
	but is typically restricted to a subset.  By constraining the
	instructions and observing the effect on the output program,
	one can gain insight into the design of instruction sets.  In
	addition, superoptimized programs may be used by peephole
	optimizers to improve the quality of generated code, or by
	assembly language programmers to improve manually written
	code.

[Here it seems appropriate to attach the ACM copyright:

    Permission to copy without fee all or part of this material is
    granted provided that the copies are not made or distributed for
    direct commercial advantage, the ACM copyright notice and the title
    of the publication and its date appear, and notice is given that
    copying is by permission of the Association for Computing
    Machinery.  To copy otherwise, or to republish, requires a fee
    and/or specific permission.

    (c) 1987 ACM 0-89791-238-1/87/1000-0122 $00.75

`$00.75'? :-) ]

I think it is important to note section 4.1, Current Limitations:

    Even with the probabilistic test, the exhaustive search still grows
    exponentially with the number of instructions in the generated
    program.  The current version of superoptimizer has generated
    programs 12 instructions long in several hours running time on a
    16MHz 68020 computer.  Therefore, the superoptimizer has limited
    usefulness as a code generator for a compiler.

Here are the 68020 routines for the various functions (I use the MIT/Sun
mnemonics below; I like them better):

signum:
	| x in d0:
		addl	d0,d0
		subxl	d1,d1
		negxl	d0
		addxl	d1,d1
	| signum(x) in d1

abs:
	| x in d0:
		movl	d0,d1
		addl	d1,d1
		subxl	d1,d1
		eorl	d1,d0
		subl	d1,d0
	| abs(x) in d0

max:
	| d0=X, d1=Y
		subl	d1,d0
		subxl	d2,d2
		orl	d2,d0
		addxl	d1,d0
	| max(X,Y) in d0

min:
	| d0=X, d1=Y
		subl	d1,d0
		subxl	d2,d2
		andl	d2,d0
		addl	d1,d0
	| min(X,Y) in d0

minmax:
	| d0=X, d1=Y
		subl	d1,d0
		subxl	d2,d2
		andl	d0,d2
		eorl	d2,d0
		addl	d1,d0
		addl	d2,d1
	| d0=max(X,Y); d1=min(X,Y)

logicals:
	| X in d0
		negl	d0
		subxl	d0,d0
	| d0 = X==0 ? 0 : -1
		negl	d0
	| d0 = X==0 ? 0 : 1

	| X in d0
		negl	d0
		subxl	d0,d0
		notl	d0
	| d0 = X==0 ? -1 : 0

	| X in d0
		negl	d0
		subxl	d0,d0
		addql	#1,d0
	| d0 = X==0 ? 1 : 0

8 digit BCD to binary (done in three steps by Superoptimizer):
	| 8 digit BCD number in d0
		movl	d0,d1
		andl	#0xf0f0f0f0,d1
		lsrl	#3,d1
		subl	d1,d0
		subl	d1,d0
		subl	d1,d0
		movl	d0,d1
		andl	#0xff00ff00,d1
		lsrl	#1,d1
		subl	d1,d0
		lsrl	#2,d1
		subl	d1,d0
		lsrl	#3,d1
		addl	d1,d0
		movl	d0,d1
		swap	d1
		mulu	#0xd8f0,d1
		subl	d1,d0
	| binary equivalent in d0

Multiplication by constants:
	| d0 *=  29
		movl	d0,d1
		lsll	#4,d0
		subl	d1,d0
		addl	d0,d0
		subl	d1,d0

	| d0 *= 39
		movl	d0,d1
		lsll	#2,d0
		addl	d1,d0
		lsll	#3,d0
		subl	d1,d0

	| d0 *= 156
		movl	d0,d1
		lsll	#2,d1
		addl	d1,d0
		lsll	#5,d0
		subl	d1,d0

	| d0 *= 625
		movl	d0,d1
		lsll	#2,d0
		addl	d1,d0
		lsll	#3,d0
		subl	d1,d0
		lsll	#4,d0
		addl	d1,d0

and an interesting note on division:
    A general divide by constant that works for all 32-bit arguments is
    too long to realize any time gain over the divide instruction, and
    is certainly not shorter.  Additionally, there doesn't seem to be
    any nifty arithmetic-logical operations [sic] that simplify the
    process.  The generated programs just multiply by the reciprocal of
    the constant.  Since we do an exhaustive search, this negative
    result can be seen as a confirmation of the inherent high cost of
    divisions for the instruction sets [68020 and 80?86] considered.
