
More Taste: Less Greed?

or Sending UNIX to the Fat Farm

C H Forsyth
Department of Computer Science
University of York
Heslington, York YO1 5DD
England

+44 904 432753
forsyth@minster.york.ac.uk

ABSTRACT

You (like us) have 80 to 90 Sun 3/50 machines with 4 megabytes of memory. You have been given some optical discs containing System V.4. Which can you least afford to discard? Things are getting out of hand. Soon, 32 megabyte Ultrasparks will be needed just for us to start the window system in under a minute.

For UNIX, now in middle-age rotundly recalling its sprightly youth, mere `tuning' will not cause that heavy code to slip away. We need to reconsider and re-implement the system interface periodically, to take account of changes in its environment. We must be willing to throw things away, discarding parts of the older implementation completely, rather than corrupting clean new mechanisms to approximate the mistakes of the past.

To illustrate this thesis, I shall discuss work I have done on SunOS 3.5 to reduce its size and complexity. The virtual memory system has been replaced by a simpler one using ideas from the EMAS system and elsewhere. A stream IO system in the 8th/9th Edition style has been added, replacing the old character IO system, pipes, and UNIX domain sockets. I have also made some preliminary forays into the Virtual File System interface.

1. INTRODUCTION

Many papers describe various ways of dealing with the complexity of `programming in the large'. There are fewer that take a large program apart to try to find out why it is large, how it grew, and whether it can be made rather smaller. The commonly available versions of the UNIX system have been growing relentlessly with each new release and each new feature; the `standard' UNIX kernel is now a moderately large program, surrounded by very large applications. I have watched it become much harder to understand, maintain, and modify. This troubles me directly (since I provide system software support), and so over the past few years I have made various attempts, very much part-time, to recode or rewrite parts of the system to be smaller and simpler. The work began on a VAX with the 3rd Berkeley Distribution, and continues on the Sun 3 series with SunOS 3.5. The changes have been growing more extensive with time.

Here, I shall pick examples from various parts of the system to show how certain designs match the structure of UNIX well, and thus make things easier; and conversely, where there is a serious clash, the implementation becomes much more complicated and the facilities less useful. It can also be less painful in the end to replace some older things completely, rather than pile hack upon hack. I claim it is a mistake to standardise (or to have standardised) certain parts of the system interface in their current form.

Background

The UNIX kernel I ran until 1987 on our VAX machines was a `re-port' of 7th Edition to the VAX, but it included: a recoded 3BSD paging system, many devices, an X.25 interface, and my own implementations of ideas from 8th Edition UNIX (which was not generally available), specifically the file system switch [Wei84], /proc [Kil84], and a `stream' IO system [Rit84]. The latter included pseudo-terminals, the message and connection disciplines, and stream drivers and modules that provided some of the Ethernet and Internet protocols: deuna, arp, ip, and udp. There were two peculiar special files: /dev/time and /dev/wait. By the standards then prevailing the kernel was acceptably small:

	/unix: 95280+20400+146252 = 261932b = 0x3ff2cb

(All kernel tables are included in the data and bss sizes; IO buffers are not.)

Starting with `layers' [Pik83] code by Simon Kenyon, Nigel Roles wrote an implementation of Pike's window system mpx [Pik84] that was good enough to run programs such as the text editor jim for the Atari ST. We had a promising environment.

In 1987, however, we decided for financial reasons to switch to Suns (discless 3/50s supported by file servers). I was disconcerted, not to say disheartened, to find that the Sun kernel code was about 390 kbytes, even properly configured (as distributed it was 490k). It had notionally been built on 4.2 BSD, but the changes to support the Virtual File System and `vnodes' [Kle86] were extensive, although the character IO system and socket systems were largely unaffected.

Investigation

I decided to discover why the system was so much larger and (apparently) more complicated than previous UNIX systems. I thought that a good way to do this was gradually to replace parts of it, and compare the size and complexity of the old and new versions. Rewriting code forces one to (re)consider all the problems the original author was trying to address. Furthermore, a `paper' redesign would not only be unconvincing, it would do little to provide the facilities, particularly streams, that had been lost by discarding the VAX system.

A significant constraint was that the system must run SunOS 3.x binaries: we had software such as Framemaker for which we had no source, and programs would also need to be exportable without change.

Excluding devices, the larger components of the kernel were: the virtual memory system; the window system; the 4.2BSD File System; basic kernel functions including fork, exec, signals, and scheduling; the network file system; sockets and buffering, including pipes; virtual file system support; and the network software. I decided to look at a few of them more closely, taking advantage of the experience gained on the VAX. Some degree of improvement was ultimately made in all areas except the Fast File System and the network side of the socket code. (Eventually both might be replaced.) The CPU scheduling code was replaced by a variant of a simple yet flexible scheduler developed at the University of Maryland [Str86].

The new system has been in use in various forms for more than a year on several Sun 3/50s and one 3/160 in the Department. Depending on configuration, it is currently about 120 kbytes smaller than the old (the Sun window system still takes about 60 kbytes in the current kernel). There is still much scope for improvement, but even now the system does more, uses memory more efficiently, and is (I claim) better structured.

In following sections I shall discuss the changes to the paging system, changes made or currently being made to the virtual file system, the installation of the stream IO system, and the implications of all that on kernel implementation and applications.

2. MEMORY MANAGEMENT

The SunOS paging system was completely replaced. The new system uses a `working set' paging scheme with a local paging policy, instead of the previous global one. Some implementation ideas, and also extensive justification for this approach, were found in the published descriptions of the EMAS system [Ree81, Ste80, Ste80]. Two other virtual memory systems, by Richard Miller [Mil84], and by a group at Gould [Akh87], both written for versions of UNIX, were also examined closely (our various implementations differ). My goal was to provide a reasonable implementation of the basic UNIX process model on a machine with virtual memory, not to add new primitives; I wished the result to have very few moving parts. The use of the EMAS scheduling algorithm is perhaps the most novel aspect.

2.1. The original paging algorithm

The Berkeley paging system uses a `clock' algorithm to approximate LRU page replacement across the system. This has undergone some adjustments with time, but the basic scheme is as follows. (See [Lef89] for precise details.) When a process requires a new page it simply takes a free one from the global page pool as soon as one is available. Pages are made available when processes terminate, but also by the intervention of a system process, the `pageout daemon'. There is a big array, the `core map', with one entry for each usable page of memory. The pageout daemon cycles through the array at a rate determined by some scheduling parameters. Free pages are skipped. Any other page is examined to see if it can be discarded.

The daemon checks the page table entry for the page to see if the `page used' bit is set. If not, the page can be made available. If it has been modified, it must first be written out (asynchronously). If the page has been referenced, the daemon resets the `used' bit and proceeds to the next core map entry. If the page has not been touched the next time round, it will be freed. On systems with large memories, a complete cycle can take some time, and so recent versions have the daemon cycling two pointers through the map: one clears `used' bits; the other reclaims pages.
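The two-handed clock can be sketched in a few lines of C. The frame record and its fields are invented for illustration; a real core map entry carries much more state, and the asynchronous write-out of modified pages is elided here.

	#include <stddef.h>

	/* Hypothetical page frame record: names are illustrative only. */
	struct frame { int free, used, modified; };

	/* One step of the two-handed clock: the front hand clears `used'
	   bits; the back hand, a fixed distance behind, frees pages still
	   unused.  Returns 1 if the back hand reclaimed its frame. */
	int clock_step(struct frame *map, size_t nframes, size_t *front, size_t gap)
	{
		size_t back = (*front + nframes - gap) % nframes;
		int reclaimed = 0;

		if (!map[*front].free)
			map[*front].used = 0;	/* give the page one more cycle */

		if (!map[back].free && !map[back].used) {
			/* untouched since the front hand passed: a modified page
			   would be queued for write-out first, then freed */
			map[back].free = 1;
			reclaimed = 1;
		}
		*front = (*front + 1) % nframes;
		return reclaimed;
	}

With a gap of two frames, a page referenced once is reclaimed two steps after its `used' bit is cleared, which is the behaviour the prose above describes.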

In principle, this is a simple and elegant technique. It has been applied quite effectively in a controlled environment in a new implementation of file cacheing in UNIX using virtual memory [Bra89]. There are, however, several problems in practice with the algorithm when applied to UNIX processes, at least in its SunOS 3.5/BSD 4.2 version: the core map must keep a back reference to a page table entry, a process's working set size is not always accurate, and the memory scheduling algorithms are elaborations of the early `swapping' algorithms.

The daemon must be able to locate `the page table entry' for each page. The core map entry for a physical page must therefore point back to the page table entry that refers to it, if there is one. Map entries for data and stack pages store a (process, page) pair. Pages of an executable program, though, can be referenced by several such entries, which must all be found. The solution adopted is to store (text-table, page) in the core map. Processes using a given executable are linked in a chain from its text table entry. Whenever one of the processes' page table entries changes, owing to page in or page out, all others must change to match it. The pageout daemon looks round the list too, in order to calculate the union of used/modify bits for all processes referring to the page.

When the page daemon takes a page, it invalidates the page table entry but leaves a reference to the page (the page frame number), so that should the process soon fault on that page it can reclaim the page directly from the free list. Before any process is given a free page, the memory allocator must therefore clear any reference to it, using the data in the core map in much the same way as the pageout daemon. Just to liven things up, in 4.2BSD the network system can claim pages at interrupt level using the memory allocator, so that a process's page table entries can be changed by an interrupt! (In practice this is not a problem provided care is taken at the kernel main level, but it is something else to have to reason about when checking the code.)

To reduce overhead, the old pageout daemon does not run until the amount of free memory drops below a specified value. It then cycles round the core map at a rate that depends on the amount of memory left. If it cannot free enough memory quickly enough, the memory scheduler (swapper) is started. The swapper synchronously writes to disc clumps of the modified pages from the whole image of each process it selects.

2.2. Simplified paging

Some of the complexity in the old paging system results from using a global paging policy, which causes pointers to go `the wrong way' for UNIX, where pointers elsewhere almost invariably go from a process to a shared resource which is given a reference count. For instance, when sharing executable code, pointers go from the process to a `text table' which has the reference count. When sharing open files, pointers go from a per-process open file table to a reference-counted system open file table (which has the seek pointer), which points in turn to the file's in-core `inode' (or vnode), which again has a reference count. In contrast, the solution adopted for sharing text pages is adequate in practice, but becomes messier when data pages are to be shared (for instance, in some implementations of `fork').

Pointers, then, should go from the process to memory (the shared resource). Pages can be shared easily if there is a reference count for each page. In the new system, a process releases its own pages when it no longer requires them. Periodically, each process examines its page table entries and discards pages that have not been used since the last such scan. It resets the `used' bit on the remaining pages. Over time, the process tracks its own working set.

When a process discards a page during a working set scan or a complete page out, the page is put in either a free page or modified page list depending on whether it has been modified since being read in. Pages on the modified list are written to disc by a separate system process (the new `pageout daemon'). It does not write pages out immediately, but only when the modified page list grows too long, or when the free list length drops below a specified level. This increases the chance that the page can be reclaimed by its process without IO overhead. Pages are returned to the free list after they have been written out. The new pageout daemon is a simple loop: it waits for work, collects pages from the modified list, and writes them out.

By itself, use of a working set scan in each running process is not enough to prevent thrashing, since the combined working sets of all such processes might not fit in memory. The collection of processes in memory must be subject to some global constraint. The most obvious constraint to apply is that before a process is loaded there must be enough store to hold its working set. This typically changes with time, however, and so a scheduler is required that can adapt to these changes.

The EMAS system has an elegant scheduling system [She74] that was chosen for use in our system. Each process in EMAS is assigned a category, a single number that captures its current resource requirements, and also determines its CPU and memory priority, working set scan frequency, and CPU quantum size. When loaded by the scheduler, a process is given a page and CPU time allocation determined by its category. It remains loaded and in the CPU despatcher's queues until it exhausts either allocation or goes to sleep. The process must then submit itself for rescheduling.
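The working set scan described above, in which each process periodically sweeps its own page table entries, can be sketched as follows. The structures and names are invented for illustration, and list manipulation is reduced to a per-page disposition code.

	#include <stddef.h>

	/* Illustrative page-table entry; field names are assumptions. */
	struct pte { int valid, used, modified; };

	enum { TO_FREE_LIST = 1, TO_MODIFIED_LIST = 2 };

	/* One working set scan: pages untouched since the previous scan are
	   released (dirty ones to the modified list, clean ones to the free
	   list); surviving pages have their `used' bit cleared so the next
	   scan can judge them afresh.  dispose[i] records which list page i
	   was released to (0 = kept).  Returns the new working set size. */
	size_t ws_scan(struct pte *pt, size_t npages, int *dispose)
	{
		size_t kept = 0;
		for (size_t i = 0; i < npages; i++) {
			dispose[i] = 0;
			if (!pt[i].valid)
				continue;
			if (pt[i].used) {
				pt[i].used = 0;		/* still in the working set */
				kept++;
			} else {
				pt[i].valid = 0;	/* dropped out of the working set */
				dispose[i] = pt[i].modified ? TO_MODIFIED_LIST : TO_FREE_LIST;
			}
		}
		return kept;
	}

Note that no other process or daemon is involved: the process trims its own image, which is the point of the local policy.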

The scheduler refers to a resource model for the system, encoded in a category table which is indexed by process categories. The table can be viewed as a little state machine with transitions from category to category. There are four transitions from each category, labelled `more memory', `more time', `unused memory', and `unused time'. The scheduler follows the appropriate path to obtain a new category that suits the process. In this way the scheduler quickly tracks the character of a process as it changes.

The rescheduled process may remain loaded if it is runnable, its new allocation fits in the available memory, and no higher priority process is waiting. It must otherwise page itself out, and join a priority queue to wait for memory. A process sleeping on an external event must likewise page itself out whilst waiting unless the memory it occupies is not in demand. When it wakes, it joins the appropriate memory queue. A memory scheduler process cycles through the priority queues using an auxiliary priority ratio table, loading each process in turn as memory becomes available. (See the EMAS papers for details. Wilkes [Wil75] includes the EMAS scheme in an interesting discussion of paging viewed as a problem in control theory.)

The notion of an `external event' is hard to sustain in modern UNIX systems. Instead, the priority of a sleep# is assumed to say something about the likely speed of response. When a process sleeps at high priority, it records the unused part of its quantum, adjusts its CPU priority (see below), and enters the sleep queues as usual. A process sleeping at low priority first checks with the memory scheduler before entering the sleep queues. If its memory is required now, it will page itself out, as in EMAS, but otherwise it puts itself in a priority queue of processes that are `idle' in memory.
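The category table can be rendered as a small state machine in C. The four-transition shape follows the EMAS scheme described above, but the particular categories, allocations, and quarter-allocation thresholds below are invented for illustration.

	/* A category table sketched as a state machine.  Numbers and the
	   four-category layout are illustrative assumptions only. */
	struct category {
		int pages;	/* residency allocation, in pages */
		int quantum;	/* CPU allocation, in ticks */
		int more_mem, more_cpu, less_mem, less_cpu;  /* successors */
	};

	static const struct category cat_table[4] = {
		/* pages quantum +mem +cpu -mem -cpu */
		{   32,    10,     1,   2,   0,   0 },	/* 0: small, interactive */
		{  128,    10,     3,   2,   0,   1 },	/* 1: large, interactive */
		{   32,    60,     3,   2,   0,   0 },	/* 2: small, compute-bound */
		{  256,    60,     3,   3,   1,   2 },	/* 3: large, compute-bound */
	};

	/* At reschedule, follow one transition according to how the process
	   spent its last residency: did it exhaust its page or CPU
	   allocation, or leave either mostly unused? */
	int recategorise(int cat, int pages_used, int ticks_used)
	{
		const struct category *c = &cat_table[cat];
		if (pages_used >= c->pages)		return c->more_mem;
		if (ticks_used >= c->quantum)		return c->more_cpu;
		if (pages_used < c->pages / 4)		return c->less_mem;
		if (ticks_used < c->quantum / 4)	return c->less_cpu;
		return cat;				/* current category still fits */
	}

A single table lookup per reschedule is all the policy machinery a process needs, which is what makes the scheme attractive to expose through /dev/sched later on.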

The new memory scheduler has a simple loop: it checks the `wait memory' queues in the order given by the priority ratio table, and once it has found a process, it waits for memory to become available, as lower priority processes page themselves out. It also watches for the arrival of idle processes, and when memory is needed causes them to page themselves out, starting with low priority processes, as soon as they have been idle `long enough' (a scheduling parameter).

The original scheduler had the hardware clock interrupt increment a `time' field in each process descriptor to decide how long a process had been in or out of memory, or sleeping. Berkeley moved this to a function called periodically by a software clock, but it still required a scan of all living processes. Instead, in this system, a process is simply time stamped when it enters any (FIFO) scheduling queue. (Currently the stamp is the time in ticks since system startup.)

2.3. Improving UNIX implementation

The previous section gave an overview of the old and new paging systems. Let us look now at some areas in more detail, to compare the complexity of the implementations.

- Fork. The fork call makes a copy of the caller's memory image. As many have observed, this is often wasteful, since in most cases the fork precedes an exec which throws the copy away! Paging everything in to make the copy will also disrupt tracking of the working set. Fortunately, copying is just the specification; we are free to implement it in the most efficient way. Many have suggested (and several have implemented) sharing memory between parent and child until one of them writes to a page; the write is intercepted, the page is copied, and the write is completed in a private copy of the page. On some machines copy-on-write is hard or costly to implement, but copy-on-reference can be used instead and should be nearly as effective in the most common case (speeding up fork before exec).

Berkeley nevertheless implemented fork by complete copy, and added a new system call vfork instead. The parent passes its page tables to the child. The parent is suspended, marked `no VM', and linked to the child, which is marked `in vfork'. The child process returns from fork and continues to run in the parent's memory until exec or exit, at which time it wakes its parent, and waits. The parent grabs its memory back, resets some pointers, and allows the child to proceed. Several parts of the system must check for `in vfork' (including the terminal handler!). The system calls exec and exit must synchronise with (the parent's) fork, and the pageout daemon must follow the vfork links to find the current user of the page if its true owner is marked `no VM'. Thus fork is moderately expensive, and vfork is rather odd.

# This refers to the kernel sleep/wakeup synchronisation system, not to the C library function.

The new system does use copy on write. The fork call copies the swap maps, incrementing the swap reference counts; copies the page tables setting `copy on write' state and read-only access; and increments a core map reference count for valid pages (a page will not be put in the free list until the reference count goes to zero). There can then be two kinds of page access faults. On a `write protect' error, a page that is no longer shared can have its permissions restored; a shared page must be copied. In either case the page becomes private to the process. (A private page also results from a `page missing' fault if the page is `fill on demand'.) Including the code to implement reference counts on swap pages, no more code is required to implement copy on write in this system than vfork took in the old.
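The `write protect' fault path can be sketched as follows. The data structures, the alloc_page and copy_page helpers, and the single-level reference counting are simplifications invented for illustration; the real code must also maintain the swap map counts mentioned above.

	#include <stdlib.h>
	#include <string.h>

	/* Illustrative core-map entry and PTE; names are assumptions. */
	struct corepage { int refcnt; char data[64]; };

	struct cow_pte {
		struct corepage *page;
		int write_protected;	/* set on both copies at fork time */
	};

	static struct corepage *alloc_page(void)	/* stand-in for the free list */
	{
		return calloc(1, sizeof(struct corepage));
	}

	static void copy_page(struct corepage *dst, const struct corepage *src)
	{
		memcpy(dst->data, src->data, sizeof dst->data);
	}

	/* Handle a `write protect' fault on a copy-on-write page: if the
	   frame is no longer shared, just restore write permission;
	   otherwise copy it into a private frame and drop one reference.
	   Either way the page ends up private to the faulting process. */
	void cow_fault(struct cow_pte *pte)
	{
		if (pte->page->refcnt == 1) {
			pte->write_protected = 0;	/* last user: restore access */
		} else {
			struct corepage *fresh = alloc_page();
			copy_page(fresh, pte->page);
			pte->page->refcnt--;	/* old frame stays with the others */
			fresh->refcnt = 1;
			pte->page = fresh;
			pte->write_protected = 0;
		}
	}

The pleasant property, as the text notes, is that nothing here is asynchronous: the faulting process does all the work on its own behalf.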

- Page theft. In the original system, the pageout daemon wanders round the core map stealing pages. Each time, it must check the state of the process owning the page. If it is in a safe state, it temporarily maps in the process's kernel data area to access the swap maps. If the page being stolen had been modified, the daemon mapped it into its own VM and started asynchronous IO.

In the new system, nothing happens `asynchronously' to a process as regards the paging system, since it either discards its own pages or must queue itself to allow the memory scheduler to do so. It simply does neither if it is in an inconvenient state, or if scheduling policy does not yet require it. The new working set scan routines pagediscard and refscan come to 51 lines including some optional event tracing calls (except when a page must be freed, no functions are called during a scan), but many more lines of code have been removed.

- Reclaiming pages. In either system, a process might lose or discard a page, then require it again before it has been taken from the free list for use elsewhere. As mentioned above, the old system leaves the page table entry non-zero to allow reclaim, but clears it as soon as the page is taken for use elsewhere. It also has a hash table, in which pages can be found using the pair (source-vnode, block) as a key.

In his system, Richard Miller [Mil84] had an array of disc block addresses, one per memory page, each entry containing either zero or the current swap block address of the page. When a page table entry was invalidated, the page frame number was left non-zero. Before reclaiming a page, the system checked that the swap block address matched that expected by the current process. No action was necessary if the page was claimed by another process. A similar scheme is used here (the swap address is kept in the core map, for use by the pageout daemon). There is currently no need for a hash table. Each shared image (text) entry provides a `master page table' that tracks the location of executable pages, which might page out from one frame and page in elsewhere some time later (but which should still be shared).
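The reclaim check in Miller's style reduces to a comparison of swap addresses; the frame record below is invented for illustration.

	/* Illustrative frame record: the swap block address remembered for
	   the page that last occupied this frame, plus a free flag. */
	struct reclaim_frame { long swapblk; int free; };

	/* Returns 1 and takes the frame if it still holds our page (same
	   swap address, still on the free list); 0 means the frame has been
	   reused and the page must come back in from disc. */
	int try_reclaim(struct reclaim_frame *f, long expected_swapblk)
	{
		if (f->free && f->swapblk == expected_swapblk) {
			f->free = 0;
			return 1;
		}
		return 0;
	}

No hash table is needed: the faulting process already knows which swap block it expects.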

- Swapping. A swap operation in the old system writes out all the modified pages in a process. There are special support routines that find contiguous regions of dirty pages that are then split to fit in contiguous regions of swap space. There are some others that, during page out, find modified memory pages that are adjacent in virtual memory to the one selected by the page out daemon.

In the new system, a swap is just a complete process page out, which simply frees the process's memory; clean pages go on the free list, and dirty ones go on the modified page list. If dirty page table entries are adjacent, and they are freed in sequence, and at the same time, their associated pages will be adjacent on the modified page list. The core map entries for the pages have the swap addresses attached to assist page reclaim. The pageout daemon simply checks adjacent modified page list entries to see if the swap blocks are adjacent and if so writes them out with one IO request. This gives the effect of a `partial swap' at little cost. The same effect is obtained by a process that releases adjacent dirty pages during a working set scan. The cost of a bad scheduling decision (eg, a sleeping process is swapped and immediately wakes up) is reduced, since a process can reclaim pages from the modified page list before they are written. (The per-process data and page tables are paged out and reclaimed in the same way.)
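The adjacency check the pageout daemon makes can be sketched as a scan for a run of consecutive swap blocks; the entry layout is an assumption for illustration.

	#include <stddef.h>

	/* Modified-list entry: each dirty page carries the swap block
	   address recorded in its core map entry (names invented). */
	struct dirty { long swapblk; };

	/* Return the length of the leading run of entries whose swap blocks
	   are consecutive: those pages can be written with a single IO
	   request, giving the `partial swap' effect at little cost. */
	size_t run_length(const struct dirty *list, size_t n)
	{
		size_t len = 1;
		while (len < n && list[len].swapblk == list[len - 1].swapblk + 1)
			len++;
		return n ? len : 0;
	}

The daemon would issue one write per run rather than one per page; no special clustering routines are needed because adjacency falls out of the order in which pages were freed.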

- User control. The use of working sets should help to address some of the problems mentioned by Rosenthal [Ros89], since the scheduler can more accurately track a process's memory requirements. Even an overenthusiastic process can be kept within bounds, paging mainly against itself, so that on a single-user machine for instance, a text editor running in a windowing system at the same time as such a process has a better chance of remaining responsive compared to the old system.

Scheduling policy can be controlled in several ways, most simply by changing the category table, although Stephens et al. [Ste80] do warn that the effects of doing so are not always easy to predict, and might require experiment. The scheduling parameters and tables are accessible through special files in the directory /dev/sched, making it easy to give workstation users control over the allocation of resources to processes on their machines. Windowing processes can be given a set of categories disjoint from the others, just as EMAS provided different sets of categories for interactive and batch processes.

2.4. Summary

The new paging system is just over half the size of the old one. I would claim that the new one is easier to reason about, since the various processes involved interact only when each is in a well defined state. They co-operate: a process does not barge into another's memory without knocking.

3. FILE SYSTEM TYPES

The first `file system switch' appeared in the 8th Edition. Several interesting file system types were implemented using it: the network file system /n [Wei84] and the process file system /proc [Kil84]. Others have followed suit: the Generic File System [Rod86], and the Remote File System [Rif86] use (somewhat more elaborate) file switches. Sun developed the Virtual File System (VFS) and vnodes [Kle86].

I had implemented my own impression of the 8th Edition file system switch on our VAX. (`My own ...' because very few details had been published about the 8th Edition switch.) When we then moved to Suns, I took several of the older file system types, such as /proc, and tried to adapt them to work in the VFS/vnode system. I found it harder to use: it did no work for me! I did produce an acceptable solution to my immediate problems, but the VFS interfaces remain unsatisfactory. Closer study revealed some confusion and limitations in its design and implementation. To see why, we must look a little closer at what these systems do.

3.1. A simple UNIX file system switch

Adding a file system switch in the 8th Edition style to the 7th Edition system is fairly easy to do. A value is added to each mount table entry that tells which type of file system is mounted there. That value is used as an index into a table of file system operations, and the parts of the mount table entry used only by the disc file system are moved elsewhere. I found that extensive changes to the kernel source code could be avoided by keeping the inode as the `generic file', so that references such as ip->i_size continue to work; the disc-specific information goes elsewhere. This seems reasonable: an inode is an abstract file. The interface to a file system is similar to the interface to character or block device handlers. Kernel functions such as `read inode' are replaced by something like

	void readi(struct inode *ip)
	{
		(*filesw[ip->i_fs->m_type].f_readi)(ip);
	}

The function bodies are renamed and moved into files associated with the disc file system type; suitable entries are made in the switch table. More work is required to add indirection to file name lookup and when managing the `inode table' (iget, iput), and one might consider changing the calling conventions to hide from the file systems the structure of the per-process data area, but this should give the basic idea.
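To make the shape of the switch concrete, here is a toy rendering of the table that the readi example above indexes. The entry layout, field names beyond those in the text, and the demonstration `file system' are invented; a real switch would carry many more operations (writei, iget, iput, name lookup, and so on).

	struct mount { int m_type; };	/* which file system type is mounted */
	struct inode {			/* the `generic file' */
		struct mount *i_fs;
		int i_size;
	};

	struct filesw_entry {		/* one row per file system type */
		void (*f_readi)(struct inode *);
		void (*f_iput)(struct inode *);
	};

	/* A trivial demonstration `file system': reading just sets i_size. */
	static void demo_readi(struct inode *ip) { ip->i_size = 42; }
	static void demo_iput(struct inode *ip) { (void)ip; }

	static struct filesw_entry filesw[] = {
		{ demo_readi, demo_iput },	/* type 0: the demo file system */
	};

	/* The generic operation, as in the text: dispatch through the
	   type recorded in the mount table entry. */
	void readi(struct inode *ip)
	{
		(*filesw[ip->i_fs->m_type].f_readi)(ip);
	}

The parallel with the character and block device switches should be evident: the mount entry's type field plays the role of a major device number.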

Within any file system, there have always been different kinds of files: directories, devices, sockets, FIFOs, symbolic links, and the `regular' file. In 7th Edition, each file is tagged with a value telling its kind (IFDIR, IFREG, etc). The internal versions of read and write can be applied to any kind of file. Reading a regular file yields its data, but reading a symbolic link produces a file name, and reading a directory gives a list of (inode-number, name) pairs in some format. The internal version of write can also write an (inode-number, name) pair into a directory; it thus creates a link.

The same scheme can be used with the file system switch. The implementation of `read' for the file system will look at the kind of file to decide what to do. Note that all file systems must agree on a standard format in which they will exchange directory entries with the kernel during read and write. This need not be related to the directory format that the file system uses internally. For example, /proc makes up directory entries `on the fly' out of the contents of the kernel's process table. It is not necessary to add a special `readdir' or `getdirents' call.

Implementation in this way puts the file system switch below the bulk of the UNIX system call/file system interface code. The job of implementing UNIX semantics (pathnames, checking permissions, managing times, inode data structure locking, etc.) is mostly done above that. The file system writer can concentrate on the rewarding job of obtaining the data to make curious things look like files and directories to processes.

3.2. VFS and vnodes

At first glance the VFS/vnode system is roughly similar to the above. The VFS and vnode operations structures are different types. The file system independent part of the VFS system implements the file system name space, including mount points; it interprets path names. Once a particular file system has yielded its root vnode to the VFS code when it is mounted in the file system hierarchy, or when it is crossed at a mount point, it plays no further part in the interpretation of operations on files in the file system it represents. They are directed through the operation structure attached to the vnode; the vnode operation set is extensive.

Sun's system is avowedly `object oriented': the group of function pointers linking to the operations isattached directly to the data structure or `object' representing the file system (VFS) or the file (vnode). One might then expect each vnode to accept just the operations it implements: the operation structure attached toa `directory' in some file system would accept `lookup', `rename', and other such directory requests, but its entry for `readlink' would point to an error routine. Similarly, a `symbolic link' would reject many otheroperations but accept `readlink'. That is not what happens. In all the file systems in SunOS 3.5, every vnode in a given file system has the same operation set, which accepts any operation that is allowed bysome kind of file in the file system. Each operation then checks a `file kind' tag to see what sort of file it is, just like the other systems. (Indeed, it is sometimes checked twice: at the system call level, and by thevnode operation. The second check is not redundant, since the NFS server takes a shortcut, bypassing the first check by calling vnode operations directly.) As things stand, the operations are not really associatedwith the vnode at all, but with the collection of vnodes inside a particular file system, and the contents of the operation set are known statically. Putting the pointer in the vnode itself is just an optimisation. 3.3. Adding UNIX semantics The VFS/vnode system is said to ``support (but not require) UNIX file system access semantics'' [Kle86].This is accurate; the use of the phrase

`access semantics' is telling. On a system call, the VFS code looksup parameter pathname(s) to obtain a vnode (or two). After that, unlike the older file system switch, VFS in

most cases does little more than change the representation of its parameters and call the vnode routines. Allthe file system types that I wished to add were UNIX file system types, but it appeared each would have to maintain its own UNIX file system state and behaviour in the handling of user and group IDs, permissionschecking, and the like. Scanning the code of the existing file systems confirmed this impression, and also revealed repetitions of significant code sequences. For instance, specfs, nfs and ufs all contained code to doIO through the buffer cache.

Nfs and ufs separately implemented the checks of the access system calls. What a good way to get several (slightly different) copies of UNIX!

On the VAX system, the only file system types were UNIX ones, and so UNIX semantics had been implemented just once, above the file systems, by the system call level. I decided to provide a file system `base implementation' that would support UNIX file system implementations. The kernel continues to operate on vnodes using Sun's interface and operations, although the VFS level was changed to provide vnode locking primitives, as a service to be used as required by the lower levels. Most of the real work is done by a new file system type.
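The kind of common code a base implementation can factor out is easy to illustrate with the classic UNIX permission check. The following is a sketch, not the SunOS code; the function name and credential layout are invented, and superuser execute subtleties are ignored.

```python
# Sketch of a shared UNIX permission check, done once in a `base'
# layer rather than once per file system (all names hypothetical).
R, W, X = 4, 2, 1

def unix_access(mode, file_uid, file_gid, cred_uid, cred_gids, want):
    """True if the given credentials may access a file with this
    mode/owner in the ways requested by `want' (a mask of R, W, X)."""
    if cred_uid == 0:                 # superuser passes
        return True
    if cred_uid == file_uid:
        bits = (mode >> 6) & 7        # owner bits
    elif file_gid in cred_gids:
        bits = (mode >> 3) & 7        # group bits
    else:
        bits = mode & 7               # other bits
    return (bits & want) == want

# rw-r--r-- owned by uid 100, gid 10:
assert unix_access(0o644, 100, 10, 100, [10], R | W)   # owner may write
assert not unix_access(0o644, 100, 10, 200, [10], W)   # group may not
assert not unix_access(0o644, 100, 10, 300, [20], W)   # others may not
```

Nfs and ufs each carried a version of exactly this logic; with a base layer, a new file system type simply inherits it.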

The unix file type attends to the standard UNIX semantics for files, including maintenance of user and group IDs, protections, file mode, file times, size, provision of a basic set of values for stat (`getattr'), and a basic protection policy. It simplifies implementation of the proc file system, a memory file system (which allows a directory structure and files to be created in main memory), and some others to be described presently. It currently does not handle the ufs file system type (the 4.2 disc file system).

It also supports several types of anonymous vnodes, file types that lack names in the file system hierarchy: currently pipe, fifo, socket, and config are included. The pipe and fifo types provide vnodes for all stream IO (not just IO on pipes), the fifo subtype differing mainly in the open/close protocol. The socket and config types need more explanation.

3.4. Reconsidering socket and device switches

In 7th Edition, the system's `open file' structure referred directly to the inode corresponding to that file. In 4.2BSD a level of indirection was added, to direct IO on a file descriptor either to the socket IO system or to the (inode) file system. The set of operations included read/write, close, ioctl and select, precisely a subset of the vnode operations added later, yet both mechanisms remained in the Sun system after the VFS was added at the next level down. The socket type provides suitable vnodes for IO on sockets, and has replaced the older two-tier mechanism. Having all files in the system accessible through the file system interface (inode or vnode) makes it more likely that applications can use any kind of file in most contexts.
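The point about the 4.2BSD indirection can be made concrete: the per-descriptor operation set is a strict subset of the vnode operations, so a socket vnode type subsumes it. The sketch below is illustrative only; the operation names follow the text, but the structures are invented, and the vnode operation list is representative rather than complete.

```python
# Sketch: before, each descriptor chose between two dispatch tables
# (socket ops vs inode ops); after, a socket is just another vnode
# type, reached through the one file system interface.
bsd_fileops = {"read", "write", "close", "ioctl", "select"}

vnode_ops_names = bsd_fileops | {"open", "lookup", "create", "remove",
                                 "rename", "readlink", "getattr", "setattr"}

# Every operation reachable through the old socket/file indirection
# is already expressible as a vnode operation:
assert bsd_fileops <= vnode_ops_names

def make_vnode(kind, readfn):
    return {"kind": kind, "ops": {"read": readfn}}

# With sockets as vnodes, generic code need not know the difference:
file_v = make_vnode("file", lambda: "file data")
sock_v = make_vnode("socket", lambda: "socket data")
for v in (file_v, sock_v):
    print(v["kind"], "->", v["ops"]["read"]())
```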

A character device is represented by open, close, read, write, ioctl and select operations selected by indirection through a `device switch'. Since vnodes provide a superset of these operations and the operation set can be unique to the vnode, perhaps the device switch is redundant. The (experimental) config file system can potentially replace the device switches. It provides a device name space (hierarchy) derived automatically from the machine's configuration. Each file in it has its own set of operations that refer to the appropriate device handler routines. Multiplexors are represented by directories that generate channel vnodes on demand. (This might serve to eliminate major/minor device numbers from the system, removing the need to make them larger on big systems.) Config differs from specfs (Sun's special devices file system) in that it provides a mountable name space.
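How a config-style name space might be derived from the machine's configuration can be sketched like this. Everything here is hypothetical: the device names, the flat configuration table, and the handler functions are invented for illustration, since the real config file system is described only in outline above.

```python
# Sketch: building a device name hierarchy directly from a machine
# configuration table, each leaf carrying its own operation set
# (device names and handlers invented for illustration).
machine_config = [
    ("console", {"read": lambda: "console input"}),
    ("disk/sd0", {"read": lambda: "block data"}),
    ("disk/sd1", {"read": lambda: "block data"}),
]

def build_tree(config):
    root = {}
    for path, ops in config:
        d = root
        parts = path.split("/")
        for part in parts[:-1]:       # intermediate directories
            d = d.setdefault(part, {})
        d[parts[-1]] = ops            # the vnode *is* the device here
    return root

tree = build_tree(machine_config)
print(sorted(tree))                    # ['console', 'disk']
print(sorted(tree["disk"]))            # ['sd0', 'sd1']
print(tree["console"]["read"]())       # console input
```

Note that no major/minor device numbers appear anywhere: the position in the hierarchy, not a number pair, identifies the device.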

Specfs makes a vnode for a device; config's vnodes are the devices (or pseudo-devices). It is an experimental file system because I am uncertain how best to handle the file attributes. Originally in UNIX, all file systems were on disc, saving the permissions, ownership, and file times of device inodes over a reboot. Currently, because config is a virtual hierarchy, without persistent store, attributes on some files must be set by /etc/rc when the default attributes are not acceptable. This might turn out to be the best way.

3.5. Summary

I intended not so much to reduce the cost or size of the original implementation but rather to investigate its structure and provide a framework in which new (UNIX) file system types can be added easily and cheaply.

The separation of responsibilities between the system call and vnode layers, the split between file system types and file types, and the location of error checking all seem unsatisfactory. It seems ridiculous to me that a UNIX system would provide so little support for UNIX semantics in the file system (the heart of UNIX) that such things as permissions checking should be implemented several times. I question the value of the current vnode system as a cost-effective tool when building UNIX file system implementations.

4. STREAMS

4.1. Streams (and STREAMS)

The stream IO system was originally developed by Dennis Ritchie for the 8th Edition [Rit84]. It was later adopted by System V [Dun86]. I implemented a stream IO system on our VAX in 1985, because we were installing various kinds of communications and networking facilities, and I thought stream IO would help to make sense of it all. Sockets would also have been available then, but streams had the advantage that they were much more than just a networking system. Stream IO is an elegant abstraction which is also extremely useful! What characteristics appealed to me?

- a stream is a full-duplex buffered channel into the kernel
- a stream contains a sequence of bytes punctuated by delimiters
- data is transformed only by processing modules on the stream
- data is read and written using read and write
- there are essentially no options

These help to make streams neutral in their application, in the same way that any data can be written in any format to a UNIX file, or sent through a UNIX pipeline. By contrast, sockets provide options for datagrams, streams, sequenced packets, and a `raw' interface: they clearly provide a networking interface.

There are two ways of dealing with networks in UNIX using streams: make the complex network look like a collection of simple streams [Pre85, Pre86], or change the stream system to look more like a network.
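The first approach, a stack of simple processing modules through which data flows, can be sketched as a toy model. Real stream modules are kernel code with queues and flow control, none of which appears here; the class and module names are invented.

```python
# Toy model of stream processing modules: data written at one end
# passes through each pushed module in turn (no queues, no flow
# control; purely an illustration of the composition idea).
class Stream:
    def __init__(self):
        self.modules = []             # pushed modules, first pushed first
        self.buffer = b""

    def push(self, module):
        self.modules.append(module)

    def write(self, data):
        # The most recently pushed module is nearest the user,
        # so it processes written data first.
        for m in reversed(self.modules):
            data = m(data)
        self.buffer += data

    def read(self):
        data, self.buffer = self.buffer, b""
        return data

upcase = lambda data: data.upper()
crlf   = lambda data: data.replace(b"\n", b"\r\n")

s = Stream()
s.push(upcase)     # neither module knows of the other's existence
s.push(crlf)
s.write(b"hello\n")
print(s.read())    # b'HELLO\r\n'
```

The modules compose freely because each sees only a byte sequence, the neutrality the list above is describing.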

AT&T took the latter approach when the stream IO system became ``STREAMS technology'' in System V. It acquired some new features: multiplexor drivers; the contents of a stream is now a sequence of messages, which are themselves sequences of bytes; and there are now options at the stream head, including IO modes. IO modes include `message discard', `message non-discard', and `byte stream': they provide different ways of reading the data on the stream.

The result is something of a cross between the original stream system and sockets. The basic STREAMS implementation is consequently somewhat larger than mine: the support for streams, queues, and buffering takes 29,640 bytes in SunOS 4.0, compared to my system's 9,180 bytes. (The STREAMS figure does not include support added recently for independent `out of band' message sequences within one stream [Rag89].)

4.2. Streams in SunOS

Last year, I installed a revised version of my streams system on our Sun system. Ritchie found that his kernel became appreciably simpler, smaller, and more useful when streams replaced some character devices and line disciplines. I rather hoped that might happen here. First, I replaced pipes by full-duplex streams, and FIFOs (named pipes) by half-duplex streams. Then, I began to convert character IO device drivers to streams. I wrote a tty editing module, based on Jim McKie's `EUUG' tty driver. Some user-level programs were changed either to take advantage of the new structure, or to do work that the kernel once did. For example, /etc/init and /etc/getty push the tty module onto a terminal stream after opening it. Berkeley pseudo-terminals were discarded, and programs that used them were modified to use the appropriate streams mechanisms instead.

After each stage the new kernel was indeed smaller, even though some new functionality had been added (for instance, `message' and `connection' modules) [Pre85]. Certainly, one hopes for this result when replacing an accumulation of many special cases and conventions by something with a clean, general model. Much of the saving comes from eliminating device and line discipline `read' and `write' interface routines (the streams code does the work).

Sun added STREAMS to SunOS 4.0. In every case the comparable STREAMS versions have grown slightly larger, not smaller, than the components they replace. (The compatibility module ttcompat is smaller than the old tty driver, but it simply translates ioctl calls for the new tty module ldterm, which is larger.) The increase in the size of the tty modules on conversion to STREAMS is possibly because the new ones implement all the System V conventions. Unfortunately, some of those were artifacts of the old character IO system, and were overdue for replacement.

A good example is the MIN/TIME facility of the old terminal driver [Mic88]. It takes a full A4 page of typeset text to describe the effect of different combinations of MIN and TIME values, and the restrictions! One case provided buffering on a bursty asynchronous line (eg, for uucp): the driver waits until characters stop arriving frequently enough or a minimum number accumulates. Another allowed a process to wait until at least one character arrived or a timer expired. The first case could now be handled by pushing onto the stream a buffering module similar to bufld(4) in the Ninth Edition [Lab86], which can be applied to any stream, not just one with tty editing. (In fact, some programs that currently use MIN/TIME do not want tty editing.) The second looks like an application for poll or select. Instead, new mechanisms were added at the stream head to support the feature.
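That second case (wait until at least one character arrives or a timer expires) needs nothing from the driver at all; at user level it is a single poll call. A sketch using Python's wrapper around poll(2), with a pipe standing in for the terminal line:

```python
# Waiting for at least one character or a timeout, expressed with
# poll(2) rather than driver MIN/TIME machinery (a pipe stands in
# for a terminal line here).
import os, select

r, w = os.pipe()
p = select.poll()
p.register(r, select.POLLIN)

# Nothing written yet: poll times out after 100 ms and returns [].
assert p.poll(100) == []

# One character arrives: poll reports the descriptor readable at once.
os.write(w, b"x")
events = p.poll(100)
assert events and events[0][0] == r
print("readable:", os.read(r, 1))     # readable: b'x'
```

Because the timeout lives in the poll call, not in per-driver state, the same idiom works on any stream, with or without tty editing pushed onto it.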

Streams also allow (indeed encourage) modules to be composed freely. Thus with streams, a Berkeley pseudo-terminal can be replaced by a pipe with message and tty modules installed appropriately. Neither module knows of the other's existence. (We implement simple text windows on our Sun 3/50s in this way; a more useful `editing' shell window needs only the message module.) Of course, the particular application program must be revised accordingly, since the interfaces are radically different. In SunOS, the pseudo-terminals were instead `converted' to streams, but their rather intricate nature required the use of an `intra-stream' protocol between the pseudo-terminal parts and ldterm.

4.3. Conclusion

There is no point in trying blindly to reproduce or even to emulate all of the old mechanisms using the stream primitives, since the advantages of streams are quickly diminished, and the original point of adding them lost. `Conversion' to streams provides a good opportunity to retire some of the older ideas that have seen their best days (when they were the only way to do something). For instance, the Berkeley pseudo-terminals could have been left in the system just as they were for compatibility, for a time, while people converted their programs to use the new streams facilities suitably. Eventually we could toss the old rubbish away, and be left with a clean, coherent system. Instead, we end up with new bugs in the old things, but old bugs in the new things!

5. RE-IMPLEMENTATION

A common theme of the paging and streams sections above is `re-implementation'. Many versions of UNIX were mainly ported and adapted to virtual memory machines, not thoroughly re-implemented. What is the distinction?

    By re-implementation, we mean re-programming and altering the structure of an existing system where necessary, so as to better implement the function of that system. We distinguish re-implementation from two other activities. These are the transportation of an existing system to new hardware, and the design of a new system. If we transport a system, we leave its function and its structure unaltered. Re-implementation is not merely a combination of transportation and re-design, but is a unique activity in its own right. It is doing it again, but very well this time. [She77] [my italics]

Re-implementation treats the old version of the system as a prototype of the desired system. Functions may be redistributed, added, or dropped. At no point, however, is the intent to provide a base for a completely different system: all the mechanisms are honed to implement precisely the key ideas of the original system. For example, Andrew Hume's mk [Hum87] is a very good re-implementation of make: the underlying model and syntax of the two programs are obviously closely related, but in the new program the old ideas have been carefully refined, and it interacts more smoothly with the surrounding environment. In contrast, both Mach [Acc86] and Amoeba [Mul89] are new systems that can emulate the UNIX environment well enough to run UNIX programs, but the UNIX interface is not necessarily the best way to use them. I think the essential ideas of UNIX are still worthwhile, but I also think it will take a good re-implementation to bring new life to them.

6. CONCLUSIONS

Although recent versions of UNIX have finally adopted some of the clever ideas from `research' and elsewhere, this evidently has not resulted in a great reduction in system size or complexity. The new mechanisms have been modified (or mangled) to emulate the clumsy features they were intended to replace. It is important to keep a sense of proportion when deciding how much `backwards compatibility' to maintain.

It is dangerous to place too much hope in any improvement coming from just following new fashions, if we lack insight into what really went wrong before. Without that insight, I suspect that rewriting UNIX in C++, for example, could easily become an excuse for increasing complexity (because by using C++ `we can handle more complexity').

The system we use has been put together using a traditional UNIX approach: make a careful selection of mechanisms, and if possible, steal good ideas from elsewhere, but ensure they work together and fit the underlying system structure. The result is a more powerful system which is less greedy for memory (and my time). It still leaves room for Thompson's ``much-needed gap'' [Tho75].

Acknowledgements

Discussions with Martin Atkins about the state of UNIX and software in general have been particularly profitable, and he made very helpful comments on drafts of this paper. Martin, Peter Whysall and several others in the Department have been `friendly users' of this software, made many good suggestions, and have given much encouragement. Dennis Ritchie very kindly sent a copy of the Ninth Edition manual.

References

Acc86. M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian and M. Young, ``Mach: A New Kernel Foundation for UNIX Development'', USENIX Association Conference Proceedings, Atlanta, Georgia, pp. 93-113 (Summer 1986).
Akh87. P. Akhtar, ``A Replacement for Berkeley Memory Management'', USENIX Association Conference Proceedings, Phoenix, Arizona, pp. 69-80 (Summer 1987).
Bra89. A. Braunstein, M. Riley and J. Wilkes, ``Improving the efficiency of UNIX file buffer caches'', Operating Systems Review 23(5), pp. 71-82 (1989).
Dun86. B. Duncanson, K. F. Storm and A. Facius, ``EUUG Technical Session Report: The Streams facility in UNIX System V rel. 3'', European UNIX User Group Newsletter 6(1), p. 13, European UNIX System User Group Meeting, Bella Center, Copenhagen, 10th-13th September 1985 (Spring 1986).
Hum87. A. Hume, ``Mk: A Successor to Make'', USENIX Association Conference Proceedings, Phoenix, Arizona, pp. 445-458 (Summer 1987).
Kil84. T. J. Killian, ``Processes as Files'', USENIX Association Conference Proceedings, Salt Lake City, Utah, pp. 203-207 (Summer 1984).
Kle86. S. R. Kleiman, ``Vnodes: An Architecture for Multiple File System Types in Sun UNIX'', USENIX Association Conference Proceedings, Atlanta, Georgia, pp. 238-247 (Summer 1986).
Lab86. AT&T Bell Laboratories, in UNIX Programmer's Manual, Ninth Edition, Volume One, ed. M. D. McIlroy, Murray Hill, New Jersey (September 1986).
Lef89. S. J. Leffler, M. K. McKusick, M. J. Karels and J. S. Quarterman, The Design and Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, New York (1989).
Mic88. Sun Microsystems, p. 1307 in UNIX Interface Reference Manual, SunOS 4.0, Mountain View, California (May 1988).
Mil84. R. Miller, ``A Demand Paging Virtual Memory Manager for System V'', USENIX Association Conference Proceedings, Salt Lake City, Utah, pp. 178-182 (Summer 1984).
Mul89. S. J. Mullender, ``Amoeba - High-Performance Distributed Computing'', European UNIX Systems User Group Spring Conference, Brussels, Belgium, pp. 17-26 (April 1989).
Pik83. R. Pike, ``Graphics in Overlapping Bitmap Layers'', ACM Transactions on Graphics 2(2), pp. 135-160 (April 1983).
Pik84. R. Pike, ``The Blit: A Multiplexed Graphics Terminal'', AT&T Bell Laboratories Technical Journal 63(8 part 2), pp. 1607-1632 (October 1984).
Pre85. D. L. Presotto and D. M. Ritchie, ``Interprocess Communication in the 8th Edition UNIX System'', USENIX Association Conference Proceedings, Portland, Oregon, pp. 309-316 (Summer 1985).
Pre86. D. L. Presotto, ``The Eighth Edition Unix Connection Service'', European UNIX Systems User Group Autumn Conference, Florence, Italy, pp. 1-9 (April 1986).
Rag89. S. Rago, ``Out-Of-Band Communications in STREAMS'', USENIX Association Conference Proceedings, San Francisco, California (June 1989).
Ree81. D. J. Rees, ``The Structure of the EMAS 2900 Kernel'', CSR-91-81, University of Edinburgh, Department of Computer Science (August 1981).
Rif86. A. P. Rifkin, M. P. Forbes, R. L. Hamilton, M. Sabrio, S. Shah and K. Yueh, ``RFS Architectural Overview'', European UNIX User Group Newsletter 6(2), pp. 13-23 (Summer 1986).
Rit84. D. M. Ritchie, ``A Stream Input-Output System'', AT&T Bell Laboratories Technical Journal 63(8 part 2), pp. 1897-1910 (October 1984).
Rod86. R. Rodriguez, M. Koehler and R. Hyde, ``The Generic File System'', USENIX Association Conference Proceedings, Atlanta, Georgia, pp. 260-269 (Summer 1986).
Ros89. D. Rosenthal, ``More Haste, Less Speed'', European UNIX Systems User Group Spring Conference, Brussels, Belgium, pp. 123-130 (April 1989).
She74. N. A. Shelness, P. D. Stephens and H. Whitfield, ``The Edinburgh Multi-Access System Scheduling and Allocation Procedures in the Resident Supervisor'', EMAS Report 4, University of Edinburgh, Department of Computer Science (April 1974).
She77. N. H. Shelness, D. J. Rees, P. D. Stephens and J. K. Yarwood, ``An Experiment in Doing it Again, But Very Well This Time'', CSR-18-77, University of Edinburgh, Department of Computer Science (December 1977).
Ste80. P. D. Stephens, J. K. Yarwood, D. J. Rees and N. H. Shelness, ``The Kernel of the EMAS 2900 Operating System'', SOFTWARE - Practice and Experience 12(7), pp. 655-668 (July 1982).
Str86. J. H. Straathof, A. K. Thareja and A. K. Agrawala, ``UNIX Scheduling for Large Systems'', USENIX Association Conference Proceedings, Denver, Colorado, pp. 111-139 (Winter 1986).
Tho75. K. Thompson, ``The UNIX Command Language'', pp. 375-384 in Structured Programming--Infotech State of the Art Report, Infotech International Ltd., Nicholson House, Maidenhead, Berkshire, England (March 1975).
Wei84. P. J. Weinberger, ``The Version 8 Network File System (Abstract)'', USENIX Association Conference Proceedings, Salt Lake City, Utah, p. 86 (Summer 1984).