More Taste: Less Greed?
Sending UNIX to the Fat Farm
C H Forsyth
Department of Computer Science
University of York
York YO1 5DD
+44 904 432753
You (like us) have 80 to 90 Sun 3/50 machines with 4 megabytes of memory. You
have been given some optical discs containing System V.4. Which can you least afford to
discard? Things are getting out of hand. Soon, 32 megabyte Ultrasparks will be needed
just for us to start the window system in under a minute.
For UNIX, now in middle-age rotundly recalling its sprightly youth, mere `tuning' will
not cause that heavy code to slip away. We need to reconsider and re-implement the sys-
tem interface periodically, to take account of changes in its environment. We must be
willing to
throw things away
, discarding parts of the older implementation completely,
rather than corrupting clean new mechanisms to approximate the mistakes of the past.
To illustrate this thesis, I shall discuss work I have done on SunOS 3.5 to reduce its size
and complexity. The virtual memory system has been replaced by a simpler one using
ideas from the EMAS system and elsewhere. A stream IO system in the 8th/9th Edition
style has been added, replacing the old character IO system, pipes, and UNIX domain
sockets. I have also made some preliminary forays into the Virtual File System interface.
Many papers describe various ways of dealing with the complexity of `programming in the large'. There
are fewer that take a large program apart to try to find out why it is large, how it grew, and whether it can be
made rather smaller. The commonly available versions of the UNIX system have been growing relentlessly
with each new release and each new feature; the `standard' UNIX kernel is now a moderately large pro-
gram, surrounded by very large applications. I have watched it become much harder to understand, main-
tain, and modify. This troubles me directly (since I provide system software support), and so over the past
few years I have made various attempts, very much part-time, to recode or rewrite parts of the system to be
smaller and simpler. The work began on a VAX with the 3rd Berkeley Distribution, and continues on the
Sun 3 series with SunOS 3.5. The changes have been growing more extensive with time.
Here, I shall pick examples from various parts of the system to show how certain designs match the struc-
ture of UNIX well, and thus make things easier; and conversely, where there is a serious clash, the imple-
mentation becomes much more complicated and the facilities less useful. It can also be less painful in the
end to replace some older things completely, rather than pile hack upon hack. I claim it is a mistake to stan-
dardise (or to have standardised) certain parts of the system interface in their current form.
The UNIX kernel I ran until 1987 on our VAX machines was a `re-port' of 7th Edition to the VAX, but it
included: a recoded 3BSD paging system, many devices, an X.25 interface, and my own implementations of
ideas from 8th Edition UNIX (which was not generally available), specifically the file system switch
[Kil84], and a `stream' IO system [Rit84]. The latter included pseudo-terminals, the mes-
sage and connection disciplines, and stream drivers and modules that provided some of the Ethernet and
Internet protocols. There were also two peculiar special files.
By the standards then prevailing the kernel was acceptably small:

/unix: 95280+20400+146252 = 261932b = 0x3ff2c
(All kernel tables are included in the data and bss sizes; IO buffers are not.) Starting with `layers' [Pik83]
code by Simon Kenyon, Nigel Roles wrote an implementation of Pike's window system
[Pik84] that
was good enough to run programs such as the text editor
for the Atari ST. We had a promising start.
In 1987, however, we decided for financial reasons to switch to Suns (discless 3/50s supported by file
servers). I was disconcerted, not to say disheartened, to find that the Sun kernel code was about 390 kbytes,
even properly configured (as distributed it was 490k). It had notionally been built on 4.2 BSD, but the
changes to support the Virtual File System and `vnodes' [Kle86] were extensive, although the character IO
system and socket systems were largely unaffected.
I decided to discover why the system was so much larger and (apparently) complicated than previous UNIX
systems. I thought that a good way to do this was gradually to replace parts of it, and compare the size and
complexity of the old and new versions. Rewriting code forces one to (re)consider all the problems the ori-
ginal author was trying to address. Furthermore, a `paper' redesign would not only be unconvincing, it
would do little to provide the facilities, particularly streams, that had been lost by discarding the VAX system.
A significant constraint was that the system must run SunOS 3.x binaries: we had software such as
Framemaker for which we had no source, and programs would also need to be exportable without change.
Excluding devices, the larger components of the kernel were: the virtual memory system; the window sys-
tem; the 4.2BSD File System; basic kernel functions including fork, exec, signals, and scheduling; the net-
work file system; sockets and buffering, including pipes; virtual file system support; and the network
software. I decided to look at a few of them more closely, taking advantage of the experience gained on the VAX.
Some degree of improvement was ultimately made in all areas except the Fast File System and the network
side of the socket code. (Eventually both might be replaced.) The CPU scheduling code was replaced by a
variant of a simple yet flexible scheduler developed at the University of Maryland [Str86].
The new system has been in use in various forms for more than a year on several Sun 3/50s and one 3/160
in the Department. Depending on configuration, it is currently about 120kbytes smaller than the old (the
Sun window system still takes about 60 kbytes in the current kernel). There is still much scope for
improvement, but even now the system does more, uses memory more efficiently, and is (I claim) better.
In the following sections I shall discuss the changes to the paging system, changes made or currently being
made to the virtual file system, the installation of the stream IO system, and the implications of all that for
kernel implementation and applications.
The SunOS paging system was completely replaced. The new system uses a `working set' paging scheme
with a local paging policy, instead of the previous global one. Some implementation ideas, and also exten-
sive justification for this approach, were found in the published descriptions of the EMAS system [Ree81,
Ste80, Ste80]. Two other virtual memory systems, by Richard Miller [Mil84], and by a group at Gould
[Akh87], both written for versions of UNIX, were also examined closely (our various implementations
differ). My goal was to provide a reasonable implementation of the basic UNIX process model on a
machine with virtual memory, not to add new primitives; I wished the result to have very few moving parts.
The use of the EMAS scheduling algorithm is perhaps the most novel aspect.
2.1. The original paging algorithm
The Berkeley paging system uses a `clock' algorithm to approximate LRU page replacement across the sys-
tem. This has undergone some adjustments with time, but the basic scheme is as follows. (See [Lef89] for
precise details.) When a process requires a new page it simply takes a free one from the global page pool as
soon as one is available. Pages are made available when processes terminate, but also by the intervention of
a system process, the `pageout daemon'. There is a big array, the `core map', with one entry for each
usable page of memory. The pageout daemon cycles through the array at a rate determined by some
scheduling parameters. Free pages are skipped. Any other page is examined to see if it can be discarded.

The daemon checks the page table entry for the page to see if the `page used' bit is set. If not, the page can
be made available. If it has been modified, it must first be written out (asynchronously). If the page has
been referenced, the daemon resets the `used' bit and proceeds to the next core map entry. If the page has
not been touched the next time round, it will be freed. On systems with large memories, a complete cycle
can take some time, and so recent versions have the daemon cycling two pointers through the map: one
clears `used' bits; the other reclaims pages.
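As a concrete illustration, the two-handed scan might be sketched as below. The structures, sizes, and names here are invented for illustration; they are not the actual SunOS core map.

```c
#include <stddef.h>

/* Sketch of the two-handed clock: one hand clears `used' bits, and the
   other, a fixed distance behind, frees pages still unreferenced. */
enum { NPAGES = 8 };

struct cmap_entry {
    int free;           /* on the free list: skipped by both hands */
    int used;           /* the `page used' bit from the page table entry */
    int dirty;          /* must be written out before it can be freed */
};

struct cmap_entry cmap[NPAGES];

/* Front hand: reset the `used' bit so the page must prove itself again. */
void clear_hand(size_t i) {
    if (!cmap[i].free)
        cmap[i].used = 0;
}

/* Back hand: a clean page still unreferenced is freed; a dirty one is
   queued for an asynchronous write (modelled here as becoming clean). */
void reclaim_hand(size_t i) {
    if (cmap[i].free || cmap[i].used)
        return;
    if (cmap[i].dirty)
        cmap[i].dirty = 0;
    else
        cmap[i].free = 1;
}

/* One daemon cycle, the reclaiming hand trailing by `gap' entries. */
void clock_cycle(size_t gap) {
    for (size_t i = 0; i < NPAGES; i++) {
        clear_hand(i);
        reclaim_hand((i + NPAGES - gap) % NPAGES);
    }
}
```

A page referenced between the passes of the two hands keeps its `used' bit set and survives; the gap between the hands fixes the rough reference window.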
In principle, this is a simple and elegant technique. It has been applied quite effectively in a controlled
environment in a new implementation of file cacheing in UNIX using virtual memory [Bra89]. There are,
however, several problems in practice with the algorithm when applied to UNIX processes, at least in its
SunOS 3.5/BSD 4.2 version: the core map must keep a back reference to a page table entry, a process's
working set size is not always accurate, and the memory scheduling algorithms are elaborations of the early
`swapping' algorithms.
The daemon must be able to locate `the page table entry' for each page. The core map entry for a physical
page must therefore point back to the page table entry that refers to it, if there is one. Map entries for data
and stack pages store a (process, page) pair. Pages of an executable program, though, can be referenced by
several such entries, which must all be found. The solution adopted is to store (text-table, page) in the core
map. Processes using a given executable are linked in a chain from its text table entry. Whenever one of
the processes' page table entries changes, owing to page in or page out, all others must change to match it.
The pageout daemon looks round the list too, in order to calculate the union of used/modify bits for all
processes referring to the page.
When the page daemon takes a page, it invalidates the page table entry but leaves a reference to the page
(the page frame number), so that should the process soon fault on that page it can reclaim the page directly
from the free list. Before any process is given a free page, the memory allocator must therefore clear any
reference to it, using the data in the core map in much the same way as the pageout daemon. Just to liven
things up, in 4.2BSD the network system can claim pages at interrupt level using the memory allocator, so
that a process's page table entries can be changed by an interrupt! (In practice this is not a problem pro-
vided care is taken at the kernel main level, but it is something else to have to reason about when checking
the code.)
To reduce overhead, the old pageout daemon does not run until the amount of free memory drops below a
specified value. It then cycles round the core map at a rate that depends on the amount of memory left. If it
cannot free enough memory quickly enough, the memory scheduler (swapper) is started. The swapper syn-
chronously writes to disc clumps of the modified pages from the whole image of each process it selects.
2.2. Simplified paging
Some of the complexity in the old paging system results from using a global paging policy, which causes
pointers to go `the wrong way' for UNIX, where pointers elsewhere almost invariably go from a process to
a shared resource which is given a reference count. For instance, when sharing executable code, pointers go
from the process to a `text table' which has the reference count. When sharing open files, pointers go from
a per-process open file table to a reference-counted system open file table (which has the seek pointer),
which points in turn to the file's in-core `inode' (or vnode), which again has a reference count. In contrast,
the solution adopted for sharing text pages is adequate in practice, but becomes messier when data pages are
to be shared (for instance, in some implementations of `fork').
Pointers, then, should go from the process to memory (the shared resource). Pages can be shared easily if
there is a reference count for each page. In the new system, a process releases its own pages when it no
longer requires them. Periodically, each process examines its page table entries and discards pages that
have not been used since the last such scan. It resets the `used' bit on the remaining pages. Over time, the
process tracks its own working set.
When a process discards a page during a working set scan or a complete page out, the page is put in either a
free page or modified page list depending on whether it has been modified since being read in. Pages on the
modified list are written to disc by a separate system process (the new `pageout daemon'). It does not write
pages out immediately, but only when the modified page list grows too long, or when the free list length
drops below a specified level. This increases the chance that the page can be reclaimed by its process
without IO overhead. Pages are returned to the free list after they have been written out. The new pageout
daemon is a simple loop: it waits for work, collects pages from the modified list, and writes them out.
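The scan just described might look roughly like this in outline. All names and structures are invented; the real routines of course manipulate hardware-defined page tables.

```c
#include <stddef.h>

/* Sketch of a per-process working-set scan: the process itself discards
   pages not referenced since the previous scan. */
enum { NPTE = 4 };

struct pte { int valid, used, modified; };

struct proc {
    struct pte pt[NPTE];
    int nfree, nmodified;   /* pages handed to each global list this scan */
};

/* One scan: unreferenced pages go to the free or modified-page list;
   the survivors have their `used' bits reset for the next interval. */
void ws_scan(struct proc *p) {
    p->nfree = p->nmodified = 0;
    for (size_t i = 0; i < NPTE; i++) {
        struct pte *e = &p->pt[i];
        if (!e->valid)
            continue;
        if (e->used) {          /* still in the working set */
            e->used = 0;
            continue;
        }
        e->valid = 0;           /* discard: clean pages are freed at once, */
        if (e->modified)        /* dirty ones await the pageout daemon */
            p->nmodified++;
        else
            p->nfree++;
    }
}
```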
By itself, use of a working set scan in each running process is not enough to prevent thrashing, since the
combined working sets of all such processes might not fit in memory. The collection of processes in

memory must be subject to some global constraint. The most obvious constraint to apply is that before a
process is loaded there must be enough store to hold its working set. This typically changes with time,
however, and so a scheduler is required that can adapt to these changes.
The EMAS system has an elegant scheduling system [She74] that was chosen for use in our system. Each
process in EMAS is assigned a category, a single number that captures its current resource requirements,
and also determines its CPU and memory priority, working set scan frequency, and CPU quantum size.
When loaded by the scheduler, a process is given a page and CPU time allocation determined by its
category. It remains loaded and in the CPU despatcher's queues until it exhausts either allocation or goes to
sleep. The process must then submit itself for rescheduling.
The scheduler refers to a resource model for the system, encoded in a category table which is indexed by
process categories. The table can be viewed as a little state machine with transitions from category to
category. There are four transitions from each category, labelled `more memory', `more time', `unused
memory', and `unused time'. The scheduler follows the appropriate path to obtain a new category that suits
the process. In this way the scheduler quickly tracks the character of a process as it changes.
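Such a category table might be encoded along the following lines; the categories, allocations, and transitions shown are invented for illustration, not EMAS's actual tables (which were tuned by measurement).

```c
/* Sketch of a category table: a small state machine with four labelled
   transitions per category. */
enum transition { MORE_MEMORY, MORE_TIME, UNUSED_MEMORY, UNUSED_TIME };

struct category {
    int pages;      /* working-set (memory) allocation while loaded */
    int quantum;    /* CPU allocation, in ticks */
    int next[4];    /* successor category for each transition */
};

/* Three invented categories: small interactive, medium, large/batch. */
static const struct category cat_table[3] = {
    /* pages quantum  MORE_MEM MORE_TIME UNUSED_MEM UNUSED_TIME */
    {   32,    2,   {    1,       1,        0,         0      } },
    {   96,    6,   {    2,       2,        0,         0      } },
    {  256,   20,   {    2,       2,        1,         1      } },
};

/* At reschedule time a process follows the edge matching its recent
   behaviour, so its category quickly tracks its changing character. */
int recategorise(int cat, enum transition why) {
    return cat_table[cat].next[why];
}
```

A single table lookup replaces any elaborate priority calculation, and changing policy means changing data, not code.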
The rescheduled process may remain loaded if it is runnable, its new allocation fits in the available memory,
and no higher priority process is waiting. It must otherwise page itself out, and join a priority queue to wait
for memory. A process sleeping on an external event must likewise page itself out whilst waiting unless the
memory it occupies is not in demand. When it wakes, it joins the appropriate memory queue. A memory
scheduler process cycles through the priority queues using an auxiliary priority ratio table, loading each
process in turn as memory becomes available. (See the EMAS papers for details. Wilkes [Wil75] includes
the EMAS scheme in an interesting discussion of paging viewed as a problem in control theory.)
The notion of an `external event' is hard to sustain in modern UNIX systems. Instead, the priority of a
sleep is assumed to say something about the likely speed of response. When a process sleeps at high prior-
ity, it records the unused part of its quantum, adjusts its CPU priority (see below), and enters the sleep
queues as usual. A process sleeping at low priority first checks with the memory scheduler before entering
the sleep queues. If its memory is required now, it will page itself out, as in EMAS, but otherwise it puts
itself in a priority queue of processes that are `idle' in memory.
The new memory scheduler has a simple loop: it checks the `wait memory' queues in the order given by the
priority ratio table, and once it has found a process, it waits for memory to become available, as lower prior-
ity processes page themselves out. It also watches for the arrival of idle processes, and when memory is
needed causes them to page themselves out, starting with low priority processes, as soon as they have been
idle `long enough' (a scheduling parameter).
The original scheduler had the hardware clock interrupt increment a `time' field in each process descriptor
to decide how long a process had been in or out of memory, or sleeping. Berkeley moved this to a function
called periodically by a software clock, but it still required a scan of all living processes. Instead, in this
system, a process is simply time stamped when it enters any (FIFO) scheduling queue. (Currently the stamp
is the time in ticks since system startup.)
2.3. Improving UNIX implementation
The previous section gave an overview of the old and new paging systems. Let us look now at some areas
in more detail, to compare the complexity of the implementations.
The fork call makes a copy of the caller's memory image. As many have observed, this is often
wasteful, since in most cases the fork precedes an exec which throws the copy away! Paging everything
in to make the copy will also disrupt tracking of the working set. Fortunately, copying is just the
specification; we are free to implement it in the most efficient way. Many have suggested (and several have
implemented) sharing memory between parent and child until one of them writes to a page; the write is
intercepted, the page is copied, and the write is completed in a private copy of the page. On some machines
copy-on-write is hard or costly to implement, but copy-on-reference can be used instead and should be
nearly as effective in the most common case (speeding up fork followed by exec).
Berkeley nevertheless implemented fork by complete copy, and added a new system call, vfork.
The parent passes its page tables to the child. The parent is suspended, marked `no VM', and linked to the
child, which is marked `in vfork'. The child process returns from vfork and continues to run in the parent's
memory until exec or exit, at which time it wakes its parent, and waits. The parent grabs its memory
back, resets some pointers, and allows the child to proceed. (The `waits' above refers to the kernel
sleep/wakeup synchronisation system, not to the C library function.) Several parts of the system must
check for `in vfork' (including the terminal handler!). Certain system calls must synchronise with the
vfork protocol, and the pageout daemon must follow the vfork links to find the current user of the page if
its true owner is marked `no VM'. Thus fork is moderately expensive, and vfork is rather odd.
The new system does use copy on write. The fork call copies the swap maps, incrementing the swap
reference counts; copies the page tables setting `copy on write' state and read-only access; and increments a
core map reference count for valid pages (a page will not be put in the free list until the reference count
goes to zero). There can then be two kinds of page access faults. On a `write protect' error, a page that is
no longer shared can have its permissions restored; a shared page must be copied. In either case the page
becomes private to the process. (A private page also results from a `page missing' fault if the page is `fill
on demand'.) Including the code to implement reference counts on swap pages, no more code is required to
implement copy on write in this system than vfork took in the old.
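The two `write protect' fault cases might be sketched as follows. The names, structures, and toy page pool are all invented for illustration; the real code manipulates hardware page tables and the free list.

```c
/* Sketch of copy-on-write fault handling: a page still shared must be
   copied; one no longer shared just has its access restored. */
struct page { int refcount; };
struct cpte {                   /* a page table entry, much simplified */
    struct page *pg;
    int cow;                    /* marked copy-on-write by fork */
    int writable;
};

static struct page pool[4];     /* stands in for the free page list */
static int pool_next;

static struct page *page_alloc(void) { return &pool[pool_next++]; }
static void page_copy(struct page *dst, struct page *src) {
    (void)dst; (void)src;       /* contents copy elided in this sketch */
}

void write_fault(struct cpte *e) {
    if (!e->cow)
        return;                     /* a genuine protection error */
    if (e->pg->refcount == 1) {
        e->writable = 1;            /* last user: restore permissions */
    } else {
        struct page *np = page_alloc();
        page_copy(np, e->pg);
        e->pg->refcount--;          /* old page may still be shared */
        np->refcount = 1;
        e->pg = np;
        e->writable = 1;
    }
    e->cow = 0;                     /* either way, the page is now private */
}
```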
Page theft.
In the original system, the pageout daemon wanders round the core map stealing pages. Each
time, it must check the state of the process owning the page. If it is in a safe state, it temporarily maps in
the process's kernel data area to access the swap maps. If the page being stolen had been modified, the dae-
mon mapped it into its own VM and started asynchronous IO.
In the new system, nothing happens `asynchronously' to a process as regards the paging system, since it
either discards its own pages or must queue itself to allow the memory scheduler to do so. It simply does
neither if it is in an inconvenient state, or if scheduling policy does not yet require it. The new working set
scan routines
come to 51 lines including some optional event tracing calls
(except when a page must be freed, no functions are called during a scan), but many more lines of code
have been removed.
Reclaiming pages.
In either system, a process might lose or discard a page, then require it again before it
has been taken from the free list for use elsewhere. As mentioned above, the old system leaves the page
table entry non-zero to allow reclaim, but clears it as soon the page is taken for use elsewhere. It also has a
hash table, in which pages can be found using the pair (source-vnode,block) as a key.
In his system, Richard Miller [Mil84] had an array of disc block addresses, one per memory page, each
entry containing either zero or the current swap block address of the page. When a page table entry was
invalidated, the page frame number was left non-zero. Before reclaiming a page, the system checked that
the swap block address matched that expected by the current process. No action was necessary if the page
was claimed by another process. A similar scheme is used here (the swap address is kept in the core map,
for use by the pageout daemon). There is currently no need for a hash table. Each shared image (text) entry
provides a `master page table' that tracks the location of executable pages, which might page out from one
frame and page in elsewhere some time later (but which should still be shared).
A swap operation in the old system writes out all the modified pages in a process. There are
special support routines that find contiguous regions of dirty pages that are then split to fit in contiguous
regions of swap space. There are some others that, during page out, find modified memory pages that are
adjacent in virtual memory to the one selected by the page out daemon.
In the new system, a swap is just a complete process page out, which simply frees the process's memory;
clean pages go on the free list, and dirty ones go on the modified page list. If dirty page table entries are
adjacent, and they are freed in sequence, and at the same time, their associated pages will be adjacent on the
modified page list. The core map entries for the pages have the swap addresses attached to assist page
reclaim. The pageout daemon simply checks adjacent modified page list entries to see if the swap blocks
are adjacent and if so writes them out with one IO request. This gives the effect of a `partial swap' at little
cost. The same effect is obtained by a process that releases adjacent dirty pages during a working set scan.
The cost of a bad scheduling decision (eg, a sleeping process is swapped and immediately wakes up) is
reduced, since a process can reclaim pages from the modified page list before they are written. (The per-
process data and page tables are paged out and reclaimed in the same way.)
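The coalescing check might be sketched like this, with invented names; the stand-in swap_write simply counts IO requests rather than doing IO.

```c
#include <stddef.h>

/* Sketch of the `partial swap' effect: walk the modified-page list and
   merge runs of consecutive swap blocks into one IO request. */
enum { MAXRUN = 16 };

struct mpage {
    long swapblk;           /* swap address kept in the core map entry */
    struct mpage *next;
};

static int nwrites;
static void swap_write(long blk, int n) { (void)blk; (void)n; nwrites++; }

/* The pageout daemon's loop body: one write per run of adjacent blocks. */
void pageout(struct mpage *list) {
    while (list) {
        struct mpage *run = list;
        int n = 1;
        while (list->next && n < MAXRUN &&
               list->next->swapblk == list->swapblk + 1) {
            list = list->next;
            n++;
        }
        swap_write(run->swapblk, n);
        list = list->next;
    }
}
```

Pages freed in sequence land adjacent on the list, so a full or partial swap naturally collapses into a few large writes.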
User control.
The use of working sets should help to address some of the problems mentioned by
Rosenthal [Ros89], since the scheduler can more accurately track a process's memory requirements. Even
an overenthusiastic process can be kept within bounds, paging mainly against itself, so that on a single-user
machine for instance, a text editor running in a windowing system at the same time as such a process has a
better chance of remaining responsive compared to the old system.
Scheduling policy can be controlled in several ways, most simply by changing the category table, although
Stephens et al. [Ste80] do warn that the effects of doing so are not always easy to predict, and might require
experiment. The scheduling parameters and tables are accessible through special files in a directory,
making it easy to give workstation users control over the allocation of resources to
processes on their machines. Windowing processes can be given a set of categories disjoint from the others,
just as EMAS provided different sets of categories for interactive and batch processes.
2.4. Summary
The new paging system is just over half the size of the old one. I would claim that the new one is easier to
reason about, since the various processes involved interact only when each is in a well defined state. They
co-operate: a process does not barge into another's memory without knocking.
The first `file system switch' appeared in the 8th Edition. Several interesting file system types were imple-
mented using it: the network file system
[Wei84] and the process file system
[Kil84]. Others have
followed suit: the Generic File System [Rod86], and the Remote File System [Rif86] use (somewhat more
elaborate) file switches. Sun developed the Virtual File System (VFS) and vnodes [Kle86].
I had implemented my own impression of the 8th Edition file system switch on our VAX. (`My own ...'
because very few details had been published about the 8th Edition switch.) When we then moved to Suns, I
took several of the older file system types, and tried to adapt them to work in the VFS/vnode
system. I found it harder to use: it did no work for me! I did produce an acceptable solution to my immedi-
ate problems, but the VFS interfaces remain unsatisfactory. Closer study revealed some confusion and limi-
tations in its design and implementation. To see why, we must look a little closer at what these systems do.
3.1. A simple UNIX file system switch
Adding a file system switch in the 8th Edition style to the 7th Edition system is fairly easy to do. A value is
added to each mount table entry that tells which type of file system is mounted there. That value is used as
an index into a table of file system operations, and the parts of the mount table entry used only by the disc
file system are moved elsewhere. I found that extensive changes to the kernel source code could be avoided
by keeping the inode as the `generic file', so that existing references continue to work; the
disc-specific information goes elsewhere. This seems reasonable: an inode is an abstract file. The interface
to a file system is similar to the interface to character or block device handlers. Kernel functions such as
`read inode' are replaced by something like
void readi(struct inode *ip)
The function bodies are renamed and moved into files associated with the disc file system type; suitable
entries are made in the switch table. More work is required to add indirection to file name lookup and when
managing the `inode table' (iget, iput), and one might consider changing the calling conventions to hide
from the file systems the structure of the per-process data area, but this should give the basic idea.
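A minimal sketch of such a switch, with invented names and two toy file system types, might look like this:

```c
/* Sketch of a file system switch: a per-type table of operations,
   indexed by a tag carried with the mount entry (here, in the inode).
   All names are invented for illustration. */
struct inode { int fstype; };          /* the `generic file' */

struct fsops {
    void (*readi)(struct inode *ip);   /* `read inode' and friends become */
    void (*writei)(struct inode *ip);  /* indirect calls through the table */
};

/* Two toy file system types: a disc file system and a synthetic one. */
static int calls[2];
static void dk_readi(struct inode *ip)   { (void)ip; calls[0]++; }
static void dk_writei(struct inode *ip)  { (void)ip; }
static void syn_readi(struct inode *ip)  { (void)ip; calls[1]++; }
static void syn_writei(struct inode *ip) { (void)ip; }

static const struct fsops fstab[] = {
    { dk_readi,  dk_writei  },
    { syn_readi, syn_writei },
};

/* The kernel's readi() no longer knows which file system it is using. */
void readi(struct inode *ip) {
    fstab[ip->fstype].readi(ip);
}
```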
Within any file system, there have always been different kinds of files: directories, devices, sockets, FIFOs,
symbolic links, and the `regular' file. In 7th Edition, each file is tagged with a value telling its kind (IFDIR,
IFREG, etc). The internal versions of read and write can be applied to any kind of file. Reading a regular
file yields its data, but reading a symbolic link produces a file name, and reading a directory gives a list of
(inode-number, name) pairs in some format. The internal version of write can also insert an
(inode-number, name) pair into a directory; it thus creates a link.
The same scheme can be used with the file system switch. The implementation of `read' for the file system
will look at the kind of file to decide what to do. Note that all file systems must agree on a standard format
in which they will exchange directory entries with the kernel during read and write. This need not be
related to the directory format that the file system uses internally. For example, the process file system
makes up directory entries `on the fly' out of the contents of the kernel's process table. It is not necessary
to add a special
`readdir' or `getdirents' call.
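A sketch of such an exchange record follows; the layout and the 14-byte name limit are invented here, for illustration only.

```c
#include <string.h>

/* Sketch of a standard exchange record: whatever a file system's internal
   layout, reading a directory yields (inode-number, name) records in one
   agreed form. */
struct xdirent {
    long d_ino;
    char d_name[14];
};

/* A file system inventing entries `on the fly' fills in the same record;
   the return value is the number of bytes the caller's read consumes. */
int gen_dirent(struct xdirent *d, long ino, const char *name) {
    if (strlen(name) >= sizeof d->d_name)
        return -1;              /* name will not fit the agreed format */
    d->d_ino = ino;
    strcpy(d->d_name, name);
    return (int)sizeof *d;
}
```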
Implementation in this way puts the file system switch below the bulk of the UNIX system call/file system
interface code. The job of implementing UNIX semantics (pathnames, checking permissions, managing
times, inode data structure locking, etc.) is mostly done above that. The file system writer can concentrate
on the rewarding job of obtaining the data to make curious things look like files and directories to processes.

3.2. VFS and vnodes
At first glance the VFS/vnode system is roughly similar to the above. The VFS and vnode operations struc-
tures are different types. The file system independent part of the VFS system implements the file system
name space, including mount points; it interprets path names. Once a particular file system has yielded its
root vnode to the VFS code when it is mounted in the file system hierarchy, or when it is crossed at a mount
point, it plays no further part in the interpretation of operations on files in the file system it represents. They
are directed through the operation structure attached to the vnode; the vnode operation set is extensive.
Sun's system is avowedly `object oriented': the group of function pointers linking to the operations is
attached directly to the data structure or `object' representing the file system (VFS) or the file (vnode). One
might then expect each vnode to accept just the operations it implements: the operation structure attached to
a `directory' in some file system would accept `lookup', `rename', and other such directory requests, but its
entry for `readlink' would point to an error routine. Similarly, a `symbolic link' would reject many other
operations but accept `readlink'. That is not what happens. In all the file systems in SunOS 3.5, every
vnode in a given file system has the same operation set, which accepts any operation that is allowed by
some kind of file in the file system. Each operation then checks a `file kind' tag to see what sort of file it is,
just like the other systems. (Indeed, it is sometimes checked twice: at the system call level, and by the
vnode operation. The second check is not redundant, since the NFS server takes a shortcut, bypassing the
first check by calling vnode operations directly.) As things stand, the operations are not really associated
with the vnode at all, but with the collection of vnodes inside a particular file system, and the contents of the
operation set are known statically. Putting the pointer in the vnode itself is just an optimisation.
3.3. Adding UNIX semantics
The VFS/vnode system is said to ``support (but not require) UNIX file system access semantics'' [Kle86].
This is accurate; the use of the phrase `access semantics' is telling. On a system call, the VFS code looks
up parameter pathname(s) to obtain a vnode (or two). After that, unlike the older file system switch, VFS in
most cases does little more than change the representation of its parameters and call the vnode routines. All
the file system types that I wished to add were UNIX file system types, but it appeared each would have to
maintain its own UNIX file system state and behaviour in the handling of user and group IDs, permissions
checking, and the like. Scanning the code of the existing file systems confirmed this impression, and also
revealed repetitions of significant code sequences. For instance, several file system types all contained
code to do IO through the buffer cache, and some separately implemented the checks of certain
system calls.
What a good way to get several (slightly different) copies of UNIX!
On the VAX system, the only file system types were UNIX ones, and so UNIX semantics had been imple-
mented just once above the file systems by the system call level. I decided to provide a file system `base
implementation' that would support UNIX file system implementations. The kernel continues to operate on
vnodes using Sun's interface and operations, although the VFS level was changed to provide vnode locking
primitives, as a service to be used as required by the lower levels. Most of the real work is done by a new
file system type.
This base file type attends to the standard UNIX semantics for files: maintenance of user and group IDs,
protections, file mode, file times, and size; provision of a basic set of attribute values (`getattr');
and a basic protection policy. It simplifies the implementation of several other file system types, among
them one that allows a directory structure and files to be created in main memory, and some others to be
described presently. It currently does not handle the 4.2 disc file system type.
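A sketch of what the `base implementation' argument amounts to: the classical owner/group/other permission check written once and shared by every UNIX file system type, instead of repeated in each. The structures and the name base_access are invented for illustration, and the check is simplified (for instance, the super-user case here ignores execute bits).

```c
#include <assert.h>

/* Invented sketch: classical UNIX permission checking done once in a
 * shared `base' layer rather than per file system type. */

struct uattr {                     /* attributes the base type maintains */
    unsigned short ua_uid, ua_gid;
    unsigned short ua_mode;        /* low nine bits: rwxrwxrwx */
};

struct ucred { unsigned short cr_uid, cr_gid; };

/* want: mask of 4 (read), 2 (write), 1 (execute); returns 0 on success. */
static int base_access(struct uattr *ap, struct ucred *cp, unsigned want)
{
    unsigned m = ap->ua_mode;
    if (cp->cr_uid == 0)
        return 0;                  /* super-user passes (simplified) */
    if (cp->cr_uid == ap->ua_uid)
        m >>= 6;                   /* use the owner bits */
    else if (cp->cr_gid == ap->ua_gid)
        m >>= 3;                   /* use the group bits */
    return ((m & 7 & want) == want) ? 0 : -1;
}
```

Each concrete file system type would then call this helper rather than carrying its own copy, which is precisely the duplication the scan of the existing code revealed.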
It also supports several types of anonymous vnodes, file types that lack names in the file system hierarchy.
Among them, two stream types provide vnodes for all stream IO (not just IO on pipes), one subtype differing
from the other mainly in the open/close protocol; there are also socket and device types. The socket and
device types need more explanation.
3.4. Reconsidering socket and device switches
In 7th Edition, the system's `open file' structure referred directly to the inode corresponding to that file. In
4.2BSD a level of indirection was added, to direct IO on a file descriptor either to the socket IO system or to
the (inode) file system. The set of operations included read/write, close, ioctl and select, precisely a subset
of the vnode operations added later, yet both mechanisms remained in the Sun system after the VFS was
added at the next level down. The socket type provides suitable vnodes for IO on sockets, and has replaced
the older two-tier mechanism. Having all files in the system accessible through the file system interface
(inode or vnode) makes it more likely that applications can use any kind of file in most contexts.

A character device is represented by open, close, read, write, ioctl and select operations selected by indirec-
tion through a `device switch'. Since vnodes provide a superset of these operations and the operation set
can be unique to the vnode, perhaps the device switch is redundant.
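A user-space caricature of that redundancy: the classical switch selects a handler set by major device number, while a per-vnode operations pointer reaches the same handlers directly, so neither the switch nor the major number is needed for dispatch. All names here are invented.

```c
#include <assert.h>

/* Caricature of the character-device switch versus per-vnode operations.
 * Names are invented; real switches carry open/close/ioctl/select too. */

struct dev_ops { int (*d_read)(void); };

static int mem_read(void) { return 1; }     /* stand-in handlers */
static int tty_read(void) { return 2; }

/* The switch: an extra level of indirection keyed by major number. */
static struct dev_ops cdevsw_model[] = { { mem_read }, { tty_read } };

/* A device vnode pointing straight at its handler set. */
struct devvnode { struct dev_ops *v_ops; };

static int vnode_read(struct devvnode *vp) { return vp->v_ops->d_read(); }
```

Once every device vnode carries its own operations pointer, the major number survives only as a way of *building* that pointer, which is what the device file system described next supplies.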
The (experimental) device file system can potentially replace the device switches. It provides a device name
space (hierarchy) derived automatically from the machine's configuration. Each file in it has its own set of
operations that refer to the appropriate device handler routines. Multiplexors are represented by directories
that generate channel vnodes on demand. (This might serve to eliminate major/minor device numbers from
the system, removing the need to make them larger on big systems.) It differs from Sun's special devices
file system in that it provides a mountable name space: Sun's file system makes a vnode for a device,
whereas here the vnodes are the devices (or pseudo-devices). It is an experimental file system because I am
uncertain how best to handle the file attributes. Originally in UNIX, all file systems were on disc, saving
the permissions, ownership, and file times of device inodes over a reboot. Currently, because the device
hierarchy is virtual, without persistent store, attributes on some files must be set explicitly when the
default attributes are not acceptable. This might turn out to be the best way.
3.5. Summary
I intended not so much to reduce the cost or size of the original implementation but rather to investigate its
structure and provide a framework in which new (UNIX) file system types can be added easily and cheaply.
The separation of responsibilities between the system call and vnode layers, the split between file system
types and file types, and the location of error checking all seem unsatisfactory. It seems ridiculous to me
that a UNIX system would provide so little support for UNIX semantics in the file system (the heart of
UNIX) that such things as permissions checking should be implemented several times. I question the value
of the current vnode system as a cost-effective tool when building UNIX file system implementations.
4.1. Streams (and STREAMS)
The stream IO system was originally developed by Dennis Ritchie for the 8th Edition [Rit84]. It was later
adopted by System V [Dun86]. I implemented a stream IO system on our VAX in 1985, because we were
installing various kinds of communications and networking facilities, and I thought stream IO would help to
make sense of it all. Sockets would also have been available then, but streams had the advantage of being
much more than just a networking system. Stream IO is an elegant abstraction which is also extremely use-
ful! What characteristics appealed to me?
- a stream is a full-duplex buffered channel into the kernel
- a stream contains a sequence of bytes punctuated by delimiters
- data is transformed only by processing modules on the stream
- data is read and written using read and write
- there are essentially no options
These help to make streams general in their application, in the same way that any data can be written in any
format to a UNIX file, or sent through a UNIX pipeline. By contrast, sockets provide options for
datagrams, streams, sequenced packets, and a `raw' interface: they clearly provide something far more
specialised.
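The delimiter idea in particular is easy to model in user space: a write may mark a message boundary, and a read stops at the first boundary it meets, yet the data remain an undifferentiated byte sequence. Everything below is an invented sketch, not kernel code; the real kernel queues are of course more elaborate.

```c
#include <assert.h>
#include <string.h>

/* Toy model of a stream queue: bytes punctuated by delimiters.
 * Names are invented; no flow control or blocking is modelled. */

#define QMAX 256

struct squeue {
    char buf[QMAX];
    unsigned char delim[QMAX];   /* 1 => a delimiter follows buf[i] */
    int head, tail;
};

/* Append n bytes; if mark is set, delimit the end of this message. */
static void s_write(struct squeue *q, const char *p, int n, int mark)
{
    memcpy(q->buf + q->tail, p, n);
    memset(q->delim + q->tail, 0, n);
    q->tail += n;
    if (mark && q->tail > 0)
        q->delim[q->tail - 1] = 1;
}

/* Read at most n bytes, stopping after a delimiter, as read(2) on a
 * delimited stream would. */
static int s_read(struct squeue *q, char *p, int n)
{
    int i = 0;
    while (i < n && q->head < q->tail) {
        p[i++] = q->buf[q->head];
        if (q->delim[q->head++])
            break;
    }
    return i;
}
```

Note that a reader who ignores delimiters still sees a plain byte sequence; the `no options' property falls out of keeping the mechanism this small.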
There are two ways of dealing with networks in UNIX using streams: make the complex network look like
a collection of simple streams [Pre85, Pre86], or change the stream system to look more like a network.
AT&T took the latter approach when the stream IO system became ``STREAMS technology'' in System V.
It acquired some new features: multiplexor drivers; the contents of a stream are now a sequence of messages,
which are themselves sequences of bytes; and there are now options at the stream head, including IO modes.
IO modes include `message discard', `message non-discard', and `byte stream': they provide different ways
of reading the data on the stream.
The result is something of a cross between the original stream system and sockets. The basic STREAMS
implementation is consequently somewhat larger than mine: the support for streams, queues, and buffering
takes 29,640 bytes in SunOS 4.0, compared to my system's 9,180 bytes. (The STREAMS figure does not
include support added recently for independent `out of band' message sequences within one stream [Rag89].)

4.2. Streams in SunOS
Last year, I installed a revised version of my streams system on our Sun system. Ritchie found that his kernel
became appreciably simpler, smaller, and more useful when streams replaced some character devices
and line disciplines. I rather hoped that might happen here. First, I replaced pipes by full-duplex streams,
and FIFOs (named pipes) by half-duplex streams. Then, I began to convert character IO device drivers to
streams. I wrote a tty editing module, based on Jim McKie's `EUUG' tty driver. Some user-level programs
were changed either to take advantage of the new structure, or to do work that the kernel once did. For
example, a program must now push the tty editing module onto a terminal stream after opening it.
Berkeley pseudo-terminals were discarded, and programs that used them were modified to use the appropriate
streams mechanisms instead.
After each stage the new kernel was indeed smaller, even though some new functionality had been added
(for instance, `message' and `connection' modules) [Pre85]. Certainly, one hopes for this result when
replacing an accumulation of many special cases and conventions by something with a clean, general
model. Much of the saving comes from eliminating device and line discipline `read' and `write' interface
routines (the streams code does the work).
Sun added STREAMS to SunOS 4.0. In every case the comparable STREAMS versions have grown
slightly larger, not smaller, than the components they replace. (The compatibility module is smaller
than the old tty driver, but it simply translates the old ioctl calls for the new tty module, which
is larger.) The increase in the size of the tty modules on conversion to STREAMS is possibly because the
new ones implement all the System V conventions. Unfortunately, some of those were artifacts of the old
character IO system, and were overdue for replacement.
A good example is the MIN/TIME facility of the old terminal driver [Mic88]. It takes a full A4 page of
typeset text to describe the effect of different combinations of MIN and TIME values, and the restrictions!
One case provided buffering on a bursty asynchronous line: the driver waits until characters
stop arriving frequently enough or a minimum number accumulates. Another allowed a process to wait
until at least one character arrived or a timer expired. The first case could now be handled by pushing onto
the stream a buffering module similar to one described in section 4 of the Ninth Edition manual [Lab86],
which can be applied to any stream, not just one with tty editing. (In fact, some programs that currently
use MIN and TIME do not want tty editing at all.) The second looks like an application for an ordinary
timeout set by the program. Instead, new mechanisms were added
at the stream head to support the feature.
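Assuming the facility described is the termio driver's MIN and TIME read parameters (the behaviour above matches their documented semantics), the combinations reduce to four regimes. The classifier below is a model of that documentation, not driver code, and its names are mine; even so, it hints at why a full page was needed once the restrictions were added.

```c
#include <assert.h>

/* The four MIN/TIME regimes of the old terminal driver, reduced to a
 * classifier.  Illustrative model only; enum names are invented. */

enum rd_behaviour {
    RD_POLL,        /* MIN = 0, TIME = 0: return whatever is there */
    RD_TIMEOUT,     /* MIN = 0, TIME > 0: first byte or overall timeout */
    RD_COUNT,       /* MIN > 0, TIME = 0: block until MIN bytes arrive */
    RD_INTERBYTE    /* MIN > 0, TIME > 0: MIN bytes or interbyte gap */
};

static enum rd_behaviour classify(int min, int time)
{
    if (min > 0)
        return time > 0 ? RD_INTERBYTE : RD_COUNT;
    return time > 0 ? RD_TIMEOUT : RD_POLL;
}
```

Each regime changes what a single read(2) means; a buffering module and a program-level timeout would have separated the two useful cases instead of overloading one pair of parameters.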
Streams also allow (indeed encourage) modules to be composed freely. Thus with streams, a Berkeley
pseudo-terminal can be replaced by a pipe with message and tty modules installed appropriately. Neither
module knows of the other's existence. (We implement simple text windows on our Sun 3/50s in this way;
a more useful `editing' shell window needs only the message module.) Of course, the particular application
program must be revised accordingly, since the interfaces are radically different. In SunOS, the
pseudo-terminals were instead `converted' to streams, but their rather intricate nature required the use
of an `intra-stream' protocol between the pseudo-terminal parts and the new tty module.
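The composition argument can be illustrated with a toy module stack: a framing module and an editing module each transform the data passing their point on the stream, in complete ignorance of one another. Both modules and their behaviour are invented for illustration; real stream modules are full-duplex and queue-based.

```c
#include <assert.h>
#include <string.h>

/* Toy model of freely composable stream modules.  Each module is a pure
 * transformation and knows nothing of its neighbours.  Invented names. */

typedef int (*module_fn)(char *buf, int n);

/* A toy `tty' module: apply erase ('#') processing to input. */
static int tty_erase(char *buf, int n)
{
    int i, out = 0;
    for (i = 0; i < n; i++) {
        if (buf[i] == '#') {
            if (out > 0)
                out--;             /* erase the previous character */
        } else
            buf[out++] = buf[i];
    }
    return out;
}

/* A toy `message' module: strip a one-byte frame header, oblivious
 * to whatever processing follows it. */
static int msg_strip(char *buf, int n)
{
    if (n > 0)
        memmove(buf, buf + 1, (size_t)--n);
    return n;
}

/* Pass data through a stack of modules, as a stream would. */
static int run_stack(module_fn *stack, int depth, char *buf, int n)
{
    int i;
    for (i = 0; i < depth; i++)
        n = stack[i](buf, n);
    return n;
}
```

Stacking msg_strip above tty_erase gives the pseudo-terminal replacement described above; removing either module requires no change to the other, which is exactly the property the intra-stream protocol forfeits.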
4.3. Conclusion
There is no point in trying blindly to reproduce or even to emulate all of the old mechanisms using the
stream primitives, since the advantages of streams are quickly diminished, and the original point of adding
them lost. `Conversion' to streams provides a good opportunity to retire some of the older ideas that have
seen their best days (when they were the only way to do something). For instance, the Berkeley pseudo-
terminals could have been left in the system just as they were for compatibility, for a time, while people
converted their programs to use the new streams facilities suitably. Eventually we could toss the old rubbish
away, and be left with a clean, coherent system. Instead, we end up with new bugs in the old things, but
old bugs in the new things!
A common theme of the paging and streams sections above is `re-implementation'. Many versions of
UNIX were mainly ported and adapted to virtual memory machines, not thoroughly re-implemented. What
is the distinction?
By re-implementation, we mean re-programming and altering the structure of an existing sys-
tem where necessary, so as to better implement the function of that system. We distinguish re-
implementation from two other activities. These are the transportation of an existing system to
new hardware, and the design of a new system. If we transport a system, we leave its function
and its structure unaltered. Re-implementation is not merely a combination of transportation
and re-design, but is a unique activity in its own right.
It is doing it again, but very well this time. [She77] [my italics]
Re-implementation treats the old version of the system as a prototype of the desired system. Functions may
be redistributed, added, or dropped. At no point, however, is the intent to provide a base for a completely
different system: all the mechanisms are honed to implement precisely the key ideas of the original system.
For example, Andrew Hume's mk [Hum87] is a very good re-implementation of make: the underlying
model and syntax of the two programs are obviously closely related, but in the new program the old ideas
have been carefully refined, and it interacts more smoothly with the surrounding environment. In contrast,
both Mach [Acc86] and Amoeba [Mul89] are new systems that can emulate the UNIX environment well
enough to run UNIX programs, but the UNIX interface is not necessarily the best way to use them.
I think the essential ideas of UNIX are still worthwhile, but I also think it will take a good re-
implementation to bring new life to them.
Although recent versions of UNIX have finally adopted some of the clever ideas from `research' and elsewhere,
this evidently has not resulted in a great reduction in system size or complexity. The new mechanisms
have been modified (or mangled) to emulate the clumsy features they were intended to replace. It is
important to keep a sense of proportion when deciding how much `backwards compatibility' to maintain.
It is dangerous to place too much hope in any improvement coming from just following new fashions, if we
lack insight into what really went wrong before. Without that insight, I suspect that rewriting UNIX in
C++, for example, could easily become an excuse for increasing complexity (because by using C++ `we can
handle more complexity').
The system we use has been put together using a traditional UNIX approach: make a careful selection of
mechanisms, and if possible, steal good ideas from elsewhere, but ensure they work together and fit the
underlying system structure. The result is a more powerful system which is less greedy for memory (and
my time). It still leaves room for Thompson's ``much-needed gap'' [Tho75].
Discussions with Martin Atkins about the state of UNIX and software in general have been particularly
profitable, and he made very helpful comments on drafts of this paper. Martin, Peter Whysall and several
others in the Department have been `friendly users' of this software, made many good suggestions, and
have given much encouragement. Dennis Ritchie very kindly sent a copy of the Ninth Edition manual.
Acc86. M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian and M. Young, ``Mach: A
New Kernel Foundation for UNIX Development'',
USENIX Association Conference Proceedings
Atlanta, Georgia, pp. 93-113 (Summer 1986).
Akh87. P. Akhtar, ``A Replacement for Berkeley Memory Management'',
USENIX Association Conference Proceedings
, Phoenix, Arizona, pp. 69-80 (Summer 1987).
Bra89. A. Braunstein, M. Riley and J. Wilkes, ``Improving the efficiency of UNIX file buffer caches'',
Operating Systems Review
23(5), pp. 71-82 (1989).
Dun86. B. Duncanson, K. F. Storm and A. Facius, ``EUUG Technical Session Report: The Streams facility
in UNIX System V rel. 3'',
European UNIX User Group Newsletter
(1), p. 13, European UNIX System User Group Meeting, Bella Center, Copenhagen, 10th-13th September 1985 (Spring 1986).
Hum87. A. Hume, ``Mk: A Successor to Make'',
USENIX Association Conference Proceedings
, Phoenix,
Arizona, pp. 445-458 (Summer 1987).
Kil84. T. J. Killian, ``Processes as Files'',
USENIX Association Conference Proceedings
, Salt Lake City,
Utah, pp. 203-207 (Summer 1984).
Kle86. S. R. Kleiman, ``Vnodes: An Architecture for Multiple File System Types in Sun UNIX'',
USENIX Association Conference Proceedings
, Atlanta, Georgia, pp. 238-247 (Summer 1986).
Lab86. AT&T Bell Laboratories, in
UNIX Programmer's Manual, Ninth Edition, Volume One
, ed. M. D.
McIlroy, Murray Hill, New Jersey (September 1986).

Lef89. S. J. Leffler, M. K. McKusick, M. J. Karels and J. S. Quarterman,
The Design and Implementation
of the 4.3BSD UNIX Operating System
, Addison-Wesley, New York (1989).
Mic88. Sun Microsystems, p. 1307 in
UNIX Interface Reference Manual, SunOS 4.0
, Mountain View,
California (May 1988).
Mil84. R. Miller, ``A Demand Paging Virtual Memory Manager for System V'',
USENIX Association
Conference Proceedings
, Salt Lake City, Utah, pp. 178-182 (Summer 1984).
Mul89. S. J. Mullender, ``Amoeba - High-Performance Distributed Computing'',
European UNIX Systems
User Group Spring Conference
, Brussels, Belgium, pp. 17-26 (April 1989).
Pik83. R. Pike, ``Graphics in Overlapping Bitmap Layers'',
ACM Transactions on Graphics
2(2), pp. 135-160 (April 1983).
Pik84. R. Pike, ``The Blit: A Multiplexed Graphics Terminal'',
AT & T Bell Laboratories Technical Journal 63(8 part 2), pp. 1607-1632 (October 1984).
Pre85. D. L. Presotto and D. M. Ritchie, ``Interprocess Communication in the 8th Edition UNIX System'',
USENIX Association Conference Proceedings
, Portland, Oregon, pp. 309-316 (Summer 1985).
Pre86. D. L. Presotto, ``The Eighth Edition Unix Connection Service'',
European UNIX Systems User
Group Autumn Conference
, Florence, Italy, pp. 1-9 (April 1986).
Rag89. Stephen Rago, ``Out-Of-Band Communications in STREAMS'',
Proc. 1989 Summer USENIX Conference, San Francisco, California (June 1989).
Ree81. D. J. Rees, ``The Structure of the EMAS 2900 Kernel'', CSR-91-81, University of Edinburgh, Dept
of Computer Science (August 1981).
Rif86. A. P. Rifkin, M. P. Forbes, R. L. Hamilton, M. Sabrio, S. Shah and K. Yueh, ``RFS Architectural
Overview'', European UNIX User Group Newsletter (2), pp. 13-23 (Summer 1986).
Rit84. D. M. Ritchie, ``A Stream Input-Output System'',
AT & T Bell Laboratories Technical Journal
63(8 part 2), pp. 1897-1910 (October 1984).
Rod86. R. Rodriguez, M. Koehler and R. Hyde, ``The Generic File System'',
USENIX Association Conference Proceedings
, Atlanta, Georgia, pp. 260-269 (Summer 1986).
Ros89. D. Rosenthal, ``More Haste, Less Speed'',
European UNIX Systems User Group Spring Conference, Brussels, Belgium, pp. 123-130 (April 1989).
She74. N. A. Shelness, P. D. Stephens and H. Whitfield, ``The Edinburgh Multi-Access System Scheduling
and Allocation Procedures in the Resident Supervisor'', EMAS Report 4, University of Edinburgh,
Department of Computer Science (April 1974).
She77. N. H. Shelness, D. J. Rees, P. D. Stephens and J. K. Yarwood, ``An Experiment in Doing it Again,
But Very Well This Time'', CSR-18-77, University of Edinburgh, Dept of Computer Science
(December 1977).
Ste80. D. J. Rees and P. D. Stephens, ``The Kernel of the EMAS 2900 Operating System'', SOFTWARE -
Practice and Experience 12(7), pp. 655-668 (July 1982).
Str86. J. H. Straathof, A. K. Thareja and A. K. Agrawala, ``UNIX Scheduling for Large Systems'',
USENIX Association Conference Proceedings
, Denver, Colorado, pp. 111-139 (Winter 1986).
Tho75. K. Thompson, ``The UNIX Command Language'', pp. 375-384 in
Structured Programming-
Infotech State of the Art Report
, Infotech International Ltd., Nicholson House, Maidenhead,
Berkshire, England (March 1975).
Wei84. P. J. Weinberger, ``The Version 8 Network File System (Abstract)'',
USENIX Association Conference Proceedings
, Salt Lake City, Utah, p. 86 (Summer 1984).
Wil75. M. V. Wilkes,
Time-Sharing Computer Systems, Third Edition
, Macdonald and Jane's, London