More Taste: Less Greed?
Sending UNIX to the Fat Farm
C H Forsyth
Department of Computer Science
University of York
York YO1 5DD
+44 904 432753
You (like us) have 80 to 90 Sun 3/50 machines with 4 megabytes of memory. You
have been given some optical discs containing System V.4. Which can you least afford to
discard? Things are getting out of hand. Soon, 32 megabyte Ultrasparks will be needed
just for us to start the window system in under a minute.
For UNIX, now in middle-age rotundly recalling its sprightly youth, mere `tuning' will
not cause that heavy code to slip away. We need to reconsider and re-implement the sys-
tem interface periodically, to take account of changes in its environment. We must be
willing to
throw things away
, discarding parts of the older implementation completely,
rather than corrupting clean new mechanisms to approximate the mistakes of the past.
To illustrate this thesis, I shall discuss work I have done on SunOS 3.5 to reduce its size
and complexity. The virtual memory system has been replaced by a simpler one using
ideas from the EMAS system and elsewhere. A stream IO system in the 8th/9th Edition
style has been added, replacing the old character IO system, pipes, and UNIX domain
sockets. I have also made some preliminary forays into the Virtual File System interface.
Many papers describe various ways of dealing with the complexity of `programming in the large'. There
are fewer that take a large program apart to try to find out why it is large, how it grew, and whether it can be
made rather smaller. The commonly available versions of the UNIX system have been growing relentlessly
with each new release and each new feature; the `standard' UNIX kernel is now a moderately large pro-
gram, surrounded by very large applications. I have watched it become much harder to understand, main-
tain, and modify. This troubles me directly (since I provide system software support), and so over the past
few years I have made various attempts, very much part-time, to recode or rewrite parts of the system to be
smaller and simpler. The work began on a VAX with the 3rd Berkeley Distribution, and continues on the
Sun 3 series with SunOS 3.5. The changes have been growing more extensive with time.
Here, I shall pick examples from various parts of the system to show how certain designs match the struc-
ture of UNIX well, and thus make things easier; and conversely, where there is a serious clash, the imple-
mentation becomes much more complicated and the facilities less useful. It can also be less painful in the
end to replace some older things completely, rather than pile hack upon hack. I claim it is a mistake to stan-
dardise (or to have standardised) certain parts of the system interface in their current form.
The UNIX kernel I ran until 1987 on our VAX machines was a `re-port' of 7th Edition to the VAX, but it
included: a recoded 3BSD paging system, many devices, an X.25 interface, and my own implementations of
ideas from 8th Edition UNIX (which was not generally available), specifically the file system switch
[Kil84], and a `stream' IO system [Rit84]. The latter included pseudo-terminals, the mes-
sage and connection disciplines, and stream drivers and modules that provided some of the Ethernet and
Internet protocols. There were also two peculiar special files.
By the standards then prevailing the kernel was acceptably small:

/unix: 95280+20400+146252 = 261932b = 0x3ff2c
(All kernel tables are included in the data and bss sizes; IO buffers are not.) Starting with `layers' [Pik83]
code by Simon Kenyon, Nigel Roles wrote an implementation of Pike's window system
[Pik84] that
was good enough to run programs such as the text editor
for the Atari ST. We had a promising start.
In 1987, however, we decided for financial reasons to switch to Suns (discless 3/50s supported by file
servers). I was disconcerted, not to say disheartened, to find that the Sun kernel code was about 390 kbytes,
even properly configured (as distributed it was 490k). It had notionally been built on 4.2 BSD, but the
changes to support the Virtual File System and `vnodes' [Kle86] were extensive, although the character IO
system and socket systems were largely unaffected.
I decided to discover why the system was so much larger and (apparently) complicated than previous UNIX
systems. I thought that a good way to do this was gradually to replace parts of it, and compare the size and
complexity of the old and new versions. Rewriting code forces one to (re)consider all the problems the ori-
ginal author was trying to address. Furthermore, a `paper' redesign would not only be unconvincing, it
would do little to provide the facilities, particularly streams, that had been lost by discarding the VAX system.
A significant constraint was that the system must run SunOS 3.x binaries: we had software such as
Framemaker for which we had no source, and programs would also need to be exportable without change.
Excluding devices, the larger components of the kernel were: the virtual memory system; the window sys-
tem; the 4.2BSD File System; basic kernel functions including fork, exec, signals, and scheduling; the net-
work file system; sockets and buffering, including pipes; virtual file system support; and the network
software. I decided to look at a few of them more closely, taking advantage of the experience gained on the VAX.
Some degree of improvement was ultimately made in all areas except the Fast File System and the network
side of the socket code. (Eventually both might be replaced.) The CPU scheduling code was replaced by a
variant of a simple yet flexible scheduler developed at the University of Maryland [Str86].
The new system has been in use in various forms for more than a year on several Sun 3/50s and one 3/160
in the Department. Depending on configuration, it is currently about 120kbytes smaller than the old (the
Sun window system still takes about 60 kbytes in the current kernel). There is still much scope for
improvement, but even now the system does more, uses memory more efficiently, and is (I claim) better.
In the following sections I shall discuss the changes to the paging system, changes made or currently being
made to the virtual file system, the installation of the stream IO system, and the implications of all that for
kernel implementation and applications.
The SunOS paging system was completely replaced. The new system uses a `working set' paging scheme
with a local paging policy, instead of the previous global one. Some implementation ideas, and also exten-
sive justification for this approach, were found in the published descriptions of the EMAS system [Ree81,
Ste80, Ste80]. Two other virtual memory systems, by Richard Miller [Mil84], and by a group at Gould
[Akh87], both written for versions of UNIX, were also examined closely (our various implementations
differ). My goal was to provide a reasonable implementation of the basic UNIX process model on a
machine with virtual memory, not to add new primitives; I wished the result to have very few moving parts.
The use of the EMAS scheduling algorithm is perhaps the most novel aspect.
2.1. The original paging algorithm
The Berkeley paging system uses a `clock' algorithm to approximate LRU page replacement across the sys-
tem. This has undergone some adjustments with time, but the basic scheme is as follows. (See [Lef89] for
precise details.) When a process requires a new page it simply takes a free one from the global page pool as
soon as one is available. Pages are made available when processes terminate, but also by the intervention of
a system process, the `pageout daemon'. There is a big array, the `core map', with one entry for each
usable page of memory. The pageout daemon cycles through the array at a rate determined by some
scheduling parameters. Free pages are skipped. Any other page is examined to see if it can be discarded.

The daemon checks the page table entry for the page to see if the `page used' bit is set. If not, the page can
be made available. If it has been modified, it must first be written out (asynchronously). If the page has
been referenced, the daemon resets the `used' bit and proceeds to the next core map entry. If the page has
not been touched the next time round, it will be freed. On systems with large memories, a complete cycle
can take some time, and so recent versions have the daemon cycling two pointers through the map: one
clears `used' bits; the other reclaims pages.
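As a concrete illustration, the two-handed scan might be sketched as below. The structures, sizes, and names here are invented for illustration; they are not the actual SunOS core map.

```c
#include <stddef.h>

/* Sketch of the two-handed clock: one hand clears `used' bits, and the
   other, a fixed distance behind, frees pages still unreferenced. */
enum { NPAGES = 8 };

struct cmap_entry {
    int free;           /* on the free list: skipped by both hands */
    int used;           /* the `page used' bit from the page table entry */
    int dirty;          /* must be written out before it can be freed */
};

struct cmap_entry cmap[NPAGES];

/* Front hand: reset the `used' bit so the page must prove itself again. */
void clear_hand(size_t i) {
    if (!cmap[i].free)
        cmap[i].used = 0;
}

/* Back hand: a clean page still unreferenced is freed; a dirty one is
   queued for an asynchronous write (modelled here as becoming clean). */
void reclaim_hand(size_t i) {
    if (cmap[i].free || cmap[i].used)
        return;
    if (cmap[i].dirty)
        cmap[i].dirty = 0;
    else
        cmap[i].free = 1;
}

/* One daemon cycle, the reclaiming hand trailing by `gap' entries. */
void clock_cycle(size_t gap) {
    for (size_t i = 0; i < NPAGES; i++) {
        clear_hand(i);
        reclaim_hand((i + NPAGES - gap) % NPAGES);
    }
}
```

A page referenced between the passes of the two hands keeps its `used' bit set and survives; the gap between the hands fixes the rough reference window.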
In principle, this is a simple and elegant technique. It has been applied quite effectively in a controlled
environment in a new implementation of file cacheing in UNIX using virtual memory [Bra89]. There are,
however, several problems in practice with the algorithm when applied to UNIX processes, at least in its
SunOS 3.5/BSD 4.2 version: the core map must keep a back reference to a page table entry, a process's
working set size is not always accurate, and the memory scheduling algorithms are elaborations of the early
`swapping' algorithms.
The daemon must be able to locate `the page table entry' for each page. The core map entry for a physical
page must therefore point back to the page table entry that refers to it, if there is one. Map entries for data
and stack pages store a (process, page) pair. Pages of an executable program, though, can be referenced by
several such entries, which must all be found. The solution adopted is to store (text-table, page) in the core
map. Processes using a given executable are linked in a chain from its text table entry. Whenever one of
the processes' page table entries changes, owing to page in or page out, all others must change to match it.
The pageout daemon looks round the list too, in order to calculate the union of used/modify bits for all
processes referring to the page.
When the page daemon takes a page, it invalidates the page table entry but leaves a reference to the page
(the page frame number), so that should the process soon fault on that page it can reclaim the page directly
from the free list. Before any process is given a free page, the memory allocator must therefore clear any
reference to it, using the data in the core map in much the same way as the pageout daemon. Just to liven
things up, in 4.2BSD the network system can claim pages at interrupt level using the memory allocator, so
that a process's page table entries can be changed by an interrupt! (In practice this is not a problem pro-
vided care is taken at the kernel main level, but it is something else to have to reason about when checking
the code.)
To reduce overhead, the old pageout daemon does not run until the amount of free memory drops below a
specified value. It then cycles round the core map at a rate that depends on the amount of memory left. If it
cannot free enough memory quickly enough, the memory scheduler (swapper) is started. The swapper syn-
chronously writes to disc clumps of the modified pages from the whole image of each process it selects.
2.2. Simplified paging
Some of the complexity in the old paging system results from using a global paging policy, which causes
pointers to go `the wrong way' for UNIX, where pointers elsewhere almost invariably go from a process to
a shared resource which is given a reference count. For instance, when sharing executable code, pointers go
from the process to a `text table' which has the reference count. When sharing open files, pointers go from
a per-process open file table to a reference-counted system open file table (which has the seek pointer),
which points in turn to the file's in-core `inode' (or vnode), which again has a reference count. In contrast,
the solution adopted for sharing text pages is adequate in practice, but becomes messier when data pages are
to be shared (for instance, in some implementations of `fork').
Pointers, then, should go from the process to memory (the shared resource). Pages can be shared easily if
there is a reference count for each page. In the new system, a process releases its own pages when it no
longer requires them. Periodically, each process examines its page table entries and discards pages that
have not been used since the last such scan. It resets the `used' bit on the remaining pages. Over time, the
process tracks its own working set.
When a process discards a page during a working set scan or a complete page out, the page is put in either a
free page or modified page list depending on whether it has been modified since being read in. Pages on the
modified list are written to disc by a separate system process (the new `pageout daemon'). It does not write
pages out immediately, but only when the modified page list grows too long, or when the free list length
drops below a specified level. This increases the chance that the page can be reclaimed by its process
without IO overhead. Pages are returned to the free list after they have been written out. The new pageout
daemon is a simple loop: it waits for work, collects pages from the modified list, and writes them out.
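The scan just described might look roughly like this in outline. All names and structures are invented; the real routines of course manipulate hardware-defined page tables.

```c
#include <stddef.h>

/* Sketch of a per-process working-set scan: the process itself discards
   pages not referenced since the previous scan. */
enum { NPTE = 4 };

struct pte { int valid, used, modified; };

struct proc {
    struct pte pt[NPTE];
    int nfree, nmodified;   /* pages handed to each global list this scan */
};

/* One scan: unreferenced pages go to the free or modified-page list;
   the survivors have their `used' bits reset for the next interval. */
void ws_scan(struct proc *p) {
    p->nfree = p->nmodified = 0;
    for (size_t i = 0; i < NPTE; i++) {
        struct pte *e = &p->pt[i];
        if (!e->valid)
            continue;
        if (e->used) {          /* still in the working set */
            e->used = 0;
            continue;
        }
        e->valid = 0;           /* discard: clean pages are freed at once, */
        if (e->modified)        /* dirty ones await the pageout daemon */
            p->nmodified++;
        else
            p->nfree++;
    }
}
```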
By itself, use of a working set scan in each running process is not enough to prevent thrashing, since the
combined working sets of all such processes might not fit in memory. The collection of processes in

memory must be subject to some global constraint. The most obvious constraint to apply is that before a
process is loaded there must be enough store to hold its working set. This typically changes with time,
however, and so a scheduler is required that can adapt to these changes.
The EMAS system has an elegant scheduling system [She74] that was chosen for use in our system. Each
process in EMAS is assigned a category, a single number that captures its current resource requirements,
and also determines its CPU and memory priority, working set scan frequency, and CPU quantum size.
When loaded by the scheduler, a process is given a page and CPU time allocation determined by its
category. It remains loaded and in the CPU despatcher's queues until it exhausts either allocation or goes to
sleep. The process must then submit itself for rescheduling.
The scheduler refers to a resource model for the system, encoded in a category table which is indexed by
process categories. The table can be viewed as a little state machine with transitions from category to
category. There are four transitions from each category, labelled `more memory', `more time', `unused
memory', and `unused time'. The scheduler follows the appropriate path to obtain a new category that suits
the process. In this way the scheduler quickly tracks the character of a process as it changes.
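Such a category table might be encoded along the following lines; the categories, allocations, and transitions shown are invented for illustration, not EMAS's actual tables (which were tuned by measurement).

```c
/* Sketch of a category table: a small state machine with four labelled
   transitions per category. */
enum transition { MORE_MEMORY, MORE_TIME, UNUSED_MEMORY, UNUSED_TIME };

struct category {
    int pages;      /* working-set (memory) allocation while loaded */
    int quantum;    /* CPU allocation, in ticks */
    int next[4];    /* successor category for each transition */
};

/* Three invented categories: small interactive, medium, large/batch. */
static const struct category cat_table[3] = {
    /* pages quantum  MORE_MEM MORE_TIME UNUSED_MEM UNUSED_TIME */
    {   32,    2,   {    1,       1,        0,         0      } },
    {   96,    6,   {    2,       2,        0,         0      } },
    {  256,   20,   {    2,       2,        1,         1      } },
};

/* At reschedule time a process follows the edge matching its recent
   behaviour, so its category quickly tracks its changing character. */
int recategorise(int cat, enum transition why) {
    return cat_table[cat].next[why];
}
```

A single table lookup replaces any elaborate priority calculation, and changing policy means changing data, not code.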
The rescheduled process may remain loaded if it is runnable, its new allocation fits in the available memory,
and no higher priority process is waiting. It must otherwise page itself out, and join a priority queue to wait
for memory. A process sleeping on an external event must likewise page itself out whilst waiting unless the
memory it occupies is not in demand. When it wakes, it joins the appropriate memory queue. A memory
scheduler process cycles through the priority queues using an auxiliary priority ratio table, loading each
process in turn as memory becomes available. (See the EMAS papers for details. Wilkes [Wil75] includes
the EMAS scheme in an interesting discussion of paging viewed as a problem in control theory.)
The notion of an `external event' is hard to sustain in modern UNIX systems. Instead, the priority of a
sleep is assumed to say something about the likely speed of response. When a process sleeps at high prior-
ity, it records the unused part of its quantum, adjusts its CPU priority (see below), and enters the sleep
queues as usual. A process sleeping at low priority first checks with the memory scheduler before entering
the sleep queues. If its memory is required now, it will page itself out, as in EMAS, but otherwise it puts
itself in a priority queue of processes that are `idle' in memory.
The new memory scheduler has a simple loop: it checks the `wait memory' queues in the order given by the
priority ratio table, and once it has found a process, it waits for memory to become available, as lower prior-
ity processes page themselves out. It also watches for the arrival of idle processes, and when memory is
needed causes them to page themselves out, starting with low priority processes, as soon as they have been
idle `long enough' (a scheduling parameter).
The original scheduler had the hardware clock interrupt increment a `time' field in each process descriptor
to decide how long a process had been in or out of memory, or sleeping. Berkeley moved this to a function
called periodically by a software clock, but it still required a scan of all living processes. Instead, in this
system, a process is simply time stamped when it enters any (FIFO) scheduling queue. (Currently the stamp
is the time in ticks since system startup.)
2.3. Improving UNIX implementation
The previous section gave an overview of the old and new paging systems. Let us look now at some areas
in more detail, to compare the complexity of the implementations.
The fork call makes a copy of the caller's memory image. As many have observed, this is often
wasteful, since in most cases the fork precedes an exec which throws the copy away! Paging everything
in to make the copy will also disrupt tracking of the working set. Fortunately, copying is just the
specification; we are free to implement it in the most efficient way. Many have suggested (and several have
implemented) sharing memory between parent and child until one of them writes to a page; the write is
intercepted, the page is copied, and the write is completed in a private copy of the page. On some machines
copy-on-write is hard or costly to implement, but copy-on-reference can be used instead and should be
nearly as effective in the most common case (speeding up fork followed by exec).
Berkeley nevertheless implemented fork by complete copy, and added a new system call, vfork.
The parent passes its page tables to the child. The parent is suspended, marked `no VM', and linked to the
child, which is marked `in vfork'. The child process returns from vfork and continues to run in the parent's
memory until exec or exit, at which time it wakes its parent, and waits. The parent grabs its memory
back, resets some pointers, and allows the child to proceed. (The `waits' above refers to the kernel
sleep/wakeup synchronisation system, not to the C library function.) Several parts of the system must
check for `in vfork' (including the terminal handler!). Certain system calls must synchronise with the
vfork protocol, and the pageout daemon must follow the vfork links to find the current user of the page if
its true owner is marked `no VM'. Thus fork is moderately expensive, and vfork is rather odd.
The new system does use copy on write. The fork call copies the swap maps, incrementing the swap
reference counts; copies the page tables setting `copy on write' state and read-only access; and increments a
core map reference count for valid pages (a page will not be put in the free list until the reference count
goes to zero). There can then be two kinds of page access faults. On a `write protect' error, a page that is
no longer shared can have its permissions restored; a shared page must be copied. In either case the page
becomes private to the process. (A private page also results from a `page missing' fault if the page is `fill
on demand'.) Including the code to implement reference counts on swap pages, no more code is required to
implement copy on write in this system than vfork took in the old.
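The two `write protect' fault cases might be sketched as follows. The names, structures, and toy page pool are all invented for illustration; the real code manipulates hardware page tables and the free list.

```c
/* Sketch of copy-on-write fault handling: a page still shared must be
   copied; one no longer shared just has its access restored. */
struct page { int refcount; };
struct cpte {                   /* a page table entry, much simplified */
    struct page *pg;
    int cow;                    /* marked copy-on-write by fork */
    int writable;
};

static struct page pool[4];     /* stands in for the free page list */
static int pool_next;

static struct page *page_alloc(void) { return &pool[pool_next++]; }
static void page_copy(struct page *dst, struct page *src) {
    (void)dst; (void)src;       /* contents copy elided in this sketch */
}

void write_fault(struct cpte *e) {
    if (!e->cow)
        return;                     /* a genuine protection error */
    if (e->pg->refcount == 1) {
        e->writable = 1;            /* last user: restore permissions */
    } else {
        struct page *np = page_alloc();
        page_copy(np, e->pg);
        e->pg->refcount--;          /* old page may still be shared */
        np->refcount = 1;
        e->pg = np;
        e->writable = 1;
    }
    e->cow = 0;                     /* either way, the page is now private */
}
```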
Page theft.
In the original system, the pageout daemon wanders round the core map stealing pages. Each
time, it must check the state of the process owning the page. If it is in a safe state, it temporarily maps in
the process's kernel data area to access the swap maps. If the page being stolen had been modified, the dae-
mon mapped it into its own VM and started asynchronous IO.
In the new system, nothing happens `asynchronously' to a process as regards the paging system, since it
either discards its own pages or must queue itself to allow the memory scheduler to do so. It simply does
neither if it is in an inconvenient state, or if scheduling policy does not yet require it. The new working set
scan routines
come to 51 lines including some optional event tracing calls
(except when a page must be freed, no functions are called during a scan), but many more lines of code
have been removed.
Reclaiming pages.
In either system, a process might lose or discard a page, then require it again before it
has been taken from the free list for use elsewhere. As mentioned above, the old system leaves the page
table entry non-zero to allow reclaim, but clears it as soon the page is taken for use elsewhere. It also has a
hash table, in which pages can be found using the pair (source-vnode,block) as a key.
In his system, Richard Miller [Mil84] had an array of disc block addresses, one per memory page, each
entry containing either zero or the current swap block address of the page. When a page table entry was
invalidated, the page frame number was left non-zero. Before reclaiming a page, the system checked that
the swap block address matched that expected by the current process. No action was necessary if the page
was claimed by another process. A similar scheme is used here (the swap address is kept in the core map,
for use by the pageout daemon). There is currently no need for a hash table. Each shared image (text) entry
provides a `master page table' that tracks the location of executable pages, which might page out from one
frame and page in elsewhere some time later (but which should still be shared).
A swap operation in the old system writes out all the modified pages in a process. There are
special support routines that find contiguous regions of dirty pages that are then split to fit in contiguous
regions of swap space. There are some others that, during page out, find modified memory pages that are
adjacent in virtual memory to the one selected by the page out daemon.
In the new system, a swap is just a complete process page out, which simply frees the process's memory;
clean pages go on the free list, and dirty ones go on the modified page list. If dirty page table entries are
adjacent, and they are freed in sequence, and at the same time, their associated pages will be adjacent on the
modified page list. The core map entries for the pages have the swap addresses attached to assist page
reclaim. The pageout daemon simply checks adjacent modified page list entries to see if the swap blocks
are adjacent and if so writes them out with one IO request. This gives the effect of a `partial swap' at little
cost. The same effect is obtained by a process that releases adjacent dirty pages during a working set scan.
The cost of a bad scheduling decision (eg, a sleeping process is swapped and immediately wakes up) is
reduced, since a process can reclaim pages from the modified page list before they are written. (The per-
process data and page tables are paged out and reclaimed in the same way.)
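The coalescing check might be sketched like this, with invented names; the stand-in swap_write simply counts IO requests rather than doing IO.

```c
#include <stddef.h>

/* Sketch of the `partial swap' effect: walk the modified-page list and
   merge runs of consecutive swap blocks into one IO request. */
enum { MAXRUN = 16 };

struct mpage {
    long swapblk;           /* swap address kept in the core map entry */
    struct mpage *next;
};

static int nwrites;
static void swap_write(long blk, int n) { (void)blk; (void)n; nwrites++; }

/* The pageout daemon's loop body: one write per run of adjacent blocks. */
void pageout(struct mpage *list) {
    while (list) {
        struct mpage *run = list;
        int n = 1;
        while (list->next && n < MAXRUN &&
               list->next->swapblk == list->swapblk + 1) {
            list = list->next;
            n++;
        }
        swap_write(run->swapblk, n);
        list = list->next;
    }
}
```

Pages freed in sequence land adjacent on the list, so a full or partial swap naturally collapses into a few large writes.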
User control.
The use of working sets should help to address some of the problems mentioned by
Rosenthal [Ros89], since the scheduler can more accurately track a process's memory requirements. Even
an overenthusiastic process can be kept within bounds, paging mainly against itself, so that on a single-user
machine for instance, a text editor running in a windowing system at the same time as such a process has a
better chance of remaining responsive compared to the old system.
Scheduling policy can be controlled in several ways, most simply by changing the category table, although
Stephens et al. [Ste80] do warn that the effects of doing so are not always easy to predict, and might require
experiment. The scheduling parameters and tables are accessible through special files in a directory,
making it easy to give workstation users control over the allocation of resources to
processes on their machines. Windowing processes can be given a set of categories disjoint from the others,
just as EMAS provided different sets of categories for interactive and batch processes.
2.4. Summary
The new paging system is just over half the size of the old one. I would claim that the new one is easier to
reason about, since the various processes involved interact only when each is in a well defined state. They
co-operate: a process does not barge into another's memory without knocking.
The first `file system switch' appeared in the 8th Edition. Several interesting file system types were imple-
mented using it: the network file system
[Wei84] and the process file system
[Kil84]. Others have
followed suit: the Generic File System [Rod86], and the Remote File System [Rif86] use (somewhat more
elaborate) file switches. Sun developed the Virtual File System (VFS) and vnodes [Kle86].
I had implemented my own impression of the 8th Edition file system switch on our VAX. (`My own ...'
because very few details had been published about the 8th Edition switch.) When we then moved to Suns, I
took several of the older file system types, and tried to adapt them to work in the VFS/vnode
system. I found it harder to use: it did no work for me! I did produce an acceptable solution to my immedi-
ate problems, but the VFS interfaces remain unsatisfactory. Closer study revealed some confusion and limi-
tations in its design and implementation. To see why, we must look a little closer at what these systems do.
3.1. A simple UNIX file system switch
Adding a file system switch in the 8th Edition style to the 7th Edition system is fairly easy to do. A value is
added to each mount table entry that tells which type of file system is mounted there. That value is used as
an index into a table of file system operations, and the parts of the mount table entry used only by the disc
file system are moved elsewhere. I found that extensive changes to the kernel source code could be avoided
by keeping the inode as the `generic file', so that existing references continue to work; the
disc-specific information goes elsewhere. This seems reasonable: an inode is an abstract file. The interface
to a file system is similar to the interface to character or block device handlers. Kernel functions such as
`read inode' are replaced by something like
void readi(struct inode *ip)
The function bodies are renamed and moved into files associated with the disc file system type; suitable
entries are made in the switch table. More work is required to add indirection to file name lookup and when
managing the `inode table' (iget, iput), and one might consider changing the calling conventions to hide
from the file systems the structure of the per-process data area, but this should give the basic idea.
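A minimal sketch of such a switch, with invented names and two toy file system types, might look like this:

```c
/* Sketch of a file system switch: a per-type table of operations,
   indexed by a tag carried with the mount entry (here, in the inode).
   All names are invented for illustration. */
struct inode { int fstype; };          /* the `generic file' */

struct fsops {
    void (*readi)(struct inode *ip);   /* `read inode' and friends become */
    void (*writei)(struct inode *ip);  /* indirect calls through the table */
};

/* Two toy file system types: a disc file system and a synthetic one. */
static int calls[2];
static void dk_readi(struct inode *ip)   { (void)ip; calls[0]++; }
static void dk_writei(struct inode *ip)  { (void)ip; }
static void syn_readi(struct inode *ip)  { (void)ip; calls[1]++; }
static void syn_writei(struct inode *ip) { (void)ip; }

static const struct fsops fstab[] = {
    { dk_readi,  dk_writei  },
    { syn_readi, syn_writei },
};

/* The kernel's readi() no longer knows which file system it is using. */
void readi(struct inode *ip) {
    fstab[ip->fstype].readi(ip);
}
```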
Within any file system, there have always been different kinds of files: directories, devices, sockets, FIFOs,
symbolic links, and the `regular' file. In 7th Edition, each file is tagged with a value telling its kind (IFDIR,
IFREG, etc). The internal versions of read and write can be applied to any kind of file. Reading a regular
file yields its data, but reading a symbolic link produces a file name, and reading a directory gives a list of
(inode-number, name) pairs in some format. The internal version of write can also insert an
(inode-number, name) pair into a directory; it thus creates a link.
The same scheme can be used with the file system switch. The implementation of `read' for the file system
will look at the kind of file to decide what to do. Note that all file systems must agree on a standard format
in which they will exchange directory entries with the kernel during read and write. This need not be
related to the directory format that the file system uses internally. For example, the process file system
makes up directory entries `on the fly' out of the contents of the kernel's process table. It is not necessary
to add a special
`readdir' or `getdirents' call.
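A sketch of such an exchange record follows; the layout and the 14-byte name limit are invented here, for illustration only.

```c
#include <string.h>

/* Sketch of a standard exchange record: whatever a file system's internal
   layout, reading a directory yields (inode-number, name) records in one
   agreed form. */
struct xdirent {
    long d_ino;
    char d_name[14];
};

/* A file system inventing entries `on the fly' fills in the same record;
   the return value is the number of bytes the caller's read consumes. */
int gen_dirent(struct xdirent *d, long ino, const char *name) {
    if (strlen(name) >= sizeof d->d_name)
        return -1;              /* name will not fit the agreed format */
    d->d_ino = ino;
    strcpy(d->d_name, name);
    return (int)sizeof *d;
}
```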
Implementation in this way puts the file system switch below the bulk of the UNIX system call/file system
interface code. The job of implementing UNIX semantics (pathnames, checking permissions, managing
times, inode data structure locking, etc.) is mostly done above that. The file system writer can concentrate
on the rewarding job of obtaining the data to make curious things look like files and directories to processes.

3.2. VFS and vnodes
At first glance the VFS/vnode system is roughly similar to the above. The VFS and vnode operations struc-
tures are different types. The file system independent part of the VFS system implements the file system
name space, including mount points; it interprets path names. Once a particular file system has yielded its
root vnode to the VFS code when it is mounted in the file system hierarchy, or when it is crossed at a mount
point, it plays no further part in the interpretation of operations on files in the file system it represents. They
are directed through the operation structure attached to the vnode; the vnode operation set is extensive.
Sun's system is avowedly `object oriented': the group of function pointers linking to the operations is
attached directly to the data structure or `object' representing the file system (VFS) or the file (vnode). One
might then expect each vnode to accept just the operations it implements: the operation structure attached to
a `directory' in some file system would accept `lookup', `rename', and other such directory requests, but its
entry for `readlink' would point to an error routine. Similarly, a `symbolic link' would reject many other
operations but accept `readlink'. That is not what happens. In all the file systems in SunOS 3.5, every
vnode in a given file system has the same operation set, which accepts any operation that is allowed by
some kind of file in the file system. Each operation then checks a `file kind' tag to see what sort of file it is,
just like the other systems. (Indeed, it is sometimes checked twice: at the system call level, and by the
vnode operation. The second check is not redundant, since the NFS server takes a shortcut, bypassing the
first check by calling vnode operations directly.) As things stand, the operations are not really associated
with the vnode at all, but with the collection of vnodes inside a particular file system, and the contents of the
operation set are known statically. Putting the pointer in the vnode itself is just an optimisation.
3.3. Adding UNIX semantics
The VFS/vnode system is said to ``support (but not require) UNIX file system access semantics'' [Kle86].
This is accurate; the use of the phrase `access semantics' is telling. On a system call, the VFS code looks
up parameter pathname(s) to obtain a vnode (or two). After that, unlike the older file system switch, VFS in
most cases does little more than change the representation of its parameters and call the vnode routines. All
the file system types that I wished to add were UNIX file system types, but it appeared each would have to
maintain its own UNIX file system state and behaviour in the handling of user and group IDs, permissions
checking, and the like. Scanning the code of the existing file systems confirmed this impression, and also
revealed repetitions of significant code sequences. For instance, several file system types all contained
code to do IO through the buffer cache, and some separately implemented the checks of certain
system calls.
What a good way to get several (slightly different) copies of UNIX!
On the VAX system, the only file system types were UNIX ones, and so UNIX semantics had been imple-
mented just once above the file systems by the system call level. I decided to provide a file system `base
implementation' that would support UNIX file system implementations. The kernel continues to operate on
vnodes using Sun's interface and operations, although the VFS level was changed to provide vnode locking
primitives, as a service to be used as required by the lower levels. Most of the real work is done by a new
file system type.
This base file type attends to the standard UNIX semantics for files: maintenance of user and group IDs,
protections, file mode, file times, and size; provision of a basic set of attribute values (`getattr');
and a basic protection policy. It simplifies the implementation of several other file system types, among
them one that allows a directory structure and files to be created in main memory, and some others to be
described presently. It currently does not handle the 4.2 disc file system type.
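A sketch of what the `base implementation' argument amounts to: the classical owner/group/other permission check written once and shared by every UNIX file system type, instead of repeated in each. The structures and the name base_access are invented for illustration, and the check is simplified (for instance, the super-user case here ignores execute bits).

```c
#include <assert.h>

/* Invented sketch: classical UNIX permission checking done once in a
 * shared `base' layer rather than per file system type. */

struct uattr {                     /* attributes the base type maintains */
    unsigned short ua_uid, ua_gid;
    unsigned short ua_mode;        /* low nine bits: rwxrwxrwx */
};

struct ucred { unsigned short cr_uid, cr_gid; };

/* want: mask of 4 (read), 2 (write), 1 (execute); returns 0 on success. */
static int base_access(struct uattr *ap, struct ucred *cp, unsigned want)
{
    unsigned m = ap->ua_mode;
    if (cp->cr_uid == 0)
        return 0;                  /* super-user passes (simplified) */
    if (cp->cr_uid == ap->ua_uid)
        m >>= 6;                   /* use the owner bits */
    else if (cp->cr_gid == ap->ua_gid)
        m >>= 3;                   /* use the group bits */
    return ((m & 7 & want) == want) ? 0 : -1;
}
```

Each concrete file system type would then call this helper rather than carrying its own copy, which is precisely the duplication the scan of the existing code revealed.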
It also supports several types of anonymous vnodes, file types that lack names in the file system hierarchy.
Among them, two stream types provide vnodes for all stream IO (not just IO on pipes), one subtype differing
from the other mainly in the open/close protocol; there are also socket and device types. The socket and
device types need more explanation.
3.4. Reconsidering socket and device switches
In 7th Edition, the system's `open file' structure referred directly to the inode corresponding to that file. In
4.2BSD a level of indirection was added, to direct IO on a file descriptor either to the socket IO system or to
the (inode) file system. The set of operations included read/write, close, ioctl and select, precisely a subset
of the vnode operations added later, yet both mechanisms remained in the Sun system after the VFS was
added at the next level down. The socket type provides suitable vnodes for IO on sockets, and has replaced
the older two-tier mechanism. Having all files in the system accessible through the file system interface
(inode or vnode) makes it more likely that applications can use any kind of file in most contexts.

A character device is represented by open, close, read, write, ioctl and select operations selected by indirec-
tion through a `device switch'. Since vnodes provide a superset of these operations and the operation set
can be unique to the vnode, perhaps the device switch is redundant.
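A user-space caricature of that redundancy: the classical switch selects a handler set by major device number, while a per-vnode operations pointer reaches the same handlers directly, so neither the switch nor the major number is needed for dispatch. All names here are invented.

```c
#include <assert.h>

/* Caricature of the character-device switch versus per-vnode operations.
 * Names are invented; real switches carry open/close/ioctl/select too. */

struct dev_ops { int (*d_read)(void); };

static int mem_read(void) { return 1; }     /* stand-in handlers */
static int tty_read(void) { return 2; }

/* The switch: an extra level of indirection keyed by major number. */
static struct dev_ops cdevsw_model[] = { { mem_read }, { tty_read } };

/* A device vnode pointing straight at its handler set. */
struct devvnode { struct dev_ops *v_ops; };

static int vnode_read(struct devvnode *vp) { return vp->v_ops->d_read(); }
```

Once every device vnode carries its own operations pointer, the major number survives only as a way of *building* that pointer, which is what the device file system described next supplies.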
The (experimental) device file system can potentially replace the device switches. It provides a device name
space (hierarchy) derived automatically from the machine's configuration. Each file in it has its own set of
operations that refer to the appropriate device handler routines. Multiplexors are represented by directories
that generate channel vnodes on demand. (This might serve to eliminate major/minor device numbers from
the system, removing the need to make them larger on big systems.) It differs from Sun's special devices
file system in that it provides a mountable name space: Sun's file system makes a vnode for a device,
whereas here the vnodes are the devices (or pseudo-devices). It is an experimental file system because I am
uncertain how best to handle the file attributes. Originally in UNIX, all file systems were on disc, saving
the permissions, ownership, and file times of device inodes over a reboot. Currently, because the device
hierarchy is virtual, without persistent store, attributes on some files must be set explicitly when the
default attributes are not acceptable. This might turn out to be the best way.
3.5. Summary
I intended not so much to reduce the cost or size of the original implementation but rather to investigate its
structure and provide a framework in which new (UNIX) file system types can be added easily and cheaply.
The separation of responsibilities between the system call and vnode layers, the split between file system
types and file types, and the location of error checking all seem unsatisfactory. It seems ridiculous to me
that a UNIX system would provide so little support for UNIX semantics in the file system (the heart of
UNIX) that such things as permissions checking should be implemented several times. I question the value
of the current vnode system as a cost-effective tool when building UNIX file system implementations.
4.1. Streams (and STREAMS)
The stream IO system was originally developed by Dennis Ritchie for the 8th Edition [Rit84]. It was later
adopted by System V [Dun86]. I implemented a stream IO system on our VAX in 1985, because we were
installing various kinds of communications and networking facilities, and I thought stream IO would help to
make sense of it all. Sockets would also have been available then, but streams had the advantage of being
much more than just a networking system. Stream IO is an elegant abstraction which is also extremely use-
ful! What characteristics appealed to me?
- a stream is a full-duplex buffered channel into the kernel
- a stream contains a sequence of bytes punctuated by delimiters
- data is transformed only by processing modules on the stream
- data is read and written using read and write
- there are essentially no options
These help to make streams general in their application, in the same way that any data can be written in any
format to a UNIX file, or sent through a UNIX pipeline. By contrast, sockets provide options for
datagrams, streams, sequenced packets, and a `raw' interface: they clearly provide something far more
specialised.
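The delimiter idea in particular is easy to model in user space: a write may mark a message boundary, and a read stops at the first boundary it meets, yet the data remain an undifferentiated byte sequence. Everything below is an invented sketch, not kernel code; the real kernel queues are of course more elaborate.

```c
#include <assert.h>
#include <string.h>

/* Toy model of a stream queue: bytes punctuated by delimiters.
 * Names are invented; no flow control or blocking is modelled. */

#define QMAX 256

struct squeue {
    char buf[QMAX];
    unsigned char delim[QMAX];   /* 1 => a delimiter follows buf[i] */
    int head, tail;
};

/* Append n bytes; if mark is set, delimit the end of this message. */
static void s_write(struct squeue *q, const char *p, int n, int mark)
{
    memcpy(q->buf + q->tail, p, n);
    memset(q->delim + q->tail, 0, n);
    q->tail += n;
    if (mark && q->tail > 0)
        q->delim[q->tail - 1] = 1;
}

/* Read at most n bytes, stopping after a delimiter, as read(2) on a
 * delimited stream would. */
static int s_read(struct squeue *q, char *p, int n)
{
    int i = 0;
    while (i < n && q->head < q->tail) {
        p[i++] = q->buf[q->head];
        if (q->delim[q->head++])
            break;
    }
    return i;
}
```

Note that a reader who ignores delimiters still sees a plain byte sequence; the `no options' property falls out of keeping the mechanism this small.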
There are two ways of dealing with networks in UNIX using streams: make the complex network look like
a collection of simple streams [Pre85, Pre86], or change the stream system to look more like a network.
AT&T took the latter approach when the stream IO system became ``STREAMS technology'' in System V.
It acquired some new features: multiplexor drivers; the contents of a stream are now a sequence of messages,
which are themselves sequences of bytes; and there are now options at the stream head, including IO modes.
IO modes include `message discard', `message non-discard', and `byte stream': they provide different ways
of reading the data on the stream.
The result is something of a cross between the original stream system and sockets. The basic STREAMS
implementation is consequently somewhat larger than mine: the support for streams, queues, and buffering
takes 29,640 bytes in SunOS 4.0, compared to my system's 9,180 bytes. (The STREAMS figure does not
include support added recently for independent `out of band' message sequences within one stream [Rag89].)

4.2. Streams in SunOS
Last year, I installed a revised version of my streams system on our Sun system. Ritchie found that his kernel
became appreciably simpler, smaller, and more useful when streams replaced some character devices
and line disciplines. I rather hoped that might happen here. First, I replaced pipes by full-duplex streams,
and FIFOs (named pipes) by half-duplex streams. Then, I began to convert character IO device drivers to
streams. I wrote a tty editing module, based on Jim McKie's `EUUG' tty driver. Some user-level programs
were changed either to take advantage of the new structure, or to do work that the kernel once did. For
example, a program must now push the tty editing module onto a terminal stream after opening it.
Berkeley pseudo-terminals were discarded, and programs that used them were modified to use the appropriate
streams mechanisms instead.
After each stage the new kernel was indeed smaller, even though some new functionality had been added
(for instance, `message' and `connection' modules) [Pre85]. Certainly, one hopes for this result when
replacing an accumulation of many special cases and conventions by something with a clean, general
model. Much of the saving comes from eliminating device and line discipline `read' and `write' interface
routines (the streams code does the work).
Sun added STREAMS to SunOS 4.0. In every case the comparable STREAMS versions have grown
slightly larger, not smaller, than the components they replace. (The compatibility module is smaller
than the old tty driver, but it simply translates the old ioctl calls for the new tty module, which
is larger.) The increase in the size of the tty modules on conversion to STREAMS is possibly because the
new ones implement all the System V conventions. Unfortunately, some of those were artifacts of the old
character IO system, and were overdue for replacement.
A good example is the MIN/TIME facility of the old terminal driver [Mic88]. It takes a full A4 page of
typeset text to describe the effect of different combinations of MIN and TIME values, and the restrictions!
One case provided buffering on a bursty asynchronous line: the driver waits until characters
stop arriving frequently enough or a minimum number accumulates. Another allowed a process to wait
until at least one character arrived or a timer expired. The first case could now be handled by pushing onto
the stream a buffering module similar to one described in section 4 of the Ninth Edition manual [Lab86],
which can be applied to any stream, not just one with tty editing. (In fact, some programs that currently
use MIN and TIME do not want tty editing at all.) The second looks like an application for an ordinary
timeout set by the program. Instead, new mechanisms were added
at the stream head to support the feature.
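Assuming the facility described is the termio driver's MIN and TIME read parameters (the behaviour above matches their documented semantics), the combinations reduce to four regimes. The classifier below is a model of that documentation, not driver code, and its names are mine; even so, it hints at why a full page was needed once the restrictions were added.

```c
#include <assert.h>

/* The four MIN/TIME regimes of the old terminal driver, reduced to a
 * classifier.  Illustrative model only; enum names are invented. */

enum rd_behaviour {
    RD_POLL,        /* MIN = 0, TIME = 0: return whatever is there */
    RD_TIMEOUT,     /* MIN = 0, TIME > 0: first byte or overall timeout */
    RD_COUNT,       /* MIN > 0, TIME = 0: block until MIN bytes arrive */
    RD_INTERBYTE    /* MIN > 0, TIME > 0: MIN bytes or interbyte gap */
};

static enum rd_behaviour classify(int min, int time)
{
    if (min > 0)
        return time > 0 ? RD_INTERBYTE : RD_COUNT;
    return time > 0 ? RD_TIMEOUT : RD_POLL;
}
```

Each regime changes what a single read(2) means; a buffering module and a program-level timeout would have separated the two useful cases instead of overloading one pair of parameters.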
Streams also allow (indeed encourage) modules to be composed freely. Thus with streams, a Berkeley
pseudo-terminal can be replaced by a pipe with message and tty modules installed appropriately. Neither
module knows of the other's existence. (We implement simple text windows on our Sun 3/50s in this way;
a more useful `editing' shell window needs only the message module.) Of course, the particular application
program must be revised accordingly, since the interfaces are radically different. In SunOS, the
pseudo-terminals were instead `converted' to streams, but their rather intricate nature required the use
of an `intra-stream' protocol between the pseudo-terminal parts and the new tty module.
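The composition argument can be illustrated with a toy module stack: a framing module and an editing module each transform the data passing their point on the stream, in complete ignorance of one another. Both modules and their behaviour are invented for illustration; real stream modules are full-duplex and queue-based.

```c
#include <assert.h>
#include <string.h>

/* Toy model of freely composable stream modules.  Each module is a pure
 * transformation and knows nothing of its neighbours.  Invented names. */

typedef int (*module_fn)(char *buf, int n);

/* A toy `tty' module: apply erase ('#') processing to input. */
static int tty_erase(char *buf, int n)
{
    int i, out = 0;
    for (i = 0; i < n; i++) {
        if (buf[i] == '#') {
            if (out > 0)
                out--;             /* erase the previous character */
        } else
            buf[out++] = buf[i];
    }
    return out;
}

/* A toy `message' module: strip a one-byte frame header, oblivious
 * to whatever processing follows it. */
static int msg_strip(char *buf, int n)
{
    if (n > 0)
        memmove(buf, buf + 1, (size_t)--n);
    return n;
}

/* Pass data through a stack of modules, as a stream would. */
static int run_stack(module_fn *stack, int depth, char *buf, int n)
{
    int i;
    for (i = 0; i < depth; i++)
        n = stack[i](buf, n);
    return n;
}
```

Stacking msg_strip above tty_erase gives the pseudo-terminal replacement described above; removing either module requires no change to the other, which is exactly the property the intra-stream protocol forfeits.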
4.3. Conclusion
There is no point in trying blindly to reproduce or even to emulate all of the old mechanisms using the
stream primitives, since the advantages of streams are quickly diminished, and the original point of adding
them lost. `Conversion' to streams provides a good opportunity to retire some of the older ideas that have
seen their best days (when they were the only way to do something). For instance, the Berkeley pseudo-
terminals could have been left in the system just as they were for compatibility, for a time, while people
converted their programs to use the new streams facilities suitably. Eventually we could toss the old rubbish
away, and be left with a clean, coherent system. Instead, we end up with new bugs in the old things, but
old bugs in the new things!
A common theme of the paging and streams sections above is `re-implementation'. Many versions of
UNIX were mainly ported and adapted to virtual memory machines, not thoroughly re-implemented. What
is the distinction?
By re-implementation, we mean re-programming and altering the structure of an existing sys-
tem where necessary, so as to better implement the function of that system. We distinguish re-
implementation from two other activities. These are the transportation of an existing system to
new hardware, and the design of a new system. If we transport a system, we leave its function
and its structure unaltered. Re-implementation is not merely a combination of transportation
and re-design, but is a unique activity in its own right.
It is doing it again, but very well this time. [She77] [my italics]
Re-implementation treats the old version of the system as a prototype of the desired system. Functions may
be redistributed, added, or dropped. At no point, however, is the intent to provide a base for a completely
different system: all the mechanisms are honed to implement precisely the key ideas of the original system.
For example, Andrew Hume's mk [Hum87] is a very good re-implementation of make: the underlying
model and syntax of the two programs are obviously closely related, but in the new program the old ideas
have been carefully refined, and it interacts more smoothly with the surrounding environment. In contrast,
both Mach [Acc86] and Amoeba [Mul89] are new systems that can emulate the UNIX environment well
enough to run UNIX programs, but the UNIX interface is not necessarily the best way to use them.
I think the essential ideas of UNIX are still worthwhile, but I also think it will take a good re-
implementation to bring new life to them.
Although recent versions of UNIX have finally adopted some of the clever ideas from `research' and elsewhere,
this evidently has not resulted in a great reduction in system size or complexity. The new mechanisms
have been modified (or mangled) to emulate the clumsy features they were intended to replace. It is
important to keep a sense of proportion when deciding how much `backwards compatibility' to maintain.
It is dangerous to place too much hope in any improvement coming from just following new fashions, if we
lack insight into what really went wrong before. Without that insight, I suspect that rewriting UNIX in
C++, for example, could easily become an excuse for increasing complexity (because by using C++ `we can
handle more complexity').
The system we use has been put together using a traditional UNIX approach: make a careful selection of
mechanisms, and if possible, steal good ideas from elsewhere, but ensure they work together and fit the
underlying system structure. The result is a more powerful system which is less greedy for memory (and
my time). It still leaves room for Thompson's ``much-needed gap'' [Tho75].
Discussions with Martin Atkins about the state of UNIX and software in general have been particularly
profitable, and he made very helpful comments on drafts of this paper. Martin, Peter Whysall and several
others in the Department have been `friendly users' of this software, made many good suggestions, and
have given much encouragement. Dennis Ritchie very kindly sent a copy of the Ninth Edition manual.
Acc86. M. Accetta, R. Baron, W. Bolosky, D. Golub, R. Rashid, A. Tevanian and M. Young, ``Mach: A
New Kernel Foundation for UNIX Development'',
USENIX Association Conference Proceedings
Atlanta, Georgia, pp. 93-113 (Summer 1986).
Akh87. P. Akhtar, ``A Replacement for Berkeley Memory Management'',
USENIX Association Conference Proceedings
, Phoenix, Arizona, pp. 69-80 (Summer 1987).
Bra89. A. Braunstein, M. Riley and J. Wilkes, ``Improving the efficiency of UNIX file buffer caches'',
Operating Systems Review
23(5), pp. 71-82 (1989).
Dun86. B. Duncanson, K. F. Storm and A. Facius, ``EUUG Technical Session Report: The Streams facility
in UNIX System V rel. 3'',
European UNIX User Group Newsletter
(1), p. 13, European UNIX System User Group Meeting, Bella Center, Copenhagen, 10th-13th September 1985 (Spring 1986).
Hum87. A. Hume, ``Mk: A Successor to Make'',
USENIX Association Conference Proceedings
, Phoenix,
Arizona, pp. 445-458 (Summer 1987).
Kil84. T. J. Killian, ``Processes as Files'',
USENIX Association Conference Proceedings
, Salt Lake City,
Utah, pp. 203-207 (Summer 1984).
Kle86. S. R. Kleiman, ``Vnodes: An Architecture for Multiple File System Types in Sun UNIX'',
USENIX Association Conference Proceedings
, Atlanta, Georgia, pp. 238-247 (Summer 1986).
Lab86. AT&T Bell Laboratories, in
UNIX Programmer's Manual, Ninth Edition, Volume One
, ed. M. D.
McIlroy, Murray Hill, New Jersey (September 1986).

Lef89. S. J. Leffler, M. K. McKusick, M. J. Karels and J. S. Quarterman,
The Design and Implementation
of the 4.3BSD UNIX Operating System
, Addison-Wesley, New York (1989).
Mic88. Sun Microsystems, p. 1307 in
UNIX Interface Reference Manual, SunOS 4.0
, Mountain View,
California (May 1988).
Mil84. R. Miller, ``A Demand Paging Virtual Memory Manager for System V'',
USENIX Association
Conference Proceedings
, Salt Lake City, Utah, pp. 178-182 (Summer 1984).
Mul89. S. J. Mullender, ``Amoeba - High-Performance Distributed Computing'',
European UNIX Systems
User Group Spring Conference
, Brussels, Belgium, pp. 17-26 (April 1989).
Pik83. R. Pike, ``Graphics in Overlapping Bitmap Layers'',
ACM Transactions on Graphics
2(2), pp. 135-160 (April 1983).
Pik84. R. Pike, ``The Blit: A Multiplexed Graphics Terminal'',
AT & T Bell Laboratories Technical Journal 63(8 part 2), pp. 1607-1632 (October 1984).
Pre85. D. L. Presotto and D. M. Ritchie, ``Interprocess Communication in the 8th Edition UNIX System'',
USENIX Association Conference Proceedings
, Portland, Oregon, pp. 309-316 (Summer 1985).
Pre86. D. L. Presotto, ``The Eighth Edition Unix Connection Service'',
European UNIX Systems User
Group Autumn Conference
, Florence, Italy, pp. 1-9 (April 1986).
Rag89. Stephen Rago, ``Out-Of-Band Communications in STREAMS'',
Proc. 1989 Summer USENIX Conference, San Francisco, California (June 1989).
Ree81. D. J. Rees, ``The Structure of the EMAS 2900 Kernel'', CSR-91-81, University of Edinburgh, Dept
of Computer Science (August 1981).
Rif86. A. P. Rifkin, M. P. Forbes, R. L. Hamilton, M. Sabrio, S. Shah and K. Yueh, ``RFS Architectural
Overview'', European UNIX User Group Newsletter (2), pp. 13-23 (Summer 1986).
Rit84. D. M. Ritchie, ``A Stream Input-Output System'',
AT & T Bell Laboratories Technical Journal
63(8 part 2), pp. 1897-1910 (October 1984).
Rod86. R. Rodriguez, M. Koehler and R. Hyde, ``The Generic File System'',
USENIX Association Conference Proceedings
, Atlanta, Georgia, pp. 260-269 (Summer 1986).
Ros89. D. Rosenthal, ``More Haste, Less Speed'',
European UNIX Systems User Group Spring Conference, Brussels, Belgium, pp. 123-130 (April 1989).
She74. N. A. Shelness, P. D. Stephens and H. Whitfield, ``The Edinburgh Multi-Access System Scheduling
and Allocation Procedures in the Resident Supervisor'', EMAS Report 4, University of Edinburgh,
Department of Computer Science (April 1974).
She77. N. H. Shelness, D. J. Rees, P. D. Stephens and J. K. Yarwood, ``An Experiment in Doing it Again,
But Very Well This Time'', CSR-18-77, University of Edinburgh, Dept of Computer Science
(December 1977).
Ste80. D. J. Rees and P. D. Stephens, ``The Kernel of the EMAS 2900 Operating System'', SOFTWARE -
Practice and Experience 12(7), pp. 655-668 (July 1982).
Str86. J. H. Straathof, A. K. Thareja and A. K. Agrawala, ``UNIX Scheduling for Large Systems'',
USENIX Association Conference Proceedings
, Denver, Colorado, pp. 111-139 (Winter 1986).
Tho75. K. Thompson, ``The UNIX Command Language'', pp. 375-384 in
Structured Programming-
Infotech State of the Art Report
, Infotech International Ltd., Nicholson House, Maidenhead,
Berkshire, England (March 1975).
Wei84. P. J. Weinberger, ``The Version 8 Network File System (Abstract)'',
USENIX Association Conference Proceedings
, Salt Lake City, Utah, p. 86 (Summer 1984).
Wil75. M. V. Wilkes,
Time-Sharing Computer Systems, Third Edition
, Macdonald and Jane's, London