THE FUTURE OF THE EMAS ARCHIVE

Abstract

This is a discussion document on possible changes to the Emas Archive; it is important to note that no decisions have yet been made and there is still time for other proposals to be submitted and for further feedback from user consultations to be considered fully. It provides some history and background to refresh people's minds, but its main purpose is to highlight the problems associated with an ever-growing, though popular, "user-friendly" archive filestore held on a physical medium which is cheap rather than ideal.

The following is a summary of the main issues raised in the text:

1. The main problem is the physical size of the magnetic tape based Archive filestore, and we postulate that it should not be allowed to grow at its present rate indefinitely.

2. Two basic proposals are made in an attempt to address this problem:

   a. It is proposed that the Emas-2 Archive is not automatically perpetuated into an Emas-3 Archive by something equivalent to the "4-75 to 2900 transition compaction" which we performed in 1980.

   b. It is suggested that the indirect method of adding files to the Archive filestore is stopped on Emas-3 (this method results from the weekly tidy of the on-line discs, when inactive files are removed to make way for the following week's growth).

3. The consequences of these basic proposals are considered and alternative strategies for ameliorating their impact are discussed briefly.

Notation

- Archive, Backup etc. are Emas concepts
- ARCHIVE, RESTORE, RETRIEVE etc. are Emas commands
- Emas2900, Emas-2, and Emas-3 are operating system generations
- EMAS, BUSH, EMAS-A, etc. are Emas services (hosts)

Background

On-line files fall into two categories, cherished or uncherished (hazarded). Users themselves control which of their files fall into each category.
If a file is important to a user he can "cherish" it, which ensures that the System takes copies of it at regular intervals and holds these copies on the separate medium of "off-line" magnetic tapes. The accumulation of such files for all users constitutes the Emas Backup filestore. (It should be noted that the Backup filestore is essentially transient, in that the magnetic tapes are held in a "cycle" and their contents are overwritten with more up-to-date data every few weeks.)

If a user file is destroyed as a result of hardware or system software corruption, the Backup is automatically checked to see if the file was cherished and, if it was, the file is reloaded into the user's on-line index again.

Currently, a "checkpoint" backup is taken once a week (called the Weekly Backup), when all cherished files are copied to the Backup filestore. Also, an "incremental" backup is taken once each working day (called the Daily Backup), when all cherished files which have been altered since the last backup are again copied to the Backup filestore.

Although users themselves cannot retrieve files from Backup, they can request Operations staff to reload a cherished file which they themselves have corrupted or accidentally destroyed. ERCC attempts to cover a period of about 4 weeks with the Backup tape cycle (i.e. four complete checkpoint and incremental cycles), although only the last set should ever be required for system purposes.

User files which have longer term significance can be moved from the on-line filestore to the Emas Archive filestore. This is also held on the medium of magnetic tape, but here the tapes are not recycled as they are in the Backup system. Files can be moved to Archive by one of two methods. A user can mark which files he wishes to be "archived" by a specific call of the Emas ARCHIVE command; he can do this for cherished or hazarded files.
However, files can also "drift" out to Archive as a result of the mechanism which is employed to keep the on-line disc filestore from being constantly clogged up ("FSYS full") with user files which are no longer "active" (i.e. currently being accessed). At present, once a week, all files which have not been accessed for a period of 28 days are removed from the on-line filestore (i.e. are "destroyed"). However, if a file being so deleted also happens to have been cherished, it is first of all copied to the Archive filestore (on the basis that 4 weeks ago the user thought it important enough to be protected by the Backup system).

Currently each file moved to the Archive filestore is copied on to 2 separate magnetic tapes; one copy is retained in the computer room so that the user can "restore" any of his archived files at short notice, and the other copy is held in a different building for fire security reasons.

History

The Emas Archive was initiated on 15th December 1972 and has grown steadily ever since. It has proved popular with users as a repository for data which "might be useful someday".

Between 1975 and 1980 a yearly "compaction" of the Archive filestore was done in order to limit the growth of the physical medium and also to ensure the long term readability of files on magnetic tape (this was at a time when we were not sure whether long term readability was a problem or not - the evidence so far is that it is not a serious problem). The last major compaction was carried out on the move of the Emas Archive to the ICL 2900 series machines on the closure of the original 4-75 service in June 1980. The Emas 2900 Archive was begun in 1979 on the interim 2970 Service and this was merged with the converted 4-75 data in July 1980.
Since then the Emas Archive has grown on the BUSH (2980/2988) and EMAS (2972/2976) services separately, but is regarded as one integrated Archive; we can move users' indexes between the two computers and retain full access to their archive material, and we have taken care that at least one copy of the Archive can be read on the "other" machine.

The intention for the past two years was to do another major compaction on the move to the ICL Atlas computer, which was our first choice to replace the dual 2976 configuration. However, the Computer Board's decision to give us the smaller Amdahl V7 in addition to the 2976s has led us to rethink our archiving strategy, since it means that the Archive will now grow on three services until at least 1986, probably 1987, and just possibly 1990.

Physical Medium

Magnetic tape was chosen as the medium for the Archive store because tape drives already existed on mainframes and tape reels were relatively cheap. It is not necessarily the best medium, and its disadvantage is the storage space required for an indefinitely growing Archive.

Manufacturers have marketed special devices for archive material, but these have been rather expensive ($100000 to $250000) and often have too high a capacity for the amounts of data that a University is likely to accumulate; the Emas Archive currently holds over 27000 megabytes (27 gigabytes) of non-discarded data.

It had been hoped that optical disc technology would have advanced enough by this time to reduce prices to values comparable with magnetic tape equipment, but this still has not happened and it is now unlikely that we will be able to purchase suitable discs in the near future.

Consequently, our future strategy must be based on the Amdahl V7's tape provision of four 125 inches/sec decks recording at 6250bpi (bits per inch). The 2988 already has 6250bpi capability but the 2976 has not.
It is our intention to install a minimal 6250bpi provision on the 2976 this Summer to replace our current aged 1600bpi equipment, and to have this shareable with the 2988's decks when it is housed at JCMB in September. However, it is proving difficult to obtain suitable equipment (second-hand or new) at present.

The Problem

Simply stated, the overall problem is that we have a growing number of (already too many) archive tapes and it is not felt practical to go through a major compaction exercise again. We currently have some 2500 tape pairs and the Archive filestore increases by between 6 and 10 pairs per week, and at present we must still plan for a steadily increasing archive requirement on three separate services.

A Solution

The simple solution to the above problem is to let the Emas 2900 Archive "die" (with appropriate safeguards) with the departure of each ICL 2900 provision and to slow down the growth of the Emas-3 Archive (in an acceptable way).

1. How practical and how acceptable is this?

2. What has to be done to "soften" the effect of such a strategy?

Proposal for future Archive Strategy

The proposal for Emas-3 is to return to the original interpretations of the "archive" and "cherish" markers. In particular, "cherish" would no longer have a "backdoor" effect on the Archive filestore; that is, cherished files which have been "unused for 28 days" would no longer be added to the Archive. Only files positively marked via ARCHIVE would be written to the Emas-3 Archive.

The obvious problem with this is that users have become used to the passive addition of unused cherished files to Archive. It is, however, a practical proposal in that we know that many, indeed most, of such files are never restored again.
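As an illustration only, the difference between the present rule and the proposed one for adding files to the Archive can be sketched in a few lines of Python. The record layout and the names OnlineFile, marked_for_archive and goes_to_archive are hypothetical and do not correspond to any Emas structure or command; reading "unused for 28 days" as "last accessed 28 or more days ago" is likewise an assumption.

```python
from dataclasses import dataclass
from datetime import date

INACTIVE_DAYS = 28  # the "unused for 28 days" period quoted in the text


@dataclass
class OnlineFile:  # hypothetical record, not an Emas structure
    name: str
    last_accessed: date
    cherished: bool = False
    marked_for_archive: bool = False  # the user has issued ARCHIVE for this file


def goes_to_archive(f: OnlineFile, today: date, proposed: bool) -> bool:
    """Return True if the file would be written to the Archive filestore."""
    if f.marked_for_archive:
        return True  # explicit route, unchanged by the proposal
    if proposed:
        return False  # proposed rule: no indirect additions
    # Present rule: an inactive cherished file drifts out at the weekly tidy.
    inactive = (today - f.last_accessed).days >= INACTIVE_DAYS
    return f.cherished and inactive


today = date(1985, 5, 24)
files = [
    OnlineFile("report", date(1985, 4, 1), cherished=True),
    OnlineFile("thesis", date(1985, 5, 20), cherished=True, marked_for_archive=True),
    OnlineFile("scratch", date(1985, 3, 1)),
]
present = [f.name for f in files if goes_to_archive(f, today, proposed=False)]
future = [f.name for f in files if goes_to_archive(f, today, proposed=True)]
print("present rule:", present)
print("proposed rule:", future)
```

In this made-up example the inactive cherished file reaches the Archive only under the present rule, while the explicitly archived file reaches it under both.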
(We know from a recent analysis of actual user restores that over two-thirds of archived files still in the Archive filestore have never been restored; we also know from spot checks throughout the years that most of these files have drifted to Archive rather than having been positively requested to go.)

Since it is obviously not acceptable to extend the weekly deletion of unused uncherished files to encompass the unused cherished files as well, we are looking at ways of "softening" the introduction of this change. The method we are considering at present is simply for the system not to tidy up unused cherished files from the on-line discs at all (although the unguaranteed "HAZARDed" files would still be liable to be deleted at the discretion of the System Management if necessary).

What we are hoping to do is to design a strategy which would encourage users to review their inactive files regularly, archiving those they really do wish to keep but destroying many others which would (under the present regime) simply drift out to Archive. Such a strategy would have to ensure that users did not try to keep their files on-line for ever, and we are considering methods of discouraging this, initially by warnings at log-on and then by preventing users from doing "new" work if (say) they have gone over a certain percentage threshold of inactive files in their on-line index; the sole aim of this would be to encourage good housekeeping.

We would propose to try this strategy out as the Emas-3 service on the Amdahl (EMAS-A) is being built up, in order to gauge user reaction to the new mechanisms in practice. It is currently thought that the procedures on Emas-2 services are unlikely to change.

The move from Emas-2 to Emas-3

How would a decision not to perpetuate the Emas-2 Archive affect users' movement to the Amdahl's Emas-3 service?
One way would be to request users who were moving to review, first of all, all their archived files on the 2900 service and either discard them or TRANSFER them to their Amdahl index themselves.

The other way would be for ERCC to provide a transfer aid to allow the Emas 2900 Archive to be accessed on the Amdahl service. A suggested method, which would not compromise the basic decision not to compact and perpetuate the 2900 Archive, would be to allow users to RECOVER those files which they still wanted to their on-line index on the Amdahl, once only, from a special copy of the "2900 archive index", with the "recovered" file then being automatically discarded from this Amdahl copy.

The first option would involve users in more work but would be "successful" in a shorter period of time. The second is the more friendly, and therefore the more attractive, especially for users with large archives, but it prolongs the active use of the Emas-2 Archive tapes.

In either event we should make it clear that ERCC would retain at least one copy of the old archive tapes (in store) for one or two years after the closure of the relevant service (as was done following the closure of the twin 4-75 installation).

Summary

The above proposals are meant to cut the size (and therefore the on-going problem) of the Emas Archive until "optical disc" media are available at economic cost, when perhaps another review of what should be "archived" could be undertaken (although there is a body of opinion which believes we should reconsider our archiving strategy now, even if we could solve our present problems simply by buying a "juke-box" of optical discs).

In any event, although it does not help the immediate physical problem, it would seem sensible if users in general began an unhurried appraisal of their current Archive index soon, since it can only help the long term aims discussed here.

A. McKendrick
C.D.McArthur
24th May 1985