
Contingency and Reversion Plan


This plan lays out the policies, commitments and intentions of HPCx in response to a number of possible events. Where appropriate, the details would be finalised at the time in consultation with EPSRC and user representatives. In some cases the details and timetables for carrying out these intentions would depend upon financial provision by our funding bodies, insurance companies, etc.

Loss of the administrative database: If the service's database is lost it will be restored from backup within one working day.

Loss of homespace data: In the event of the loss of homespace data files as a result of user error or non-catastrophic disk drive failure, the data will be restored from backup within one working day.

Loss of tape media: In the event of the destruction of all or some of the on-site tapes, the second copy of the HSM file store, which will be kept off-site, can be used.

Disk failure: In the event of a catastrophic failure of a disk drive, homespace data files (which are backed up) will be recovered within three days, using the off-site copies of the backup tapes if required. If the entire disk store, or a substantial part of it, is lost, we will try to use similar disk resources elsewhere, while replacing the equipment as soon as possible. IBM has the capability to provide access to LTO drives and disks at its Greenford site in order to make this possible.

Loss of the tape store: In the event of the destruction of the LTO tape store or a large part of it, we will obtain a replacement from IBM as soon as possible. In the meantime we will, if possible, use a comparable LTO tape store in the UK. IBM has a comparable LTO tape store at its Greenford site.

Storage capacity loss: In the event of the destruction of part of the disk store or of the LTO tape store, the Service will continue in a reduced mode while replacements are obtained.

Loss of compute nodes: In the event of the destruction of some of the compute nodes, the Service will continue in a reduced mode while replacements are obtained. In the event of the destruction of all, or the great majority of the compute nodes, we will try to get access to a replacement resource, but as there are few systems of similar power in the world, this is likely to be problematic.

Damage to support premises: Support activities will be distributed between the premises of the University of Edinburgh and CCLRC. In the event of a major disruption at either site, it will be possible to continue support without significant interruption by temporarily transferring personnel and activities to the other site. This includes activities such as training courses, workshops and in-depth projects.

Damage to system premises: In the event of damage to the building where the system is housed, repairs will be carried out as quickly as possible. In the meantime, as far as possible, the service will continue, in a reduced mode if necessary.

Destruction of the system premises: If the building where the system is housed is totally destroyed, rebuilding could take as long as a year. We will consider whether to transfer operations to the CCLRC's site at the Rutherford Appleton Laboratory, or to the University of Edinburgh's Bush Estate site; this might be preferable to rebuilding on site, especially if a large proportion of the hardware is destroyed as well.

Loss of staff: We are confident that the current staff roster of the service is capable of covering for the loss of any individual member of staff. In the event of the simultaneous loss of several members of staff, the service would be able to draw on the staff of EPCC and CCLRC as a whole.

Cessation of the service: Should the HPCx partners be unable to maintain the service for whatever reason, or should their performance be judged inadequate by the standards laid down in the contracts, operation of the service will be assumed by EPSRC. Users' data will be transferred to EPSRC's care. Arrangements have been made to ensure that third-party software licences would be transferred as well. Software originating from within the service, including the website, will be regularly placed in escrow.

November, 2002

http://www.hpcx.ac.uk/services/policies/contingency.html contact email - www@hpcx.ac.uk © UoE HPCX Ltd