
Data Backup Policies


System backup

Data stored in all Users' designated homespaces (in general, 10% of the total storage) shall be backed up regularly on the following schedule:

Backups will be taken of all users' home directory space and, depending on the balance of file system allocation, some work directory space. They will be stored on re-usable LTO media. The volume of data eligible for backup is 5TB (10% of the total). Incremental backups will be taken once per day. It is assumed that the daily incremental change will be 10-20%, which gives a daily incremental volume of at most 1TB. If the daily change exceeds 20%, the Service will be enhanced to meet the requirements.
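The incremental volume quoted above follows directly from the 5TB eligible volume and the assumed 10-20% daily change. A minimal sketch of that arithmetic is given below; the function and constant names are illustrative only and are not part of the Service.

```python
# Sketch of the daily incremental volume estimate.
# ELIGIBLE_TB and the change rates come from the policy text;
# the function name is a hypothetical helper for illustration.

ELIGIBLE_TB = 5.0  # volume of data eligible for backup


def daily_incremental_tb(eligible_tb, change_rate):
    """Return the incremental backup volume in TB for a given daily change rate."""
    return eligible_tb * change_rate


low = daily_incremental_tb(ELIGIBLE_TB, 0.10)   # assumed lower bound of daily change
high = daily_incremental_tb(ELIGIBLE_TB, 0.20)  # assumed upper bound of daily change
print(f"Daily incremental volume: {low:.1f}TB to {high:.1f}TB")
```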

In addition to the daily incremental backup a weekly snapshot or full backup of the 5TB will also be taken. These full backups will be maintained for a period of 4 weeks - i.e. after the full backup for week 5 is taken, the full backup for week 1 will be expired leaving 4 copies.
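The four-week retention rule can be illustrated with a short sketch. It assumes full backups are identified simply by week number; the function name is hypothetical and this is not a description of how TSM itself expires data.

```python
# Sketch of the retention rule: keep the 4 most recent weekly full backups,
# expire anything older. Week numbers stand in for real backup identifiers.

def apply_retention(weeks_taken, keep=4):
    """Split weekly full backups into (retained, expired) lists."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ordered = sorted(weeks_taken)
    return ordered[-keep:], ordered[:-keep]


retained, expired = apply_retention([1, 2, 3, 4, 5])
print("retained:", retained)  # weeks 2 to 5
print("expired:", expired)    # week 1
```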

In order to spread the workload of the full backup throughout the week, the 5TB will be split into sections of approximately 1TB which will be backed up on different days. It is assumed that each of these 1TB storage sections will itself be made up of a small number of different file systems. On a day when a 1TB section is being backed up in full it will not also be backed up incrementally. The worst case backup volume on any one day is therefore 1TB of full backup plus 0.8TB of incremental backup (20% of the remaining 4TB).
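The worst case daily volume can be reproduced as follows. This is a sketch under the figures stated in the policy (5TB eligible, 1TB sections, 20% maximum daily change); the function name is illustrative.

```python
# Sketch of the worst case daily backup volume: one section is backed up
# in full and is excluded from that day's incremental backup.

def worst_case_daily_tb(eligible_tb=5.0, section_tb=1.0, max_change=0.20):
    """Return (full, incremental, total) volumes in TB for the worst case day."""
    full = section_tb
    incremental = (eligible_tb - section_tb) * max_change
    return full, incremental, full + incremental


full, incremental, total = worst_case_daily_tb()
print(f"full={full:.1f}TB incremental={incremental:.1f}TB total={total:.1f}TB")
```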

The aim is to complete the backup process (full and incremental) within a single 8 hour shift and to create the duplicate copies during the remaining two 8 hour shifts. An IBM LTO tape drive has a native data transfer rate of 15MB/s assuming no data compression. Taking into account transactional overhead (TSM) and allowing some margin for error, the sustainable transfer rate for each drive is taken as 10MB/s. Backing up 1.8TB of data in an 8 hour period requires an aggregate rate of roughly 62.5MB/s, so at a sustained 10MB/s per drive a minimum of 7 tape drives is needed, on the assumption that 7 streams of data can be maintained throughout that period. Given that the 5TB of storage is split into a number of 1TB sections, which are further split into a number of file systems, this will not be an issue.
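The drive count can be checked with a short calculation. The sketch below assumes decimal units (1TB = 1,000,000MB), which the policy does not state explicitly; the function name is illustrative.

```python
import math

# Assumption: decimal units, 1TB = 1,000,000MB.
MB_PER_TB = 1_000_000


def drives_needed(volume_tb, window_hours, rate_mb_s):
    """Return the minimum number of drives needed to move volume_tb in window_hours."""
    volume_mb = volume_tb * MB_PER_TB
    per_drive_mb = rate_mb_s * window_hours * 3600  # MB one drive moves in the window
    return math.ceil(volume_mb / per_drive_mb)


# Worst case day (1.8TB) in one 8 hour shift at the sustainable 10MB/s rate.
print(drives_needed(1.8, 8, 10))
# The same volume at the native 15MB/s rate, for comparison.
print(drives_needed(1.8, 8, 15))
```

The difference between the two results shows how much of the drive requirement comes from the margin allowed for TSM overhead.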

Duplicate copies of the full and incremental backup data will be created each day and stored outside the tape library in a safe off-site location. Using 12 tape drives (6 -> 6) to create the tape copies, and assuming a sustainable transfer rate of 10MB/s, it will be possible to duplicate the full and incremental backup data in a 9 hour period.
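The 9 hour figure can be sanity-checked as below. The sketch reads "(6 -> 6)" as six drives reading while six write, giving six parallel copy streams; that reading, the decimal units (1TB = 1,000,000MB) and the function name are assumptions.

```python
# Assumption: decimal units, 1TB = 1,000,000MB.
MB_PER_TB = 1_000_000


def duplication_hours(volume_tb, copy_streams, rate_mb_s):
    """Return the hours needed to copy volume_tb over parallel tape-to-tape streams."""
    seconds = volume_tb * MB_PER_TB / (copy_streams * rate_mb_s)
    return seconds / 3600


# Worst case day (1.8TB) over 6 streams at the sustainable 10MB/s rate.
hours = duplication_hours(1.8, 6, 10)
print(f"{hours:.2f} hours")
```

Under these assumptions the copy takes a little over 8 hours, which fits within the 9 hour period quoted.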

Tape reclamation for the onsite and duplicate tapes will be performed during off-peak hours and at weekends.

The HSM storage will require an amount of GPFS disk space to act as cache. The initial size of this cache will depend on user storage and recall patterns and will most likely be at least 1TB in size. The size of this cache file system may need to be changed over time as a better understanding of user activity is gained. It will be desirable to have more than one cache file system as this will improve the scalability of the number of HSM files which can be managed. It is assumed that the files stored under HSM management will tend to be large - average 50-100MB or greater in size.
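To give a feel for the number of files a single cache file system would hold, the sketch below divides the initial 1TB cache by the assumed average file sizes. It uses decimal units (1TB = 1,000,000MB) and ignores file system overhead; both are simplifying assumptions, and the function name is illustrative.

```python
# Assumption: decimal units, 1TB = 1,000,000MB; file system overhead ignored.
MB_PER_TB = 1_000_000


def files_in_cache(cache_tb, avg_file_mb):
    """Return roughly how many files of avg_file_mb fit in a cache of cache_tb."""
    return int(cache_tb * MB_PER_TB // avg_file_mb)


for avg_mb in (50, 100):
    print(f"average {avg_mb}MB: about {files_in_cache(1, avg_mb)} files")
```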

Every file written to the tape storage by HSM will subsequently be copied, so that there will be a duplicate of every file. This will enable data to be recovered in the event of, for example, a media failure. Creation of the duplicate HSM data will be performed throughout the day; user activity will take priority and preempt duplication processes as required for access to tapes and tape drives.

The storage hardware underpinning these policies is described separately.

The backup capability will evolve with the storage such that the requirement is met in each Phase of the Service.

In the event of a catastrophic system failure, off-site backed up data shall be made available to Users within 3 working days. In this case, data will be recovered from the secondary backup set stored off-site within the required time.

In the event of a non-catastrophic system failure or user error, on-site backed up data shall be made available to Users within 1 working day. In this case, data will be recovered from the primary backup set held on-site within the required time.
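The two recovery targets can be turned into calendar deadlines as sketched below. The policy does not define a working day, so the sketch assumes Monday to Friday with no public holidays; the function name and the example date are illustrative.

```python
from datetime import date, timedelta

# Assumption: working days are Monday to Friday; public holidays are ignored.


def add_working_days(start, days):
    """Return the date that falls the given number of working days after start."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


failure = date(2002, 10, 4)  # a Friday, chosen for illustration
print("non-catastrophic target:", add_working_days(failure, 1))
print("catastrophic target:", add_working_days(failure, 3))
```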

October 2002

http://www.hpcx.ac.uk/services/policies/data_backup.html contact email - www@hpcx.ac.uk © UoE HPCX Ltd