What is Disaster Recovery?


Disaster recovery can be defined as a set of policies and procedures that enable the recovery or continuation of vital technology systems and infrastructure following a natural or human-induced disaster. It focuses on the IT systems that support critical business functions, so that the essential aspects of the business keep functioning despite significant disruptive events.

Organizations cannot always avoid disasters, but with careful planning the effects of a disaster can be minimized. The objective of a disaster recovery plan is to minimize downtime and data loss. Its primary purpose is to protect the organization in the event that all or part of its operations and computer services are rendered unusable. The plan minimizes the disruption of operations and ensures that some level of organizational stability and an orderly recovery will prevail after the disaster. Downtime and data loss are measured in terms of two concepts: the recovery time objective (RTO) and the recovery point objective (RPO).

The recovery time objective is the time within which a business process must be restored after a major incident has occurred.

The recovery point objective is the maximum age of the files that must be recovered from backup storage for normal operations to resume if a computer, system, or network goes down as a result of a major incident.
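To make the two objectives concrete, here is a minimal Python sketch that checks one incident against both. The function names, the timestamps, and the four-hour targets are all hypothetical values chosen for illustration; they are not taken from any standard or tool.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup: datetime, incident: datetime, rpo: timedelta) -> bool:
    """Return True if the age of the newest backup at incident time is within the RPO."""
    return incident - last_backup <= rpo


def meets_rto(incident: datetime, restored: datetime, rto: timedelta) -> bool:
    """Return True if the business process was restored within the RTO."""
    return restored - incident <= rto


# Hypothetical incident timeline (illustrative values only).
incident = datetime(2024, 1, 10, 14, 0)
last_backup = datetime(2024, 1, 10, 11, 30)  # backup is 2.5 hours old at incident time
restored = datetime(2024, 1, 10, 19, 0)      # service is back 5 hours after the incident

rpo_ok = meets_rpo(last_backup, incident, timedelta(hours=4))
rto_ok = meets_rto(incident, restored, timedelta(hours=4))
print("RPO met:", rpo_ok)
print("RTO met:", rto_ok)
```

In this made-up timeline the backup is recent enough to satisfy a four-hour RPO, but the five-hour restoration misses a four-hour RTO, which shows that the two objectives are measured independently.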

The Development of Disaster Recovery

Disaster recovery developed in the late 1970s as computer center managers began to recognize the dependence of their organizations on their computer systems. At that time, most systems were batch-oriented mainframes that could be down for several days before significant damage was done to the organization.

As awareness grew of the potential business disruption that would follow an IT-related disaster, the disaster recovery industry developed to provide backup computer centers, with Sun Information Systems becoming the first major US commercial hot site vendor in 1978. (Sun Information Systems later became Sungard Availability Services.)

During the 1980s and 1990s, customer awareness and the industry both grew rapidly, driven by the advent of open systems and real-time processing, which increased organizations' dependence on their IT systems.

With the rapid growth of the Internet during the 1990s and 2000s, organizations of all sizes became dependent on the continuous availability of their IT systems. This increasing dependence, together with the increased awareness brought by large-scale disasters such as tsunamis, floods, earthquakes, and volcanic eruptions, spawned disaster recovery-related products and services ranging from high-availability solutions to hot-site facilities.

The rise of cloud computing since 2010 continues that trend: nowadays it matters even less where computing services are physically served, just so long as the network itself is sufficiently reliable. Recovery as a Service (RaaS) is one of the security features of cloud computing promoted by the Cloud Security Alliance.

Disaster Recovery Planning

Disaster recovery planning is a subset of a larger process known as Business Continuity Planning (BCP). It includes planning for the resumption of applications, data, hardware, networking, and other IT infrastructure. A business continuity plan also includes planning for non-IT aspects such as key personnel, facilities, crisis communication, and reputation protection.

Business Continuity Plans

  • Disaster Recovery Plan
  • Business Resumption Plan
  • Continuity of Operations Plan
  • Occupant Emergency Plan
  • Incident Management Plan

IT Disaster Recovery Control Measures

These control measures can be classified into the following three types:

  1. Preventive Measures:

Controls aimed at preventing an event from occurring.

  2. Corrective Measures:

Controls aimed at correcting or restoring systems after a disaster or event.

  3. Detective Measures:

Controls aimed at detecting or discovering unwanted events.

A good disaster recovery plan dictates that these three types of controls be documented and exercised regularly through so-called “DR tests”.

Disaster Recovery Benefits

As with any insurance plan, there are benefits to be gained from drafting a disaster recovery plan, such as:

  • Providing a sense of security.
  • Guaranteeing the reliability of standby systems.
  • Minimizing the risk of delays.
  • Providing a standard for testing the plan.
  • Reducing potential legal liabilities.
  • Minimizing decision-making during a disaster.
  • Reducing unnecessary stress in the work environment.

Disaster Recovery Planning Methodology

  • Obtaining top management commitment

For a disaster recovery plan to succeed, central responsibility for the plan must reside with top management. Management is responsible for coordinating the disaster recovery plan and ensuring its effectiveness within the organization. It is also responsible for allocating the time and resources required to develop an effective plan. The resources that management must allocate include both financial considerations and the effort of all personnel involved.

  • Establishing a planning committee

A planning committee should oversee the development and implementation of the plan. The committee should include representatives from the different functional areas of the organization. Key committee members customarily include the operations manager and the data processing manager. The committee also defines the scope of the plan.

  • Performing a risk assessment

The planning committee prepares a risk analysis and a Business Impact Analysis (BIA) covering a range of possible disasters, including natural, technical, and human threats. Each functional area of the organization is then analyzed to determine the potential consequences and impact associated with the various disaster scenarios. The risk assessment process also evaluates the safety of critical documents and vital records. The planning committee may also analyze the costs of minimizing potential exposures.

  • Establishing priorities for processing and operations

At this point, the critical needs of each department in the organization should be evaluated and prioritized. Establishing priorities is important because no organization possesses infinite resources, so criteria must be set for where to allocate resources first. Areas often reviewed during prioritization include functional operations, key personnel and their functions, information flow, processing systems used, services provided, existing documentation, historical records, and the department’s policies and procedures.

One method of determining a department’s critical needs is to document all the functions that the department performs. Once the primary functions have been identified, the operations and processes are ranked in order of priority: essential, important, and non-essential.
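The ranking step can be sketched in a few lines of Python. This is only an illustration under assumptions: the `BusinessFunction` record, the department and function names, and the tier assigned to each function are hypothetical, and the three tiers are simply ordered from most to least critical.

```python
from dataclasses import dataclass

# Lower number means the function is recovered earlier.
PRIORITY_ORDER = {"essential": 0, "important": 1, "non-essential": 2}


@dataclass
class BusinessFunction:
    department: str
    name: str
    priority: str  # one of the keys in PRIORITY_ORDER


def rank_functions(functions):
    """Return the functions sorted by priority tier, most critical first.

    sorted() is stable, so functions in the same tier keep their documented order.
    """
    return sorted(functions, key=lambda f: PRIORITY_ORDER[f.priority])


# Hypothetical inventory of documented functions (illustrative values only).
documented = [
    BusinessFunction("Marketing", "Newsletter mailing", "non-essential"),
    BusinessFunction("Finance", "Payroll processing", "essential"),
    BusinessFunction("Sales", "Quote generation", "important"),
    BusinessFunction("Operations", "Order fulfilment", "essential"),
]

ranked = rank_functions(documented)
for f in ranked:
    print(f"{f.priority:<14}{f.department:<12}{f.name}")
```

Keeping the tiers in a single lookup table means the ordering rule lives in one place, so a committee that adds or renames a tier only has to change `PRIORITY_ORDER`.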

  • Determining recovery strategies

During this step, the most practical alternatives for processing in case of a disaster are researched and evaluated. All aspects of the organization are considered, including physical facilities, computer hardware and software, communications links, data files and databases, customer services provided, user operations, the overall management information systems structure, end-user systems, and any other processing operations.

Alternatives, which depend on the evaluation of the computer function, may include hot sites, warm sites, cold sites, reciprocal agreements, provision of more than one data center, installation and deployment of multiple computer systems, duplication of the service center, consortium arrangements, lease of equipment, or any combination of the above.

  • Collecting data

Recommended data-gathering materials and documentation include various lists (critical telephone numbers, master call list, master vendor list, notification checklist, employee backup position listing), inventories (communications equipment, documentation, forms, insurance policies, microcomputer hardware and software, office equipment, office supplies, off-site storage location equipment, telephones, workgroup and data center computer hardware, and so on), a distribution register, temporary location specifications, backup and retention schedules for software and data files, and any other relevant lists, materials, inventories, and documentation. Pre-formatted forms are often used to facilitate the data-gathering process.

  • Developing testing procedures and criteria

Disaster recovery plans must be tested and evaluated on a regular basis, at least annually. A thorough disaster recovery plan includes documentation of the procedures for testing the plan. These tests give the organization assurance that all necessary steps are included in the plan. Other reasons for testing include:

  • Providing motivation for maintaining and updating the disaster recovery plan.
  • Determining the feasibility and compatibility of backup facilities and procedures.
  • Providing training to team managers and members.
  • Identifying areas in the plan that need modification.
  • Demonstrating the ability of the organization to recover.

  • Testing the plan

After the testing procedures have been completed, an initial “dry run” of the plan is performed by conducting a structured walk-through test. The test provides additional information about any further steps that may need to be included, changes to procedures that are not effective, and other appropriate adjustments. These may not become evident unless an actual dry-run test is performed. The plan is subsequently updated to correct any problems identified during the test. Initially, the plan is tested in sections and after normal business hours to minimize disruption to the overall operations of the organization. As the plan is further polished, future tests can take place during normal business hours.

Types of tests:

  • Checklist tests.
  • Full interruption tests.
  • Parallel tests.
  • Simulation tests.

  • Obtaining plan approval

Once the disaster recovery plan has been written and tested, it is submitted to management for approval. It is top management’s ultimate responsibility that the organization has a documented and tested plan. Management is responsible for establishing the policies, procedures, and responsibilities for comprehensive contingency planning, and for reviewing and approving the contingency plan annually, documenting such reviews in writing.

Organizations that receive information processing from a service bureau will also need to evaluate the adequacy of the service bureau’s contingency plan and ensure that their own contingency plan is compatible with it.

Disaster Recovery Plan Controversies

Because of their high cost, disaster recovery plans are not without critics. The following are some common mistakes that organizations make in disaster recovery planning:

  • Incomplete RTOs (Recovery Time Objectives) and RPOs (Recovery Point Objectives).
  • Lax security.
  • Lack of buy-in.
  • Systems myopia.
  • Outdated plans.

That’s all for now. Hope you enjoyed this tutorial.