Disaster Recovery could be a Disaster

At the moment I am preparing to do what could be considered a high-risk, high-impact change at work: upgrading the schema of our Active Directory environment, followed by raising the forest functional level to Windows Server 2008. As part of this I am developing a Disaster Recovery (DR) plan in case it all goes wrong. Here are some of the things worth considering for a DR Plan:

Access to the Documentation

In the event that it all goes wrong, are you able to readily access the DR documentation? I would be very concerned if your answer at this point is no. Ideally DR Plans are in hardcopy, or located on a PC that does not require network connectivity to access the documents. Consider keeping multiple copies in case the disaster itself destroys your documentation or makes it unavailable (e.g. a fire), but make sure there are processes in place to keep every copy current.
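One way to keep multiple copies honest is to compare each copy against a master on a schedule. The following is a minimal sketch, assuming the plan lives in a single file and that the copies are reachable as ordinary file paths; the function names and any paths you pass in are hypothetical, not part of any particular tool.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def stale_copies(master, copies):
    """Return the copies that are missing or differ from the master."""
    expected = file_digest(master)
    stale = []
    for copy in copies:
        copy = Path(copy)
        # A copy that cannot be found is as bad as one that is out of date.
        if not copy.is_file() or file_digest(copy) != expected:
            stale.append(copy)
    return stale
```

Anything returned by `stale_copies` needs to be refreshed (or reprinted, if it feeds a hardcopy). This only covers electronic copies; a dated cover page is still the simplest check for paper.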

Documentation Standard

Can other members of your group follow the instructions to recover the affected systems? Writing documentation is all well and good, but if only you are able to follow it then there is a key dependency issue. What happens if the disaster occurs after you have left the company or while you are on holiday? DR documentation should be concise and written for its target audience.

Practice Makes Perfect

Is your DR Plan all theoretical? System Admins know all too well that theoretical things rarely work. It might be time to set up an isolated test environment, arrange for it to “break”, and run through the DR Plan with the team. A couple of important things will come from this: the people involved will become familiar with the DR Plan, and you will get a measured figure for how long it takes to recover the system.

Certainly I have discovered in the process of building the test domain that things are not as simple as they seem. A day’s work was recently wasted when I broke the PDCe in the test domain. Better there than in production...

Communication & Milestones

Don’t forget you probably have a helpdesk and managers taking a lot of heat from other employees and managers about the situation. Make sure that as part of your DR plan you have milestones at which someone must take the time to communicate progress to others. In the case of an AD failure, some of my milestones will include:

  • Recovery of the first Domain Controller
  • First Successful Replication to a sibling DC
  • Completely recovered
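Milestones like these are also a natural place to record timings, both for status updates and for measuring recovery time during a practice run. Below is a rough sketch of a milestone log, assuming someone notes the time each milestone is reached; the `RecoveryLog` class and the timestamps in the usage example are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class RecoveryLog:
    started: datetime
    reached: list = field(default_factory=list)  # (milestone, timestamp) pairs

    def mark(self, milestone, when):
        """Record that a milestone was reached at the given time."""
        self.reached.append((milestone, when))

    def report(self):
        """Return one status line per milestone, with minutes since the start."""
        lines = []
        for milestone, when in self.reached:
            minutes = int((when - self.started).total_seconds() // 60)
            lines.append(f"{milestone}: reached {minutes} min after start")
        return lines


# Hypothetical timings from a practice run.
log = RecoveryLog(started=datetime(2024, 1, 1, 9, 0))
log.mark("Recovery of the first Domain Controller", datetime(2024, 1, 1, 10, 30))
log.mark("First successful replication to a sibling DC", datetime(2024, 1, 1, 11, 15))
for line in log.report():
    print(line)
```

The output of `report()` can be handed straight to whoever is doing the communicating, which keeps the people fixing the problem out of the update loop.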

It is also important to establish lines of communication. Usually in the event of DR, all communication should go via a supervisor/manager. That way the people fixing the problem are not unnecessarily interrupted, and you avoid different people giving differing accounts of the situation.

Involve Everyone

An addendum to communication and practice is to involve everyone who would be involved in the event of a disaster. In all likelihood, the helpdesk will be the first to be alerted that something is wrong, so on the day you want to test your DR Plan, get them to trigger the disaster by alerting the appropriate team to the problem. Which other stakeholders should be involved? Who does the testing to make sure everything is working again? Everyone who needs to be involved, needs to be involved.

An IT disaster should not be a time to panic. There should be a well-organised process to recover the system, and everyone should know what to do. If you have not got a tested DR Plan, then that Disaster Recovery could be a disaster in itself.

Do you have any other tips for creating a DR Plan? Share them below.