His office is on the seventh floor of a building that's nowhere near a floodplain, so Robert Rosen had no particular fear of water damage to his IT equipment. But one weekend, in the office next door, the water filter in an office kitchen cracked, sending a stream of water onto the floor and under the wall into his facilities.
"It flooded our offices and made it up to the server room door. It made us shut our office down," said Rosen, CIO at the National Institute of Arthritis and Musculoskeletal and Skin Diseases. The agency is part of the National Institutes of Health at the U.S. Department of Health and Human Services in Bethesda, Md.
Although critical servers remained dry, the flood ruined equipment that was on the office floor, including 10 surge protectors, six uninterruptible power supplies, six power bricks and one PC. While things were drying out and a length of wallboard was replaced, Rosen implemented a disaster recovery plan that was crafted for an entirely different contingency.
"We didn't plan for someone else's office to take us down, but we used a plan we had in place for pandemic flu. People teleworked for three weeks," Rosen said. "The telework plan saved us; otherwise we'd have been dead. We didn't have enough overflow space for the staff."
Floods, fires, power failures and pandemic flu can happen. Every IT professional must envision the impact of such disasters on company operations and devise tactics to deal with them. But first, take a step back and start with a comprehensive assessment of all the risks your business faces, of which IT vulnerabilities are an important part.
"In a risk assessment, there is a whole lot more to it than people think," Rosen said. Every organization is different; because Rosen's agency is involved in scientific research, IT equipment is only one concern. "We need to think about lab rats," he said.
Step one: Form a committee
For most companies, the best approach is to form a committee of executives and managers, each with a stake in preventing disasters and in getting the company back on its feet. The cast typically includes the chief risk officer and C-level managers from facilities, operations, finance, legal, human resources, public relations, investor relations, physical plant security and IT.
When such a group is formed, it must have the blessing and support of top management -- with a commitment ahead of time, if possible, to put into action the disaster recovery plan created by the group. Support also should include internal publicity to raise awareness corporate-wide of the committee and its mission. Once the plan is in place, it is critical that all employees understand its purpose and their roles in carrying it out.
But it's not as easy as it might seem to convene a risk-assessment task group, come up with findings and act on them. According to Stamford, Conn.-based research firm Gartner Inc., obstacles include the following:
- A lack of demonstrated benefit of enterprise risk assessments.
- A lack of skilled personnel to perform assessments.
- Risk assessments that don't generate specific, implementable recommendations.
- Risk-assessment methodologies that are too time-consuming.
What's more, said F. Christian Byrnes, an analyst at Gartner, there is often a communication disconnect between those with technical and nontechnical backgrounds. In response, he said that IT staff should seize the opportunity to take a leadership role. "When it comes time to make decisions with input from multiple sources, the discussion needs to be led by disaster recovery, security or IT staff," Byrnes said.
The right tools
To help enterprises overcome these obstacles, various vendors sell risk-assessment software and services. The major vendors are SAS Institute Inc., IBM and SunGard; smaller vendors and consultancies include Coop Systems, eBRP Solutions Inc. and Paisley Software.
In addition, IT professionals can avail themselves of standard methodologies for risk assessment, including COBIT (Control Objectives for Information and related Technology) and ITIL (the IT Infrastructure Library). For its part, Gartner has created the Gartner Risk Assessment Method (GRAM), a way of understanding risks.
Whether such aids are necessary is open to debate. "We have not looked at risk-assessment software," said Brian Jaffe, the IT director at a New York City-based media company and author of the book IT Manager's Handbook. "My fear is that a lot of it might just be boilerplate. Most of it is common sense."
"We never used tools for risk assessment, but we had lots of discussions with end users and the IT team. We evaluated the impact of particular resources and applications being unavailable and the risk of the probability of something happening to that application," Jaffe said.
Whether software tools, consultants or methodologies are used, the resulting plan should include the input of both IT and business people in assigning varying levels of importance to different business processes and data. Some types of data will need to be recovered instantly; other types can go several hours or days without being available; still other data can be lost entirely with no adverse effect on a business.
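The kind of evaluation Jaffe describes, weighing the impact of a resource being unavailable against the likelihood of something happening to it, can be sketched in a few lines of Python. This is only an illustration: the 1-to-5 rating scales, the multiplication of impact by likelihood and the resource names are assumptions, not something Jaffe or the vendors mentioned here prescribe.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    impact: int      # assumed scale: 1 (minor) to 5 (severe) if unavailable
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def score(self) -> int:
        # A common simplification: risk score = impact x likelihood.
        return self.impact * self.likelihood


def rank_by_risk(resources):
    """Return resources sorted from highest to lowest risk score."""
    return sorted(resources, key=lambda r: r.score, reverse=True)


# Hypothetical inventory; names and ratings are illustrative only.
inventory = [
    Resource("archive-reports", impact=1, likelihood=2),
    Resource("order-entry", impact=5, likelihood=3),
    Resource("email", impact=3, likelihood=4),
]

ranked = rank_by_risk(inventory)
for r in ranked:
    print(f"{r.name}: {r.score}")
```

The ratings themselves would come out of the discussions between IT and business people; the code only keeps the resulting list in priority order.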
John Stevenson, an independent consultant in Plano, Texas, and a former CIO at Avaya Inc., Sharp Corp., Bristol-Myers Squibb and other companies, said the key question is how fast a company loses revenue when a system is down.
"If you are unable to conduct a transaction, what is the likelihood of loss of customers and what is the impact of that loss on your business?" A travel website, he said, would lose revenue instantly if its service were to go offline. And, he noted, the risk posed by an outage to a company's reputation must be taken into account as well.
While he was CIO at Avaya, Stevenson weighed the costs and benefits of having dual data centers that could fail over to each other in case of disaster. "You want to get to zero [downtime], but what's the cost?" he said. While the cost was high, the benefit of minimal downtime to Avaya's 900,000 customers was also high. Stevenson found that by automating much of Avaya's call center operations he could create sufficient savings to afford the two data centers.
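The trade-off Stevenson describes can be put into rough arithmetic. The Python sketch below compares the loss a second data center would avoid with what the site costs once offsetting savings are counted. Every figure is hypothetical, and the model assumes the second site eliminates downtime entirely and ignores harder-to-price effects such as damage to reputation.

```python
def expected_annual_loss(outage_hours_per_year: float, loss_per_hour: float) -> float:
    """Revenue expected to be lost to downtime in a year."""
    return outage_hours_per_year * loss_per_hour


def second_site_pays_off(
    outage_hours_per_year: float,
    loss_per_hour: float,
    site_cost_per_year: float,
    offsetting_savings: float = 0.0,
) -> bool:
    """True if the avoided loss covers the net yearly cost of a second site.

    Assumes the second site removes all downtime, which is a simplification.
    """
    avoided = expected_annual_loss(outage_hours_per_year, loss_per_hour)
    net_cost = site_cost_per_year - offsetting_savings
    return avoided >= net_cost


# Hypothetical figures, for illustration only.
loss = expected_annual_loss(outage_hours_per_year=8, loss_per_hour=250_000)
without_savings = second_site_pays_off(8, 250_000, site_cost_per_year=3_000_000)
with_savings = second_site_pays_off(
    8, 250_000, site_cost_per_year=3_000_000, offsetting_savings=1_500_000
)
print(loss, without_savings, with_savings)
```

In this made-up case the second site does not pay for itself until savings found elsewhere are counted against its cost, which is the shape of the argument Stevenson made at Avaya.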
Stevenson noted that the risks to which a disaster recovery site is exposed must be considered as well. For example, he said, it might be unwise to locate both a primary data center and a disaster recovery site in the Northeast, a region that has historically been vulnerable to winter storms and summer spikes in energy consumption.
Different plans for different threats
As Rosen's experience demonstrates, different and unexpected kinds of disasters may strike -- and each entails different risks and remedies. Pandemic flu, for example, requires a plan that enables employees to work remotely so they are not exposed to other people who may carry germs. It's a situation in which people are more at risk than corporate data.
But the loss of a facility through fire or flood is more likely to jeopardize data than people, meaning critical data must be backed up and available at a secure disaster recovery facility.
"Disasters tend to be localized and shorter in duration. Pandemics last longer and are not localized," said Thomas Grobicki, the CEO of Avilar Technologies Inc., a provider of competency management and e-learning solutions. Other threats, including a terrorist attack, could threaten people and data at the same time.
When data must be continuously available even if a primary site and its people are lost, a remote facility with data that is mirrored from the principal site in real time is required. Since people are needed as well, it is advisable to operate the secondary site with a regular staff and, from time to time, have it perform as the primary site.
When some downtime is acceptable, a remote site that is backed up daily either by data transmission or by transporting tape cartridges is the best solution.
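The choice between the two kinds of remote site comes down to how much downtime a given system can tolerate. A minimal Python sketch of that decision follows. The one-hour cut-off and the `expendable` flag, which stands for data that can be lost with no adverse effect, are assumptions made for illustration; a real plan would set its own thresholds.

```python
from datetime import timedelta

# Assumed cut-off below which only real-time mirroring will do.
MIRROR_THRESHOLD = timedelta(hours=1)


def choose_strategy(tolerable_downtime: timedelta, expendable: bool = False) -> str:
    """Map a system's tolerable downtime to a recovery approach."""
    if expendable:
        return "no recovery needed"
    if tolerable_downtime < MIRROR_THRESHOLD:
        return "real-time mirrored remote site"
    return "remote site with daily backup"


print(choose_strategy(timedelta(0)))
print(choose_strategy(timedelta(days=2)))
print(choose_strategy(timedelta(days=30), expendable=True))
```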
One of the most important factors to keep in mind is the need to contact people in an emergency. This task isn't as easy as it might seem; employees come and go and their contact information changes. But when disaster strikes, these key individuals and their knowledge can save a company. "Who is available? What skills do you have—or have [you] lost? If someone dies, whom can you replace them with?" Grobicki said.
"Communication is certainly the key, both in the planning and the execution," said Jaffe. He uses a service called Send Word Now from SWN Communications Inc. into which he puts contact information, including cell phone and beeper numbers. When disaster strikes, he can send out a blast to everyone from the Send Word Now Web site. Send Word Now includes a feature for confirming receipt of the communication.
Testing, Review and Assessment
Once the plan has been drawn up, it must be tested under conditions that simulate a real disaster. And even after a company has tested its disaster recovery plan, the work is not done: Organizations change; the economy changes; risks change. All these factors must be re-evaluated regularly.
"Any time there's a significant change in a business, the risk officer needs to go back and look at the profile. You should look at the plan every year," said Stephanie Balaouras, an analyst at research firm Forrester Research Inc. Risk management software packages, she noted, will alert users when disaster recovery plans need to be updated and tested. This automation is beneficial, because plans created manually run the risk of being put on a shelf and forgotten.
In Rosen's case, his telework plan proved a lifesaver. But his recent flood has caused him to re-assess his organization's remote disaster recovery facility. At 40 miles distant, it could create a commuting hardship for some staffers should the facility have to be used for an extended period.
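The review rule Balaouras recommends, looking at the plan every year and again whenever the business changes significantly, is simple enough to automate without a commercial package. The Python sketch below is one hypothetical way to flag a plan as due for review; the 365-day interval stands in for "every year," and what counts as a significant change is left to the risk officer.

```python
from datetime import date, timedelta

# "Every year" expressed as a fixed interval; an assumption for simplicity.
REVIEW_INTERVAL = timedelta(days=365)


def review_due(last_reviewed: date, today: date, significant_change: bool = False) -> bool:
    """True if the plan should be re-examined now."""
    return significant_change or (today - last_reviewed) >= REVIEW_INTERVAL


print(review_due(date(2023, 1, 1), date(2024, 1, 1)))
print(review_due(date(2024, 1, 1), date(2024, 6, 1)))
print(review_due(date(2024, 1, 1), date(2024, 6, 1), significant_change=True))
```

An event like Rosen's flood would be passed in as a significant change, which triggers a review no matter how recently the plan was last examined.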
The re-assessment process has prompted many companies to consider alternatives for disaster recovery, including new cloud computing-based services, which are far less expensive than a redundant data center.
What's more, Stevenson noted, the cost of providing zero downtime, once prohibitively expensive, has become affordable enough for many companies to give it a second look, thanks to the plummeting cost of computing equipment and to global IT providers that operate in the middle of the night and halfway across the globe. "People in Singapore are logging on when U.S. customers are asleep. The cost of zero [downtime] is not as high as we might imagine anymore," Stevenson said.
With those new alternatives on the horizon, things look a bit brighter for IT professionals who think of, and plan for, the worst.
Stan Gibson is a Boston-based technology journalist. Write to him at firstname.lastname@example.org.