Saturday, July 16, 2011

Four lessons in IT disaster recovery planning from an FAA outage

What can CIOs learn about IT disaster recovery planning from the U.S. Federal Aviation Administration's (FAA) recent computer problems, which caused flight delays and cancellations at airports across the country? Plenty, say disaster recovery experts.

"Here we have a system that is vital to the flow of air traffic in the United States. It is hard to imagine how many dollars are riding on people getting to their destinations on time," said Gene Ruth, who covers disaster recovery (DR) at Midvale, Utah-based Burton Group Inc. "You have a failure in the network and there is no ability to [set] up a disaster recovery site immediately? That is completely unacceptable."

The root cause of the FAA outage, which lasted nearly five hours, was reportedly the failure of a circuit board inside a router at the FAA Telecommunications Infrastructure (FTI) facility in Salt Lake City. Details on why the backup router did not engage are still unavailable. The failure brought down a flight management system, forcing air traffic controllers to rely on faxes and emails to communicate flight plans.

The FAA attributed the outage to a software configuration problem, suggesting the single-component failure was compounded by a configuration management failure.

But the details of the incident hardly matter, DR experts said, compared with the IT disaster recovery planning lessons it holds. As CIOs make their annual pitch for IT DR funding -- a hard sell in any economy -- Ruth and others advised they keep the following four points in mind:

1. Equipment failure is the No. 1 reason for disaster recovery declarations.

Most IT disasters have nothing to do with the kind of catastrophe that wipes out an entire facility, yet that is the scenario many organizations plan around in their IT disaster recovery planning. "This is a message I drive home to clients, especially when they are trying to justify DR to senior management," said analyst John Morency, a certified information systems auditor and research director at Stamford, Conn.-based Gartner Inc.

A recently published study from DR provider SunGard Availability Services LP showed that of the 2,250 disaster events SunGard handled in 2008, hardware failure accounted for 500 of them. That was well ahead of the second- and third-leading causes, hurricane and weather events (275) and power outages (213).

2. Equipment malfunctions compounded by change or configuration management failures are a double whammy.

"When you look at equipment malfunction, it is more than just hardware failing. Sometimes you have misapplied a change," Morency said. "It may be entirely possible that although the circuit board in the primary router went down, the [protocol] backup may not have been configured correctly, so it never took over." This indeed seems to be the case in the FAA incident.

"There has to be a lot stiffer penalties for production changes, be it for configuration or data, that are not rigorously tested prior to being introduced into production," Morency said.

Any upgrade or alteration of an existing system needs to be accompanied by an impact statement on the business continuity or DR plan, Burton Group's Ruth agreed. "Perhaps this FAA incident will turn out to be somebody making a change they thought was innocent -- fiddling with a database -- that brought the system down. But the lesson here is you don't allow technicians to go and make changes without including project management-like people to make sure there is an assessment of an impact on the operations of the business."
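One lightweight way to enforce that discipline is a gate in the change process that blocks anything lacking a test sign-off, an approver and a DR impact statement. The following Python sketch is a generic illustration of the idea; the ChangeRequest fields are assumptions made for this example, not the schema of any particular change-management product.

    from dataclasses import dataclass, field

    @dataclass
    class ChangeRequest:
        summary: str
        tested_outside_production: bool = False
        dr_impact_statement: str = ""      # who assessed the DR/BC impact and what they found
        approvers: list[str] = field(default_factory=list)

    def change_gate(change: ChangeRequest) -> list[str]:
        """Return reasons to block the change; an empty list means it may proceed."""
        blockers = []
        if not change.tested_outside_production:
            blockers.append("change has not been exercised outside production")
        if not change.dr_impact_statement.strip():
            blockers.append("no DR/business-continuity impact statement attached")
        if not change.approvers:
            blockers.append("no change-management approver recorded")
        return blockers

    if __name__ == "__main__":
        request = ChangeRequest(summary="update flight-plan database schema")
        for reason in change_gate(request):
            print("BLOCKED:", reason)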

3. Testing for capacity is critical in IT disaster recovery planning.

"It sounds like one of the problems [in the FAA outage] is that the site that was left standing did not have the capacity to run the application. And that is startling, if folks had not put the analysis into whether the remaining site could handle the load," Ruth said.

Testing capacity and performance is "basic blocking and tackling," Ruth said. "You have to know you can deliver the service at some minimal level to keep you limping along and hopefully not, as in this case, stop air traffic for a third of the country."
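The arithmetic behind that analysis is simple enough to automate as part of every capacity review. The hypothetical sketch below compares measured per-site throughput against a degraded-mode service target; the site names and figures are invented for illustration.

    def surviving_capacity_ok(site_capacity_tps: dict[str, float],
                              failed_site: str,
                              minimum_required_tps: float) -> bool:
        """True if the sites left standing can carry at least the minimum required load."""
        remaining = sum(tps for site, tps in site_capacity_tps.items()
                        if site != failed_site)
        return remaining >= minimum_required_tps

    if __name__ == "__main__":
        # Measured throughput per site in transactions per second (illustrative figures).
        measured = {"site-a": 1200.0, "site-b": 700.0}
        # Degraded-mode target: the minimum rate needed to keep the service limping along.
        if not surviving_capacity_ok(measured, failed_site="site-a",
                                     minimum_required_tps=900.0):
            print("WARNING: remaining site(s) cannot meet the degraded-mode target")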

4. But foolproof testing is sometimes impossible.

"In organizations where you have [a] merger and acquisition, where you have new production apps going through turnover, the scope of what needs to be tested keeps getting bigger and bigger. All of a sudden, the resources one would need, in terms of facilities, of support staff and business unit staff to perform those tests, also gets bigger," Morency said.

Even if the organization follows testing best practices, the amount of change in the data center can put it at risk, according to Morency. "The configuration that needs to be recovered may only require minor changes. But there could be major differences," he said, "which is why a lot more organizations are asking the failover question versus the manual recovery question."
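Answering the failover question honestly means timing it. The sketch below is a hypothetical drill harness that measures an automated failover against a recovery time objective (RTO); the trigger_failover callable and the simulated two-second switchover are stand-ins for whatever mechanism an organization actually uses.

    import time
    from typing import Callable

    def run_failover_drill(trigger_failover: Callable[[], bool],
                           rto_seconds: float) -> dict:
        """Run one drill and report success, elapsed time and RTO compliance."""
        start = time.monotonic()
        succeeded = trigger_failover()      # True if the standby took over cleanly
        elapsed = time.monotonic() - start
        return {"succeeded": succeeded,
                "elapsed_seconds": round(elapsed, 1),
                "within_rto": succeeded and elapsed <= rto_seconds}

    def simulated_failover() -> bool:
        """Stand-in for a real failover trigger; sleeps to mimic switchover time."""
        time.sleep(2)
        return True

    if __name__ == "__main__":
        print(run_failover_drill(simulated_failover, rto_seconds=300))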

Let us know what you think about the story; email: Linda Tucci, Senior News Writer.
