Disaster Recovery Horrors

Disaster Recovery: stories of it being done wrong.

They said it wouldn’t happen to them. Or couldn’t. That they were prepared. That they had taken steps.

They were wrong.

Computer systems are, with the possible exception of phone systems, the ‘soft underbelly’ of most small businesses. You would be amazed at how a fifty-person hedge fund that manages over a billion dollars in client assets is loath to spend $25,000 on disaster preparedness. Such a trifling sum to guarantee their business in the event of a disaster! And yet I see this miserly conduct happen over and over again.

As an Information Security professional for the past 25 years, with over fifteen years spent at large enterprises such as Merrill Lynch and Ernst and Young, and ten years as an independent business continuity/high availability infrastructure consultant for small and mid-size businesses, I’ve seen a lot of solutions that worked, and a lot that didn’t. There is a wide range of options between the mirrored trading floors that the big brokerage houses maintain and the cheap, flimsy USB hard drive or ancient DLT backup tapes that, sadly, are all that passes for a ‘business continuity’ solution at many firms. And most good solutions don’t need to cost much.

Over the past two years, I’ve seen six companies that I either consulted with, or was in talks to consult with, go out of business or be forced to lay off more than 50% of their staff because of bad planning and bad luck. In all but one of these cases, they could have avoided such catastrophic losses through simple, yet too often overlooked, precautions.

The first case that springs to mind is that of a 10-year-old investment firm located in Manhattan at 41st Street and Lexington Avenue. Two years ago, they (or their investors) decided to investigate a business continuity strategy. They had had a few bad experiences with 9/11 and the NYC blackout, and didn’t want to get caught short again. So I spent an hour discussing their DR/BC planning with them, only to find that not only did they have no plan, but the one thing they did have, backup tapes, was not being taken off site. When I asked why, they said that the secretary tasked with this duty often ‘forgot’, but that it wasn’t a big deal. Their attitude was that so long as the tapes existed (never mind that they had never tested any of them, nor even checked whether their backups actually finished) they could ‘somehow’ recover from a disaster. When I probed deeper and found that their actual tolerance for complete system downtime was less than 24 hours, I realized that these guys needed some help.

Well, they decided to put action ‘on hold.’ Too busy with other projects, they said. Their shortsightedness would cost them big when, less than three months later, a water main burst in the street outside their building (those living in New York must remember this one: Lexington and the west side of 3rd Avenue were closed to cars from 39th Street to 41st for over two weeks). Their safety net, their tapes, was trapped in a building now ruled completely unsafe for entry by Con Ed and the fire department. For 6 days, this company had no access to data, no servers, no receivables, no plan. Needless to say, they suffered, and quite badly. The firm is no longer in business. All for want of simply taking a tape offsite, and having some idea of what to do with it when disaster struck.
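The untested tapes in this story point at a check that costs almost nothing: restore a backup somewhere and compare it with the live data. Here is a minimal sketch of that idea in Python, assuming the backup has already been restored to an ordinary directory. The function names `sha256_of` and `verify_restore` are hypothetical, invented for this illustration, and are not part of any backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Compare every file under source_dir with its restored copy.

    Returns a list of problems (missing or differing files).
    An empty list means every source file was restored intact.
    """
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue  # skip directories
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

Run on a schedule against a test restore, a non-empty result is the signal that the backups cannot be trusted. On a live system, files that changed after the backup ran will also show up as differing, so a real check would compare against a snapshot taken at backup time.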

Another sad case comes to mind: this time, a twenty-five-year-old professional services firm. They did indeed have a rudimentary disaster plan, complete with offsite tape storage and mirrored servers at the president’s house. However, when I met with them, it turned out that they had never actually run a full-scale test of the system; it had been tested by their network administrator in his ‘lab’, and the machines (which were running Doubletake for Solomon, Exchange, File/Print, and BES) were then shipped to the owner’s house and left in his basement, connected to his inexpensive Linksys ‘CompUSA special’ hub.

I advised a full-on test of the system, and recommended relocating the servers to a hosted, generator-backed facility in a half rack (for about $700/month). I cringed when I heard that the owner’s home was up in the country, had never been visited by their IT admin, and was subject to occasional power outages. I also recommended a full test of their backup media and the creation of a detailed recovery plan (their ‘plan’ was to have their 10 or so key employees remote in to their CEO’s home servers).

Well, the company decided not to spend the money ‘at present’ and chose to stay with their existing solution, which their NetAdmin arrogantly told me was ‘only for the suits’ peace of mind anyway’. How right he was! This was clearly demonstrated when, about six months later, a water pipe broke on the floor directly above their server room and quickly flooded down into it (floods are the #1 cause of disasters I have seen over the past 5 years), taking out their entire server rack, their phone system, their UPS system, and the room AC.

They frantically kicked their ‘Disaster Recovery’ plan into operation, only to find that in the two years since they had set it up, the CEO had changed internet providers and the static IP addresses they had configured were no longer valid. So their plan was DOA: no one was remoting in anywhere. And since the IPs at his house had changed, Doubletake had not completed a successful replication in over ten months (the network admin later told me he was counting on ‘alerts’ to tell him if replication failed; only, they were never set up). Furthermore, the employee who took their backup tapes home every day had called in sick that day, so there was a half-day delay in getting the tapes from her. And finally, when the tapes were sent up to his house in Westchester County (by this time, two full days later, they had gotten new static IP addresses), the tape drive at his home was dead.

So they wasted another day getting new tapes and patching the servers up to present spec. Five full days after the flood, they finally had data flowing again (of course, their terminal services weren’t set up properly, the firewall at his home wasn’t configured right, and the slow upload speeds couldn’t handle more than about 3 concurrent users of Solomon). This comedy of errors didn’t end with the company going out of business, but it did end painfully for them, with almost 30% of their workforce gone since I was last there (they did eventually contract out for my services, though!).
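The ten months of silent replication failure in this story is the kind of gap a simple staleness check closes. The sketch below is a minimal illustration, not a feature of Doubletake or any other product. The function name `replication_alert` and the 24-hour default threshold are assumptions made for the example, and how you obtain the time of the last successful replication depends on your replication software and is not shown.

```python
from datetime import datetime, timedelta
from typing import Optional


def replication_alert(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed threshold; tune to your tolerance
) -> Optional[str]:
    """Return an alert message if replication looks stale, else None.

    Staleness is judged by the age of the last success, so a job that
    has silently stopped running still triggers an alert.
    """
    if last_success is None:
        return "ALERT: replication has never completed successfully"
    age = now - last_success
    if age > max_age:
        return f"ALERT: last successful replication was {age.days} days ago"
    return None
```

The design choice that matters is alerting on the absence of a recent success rather than waiting for a failure report. A replication job that cannot reach its target may never report anything at all, which is exactly what happened here.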

The moral of this story is simple – disaster recovery and business continuity can be done – and should be done – by all businesses that need their computer systems to conduct business.

The LCO Group (http://www.thelcogroup.com) is a premier, high-end consulting firm in New York City that offers a wide array of IT consulting services, from desktop and network support to business continuity/disaster recovery planning and security/risk exposure assessment.

At The LCO Group, we do Technology right.