Disaster recovery and business continuity

IT manages disaster recovery, but business owners own business continuity.

In business, uptime is everything. If your systems are down, you can't process customer orders or respond to client requests. For as long as your servers are offline, you are dead in the water.

For example, consider a simple online business that supports websites, using a single web server, a single database server, and a single storage frame. While this configuration is too simple for a modern business, we can use it as a model to understand how organizations can plan for and respond to outages.

Disaster recovery planning

One way to prepare for outages is disaster recovery planning. In planning for disasters, organizations build in redundancy that can be brought online if the primary site becomes unavailable. This redundancy comes with a hefty price tag. Double the servers means double the price: a second web server, a second database server, and a second storage frame, all located at a second data center facility.

Most of the time (and hopefully all of the time) the secondary systems remain idle. The redundant servers don't need to be brought online unless the primary systems experience a major outage, so for most of the hardware's lifecycle, the secondary systems will remain unused. Organizations can get more value from the secondary systems by using them in some other capacity. For example, an IT organization might move "production" to the secondary servers when the primary systems need regular maintenance, such as operating system patches, application software updates, or other scheduled work. But this makes only limited use of the spare hardware.

Instead, organizations might opt to use the secondary systems for a "test" environment. After all, if the spare systems are meant to be a mirror of the production servers, these secondary systems would be a perfect host to test new software releases. If the new software works on the "test" system, it should also work on the "production" systems. Utilizing the spare systems in this way means they will get used more frequently, which means the redundant systems will remain more closely mirrored to the primary systems.

During an emergency that causes an unexpected outage on the main business systems, the IT team can quickly roll over "production" to run from the "test" systems. The time required to do this depends somewhat on the complexity of the configuration, but mostly on the TTL (time to live) values of network entries such as DNS records: client systems will keep using the cached IP address of the old "production" systems until the TTL expires, and only then start to access the new "production" systems at the secondary data center. Network engineering can minimize this time, or eliminate it entirely through clever network routing rules.
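To see why the TTL dominates the rollover time, here is a minimal sketch in Python. It assumes the network entries in question are DNS records and that the total delay is simply detection time plus the time to update the record plus one full TTL. The function name and all numbers are hypothetical, chosen only for illustration.

```python
def worst_case_cutover_seconds(detect_s: int, dns_update_s: int, ttl_s: int) -> int:
    """Upper bound on how long a client may keep using the old address.

    A client that cached the old record just before the change keeps it
    for a full TTL, so in the worst case the three delays add up.
    """
    return detect_s + dns_update_s + ttl_s


# Hypothetical numbers: 2 minutes to detect the outage, 1 minute to
# publish the new record, and three different TTL settings.
for ttl in (86400, 3600, 300):
    total = worst_case_cutover_seconds(detect_s=120, dns_update_s=60, ttl_s=ttl)
    print(f"TTL {ttl:>6}s -> clients may lag up to {total / 60:.0f} minutes")
```

Under these assumptions, lowering the TTL ahead of time is the cheapest way to shorten the rollover, at the cost of more frequent lookups during normal operation.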

Another way to provide failure protection is to build the redundancy directly into the design of the system. In the example of a business with a web server, database, and storage frame, that would require running two web servers at different locations at the same time, and balancing traffic between them using technology such as a load balancer or other network-level routing. A new visitor to the website gets routed to either server "A" or "B" depending on load; if server "A" suddenly becomes unavailable, the network sends all visitor traffic to server "B" instead. A database system and storage system can likewise be configured to run in a "dual" mode. This is a simpler proposition when the data is read-only, but it is possible for updates too.
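The routing behavior described here can be sketched as a toy load balancer in Python. This is an illustration under stated assumptions, not how any particular product works: it uses simple round-robin rotation instead of real load measurement, and the class and method names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: rotates across servers, skipping unhealthy ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        """Record that a health check failed for this server."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Record that a known server has recovered."""
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        """Return the next healthy server, or raise if none are left."""
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["A", "B"])
print([lb.route() for _ in range(4)])  # alternates between A and B
lb.mark_down("A")                      # simulate server "A" failing
print([lb.route() for _ in range(4)])  # all traffic now goes to B
```

In a real deployment the health state would come from periodic checks against each server, and the balancer itself would also need to be redundant so that it does not become the new single point of failure.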

These are examples of designs that provide near-zero downtime. Most modern businesses rely on architectures like this, where the infrastructure can detect failed components on its own and reroute traffic automatically.

Whether the organization uses primary and secondary systems, or uses dual-production systems, the IT team holds responsibility for ensuring server uptime and responding to hardware and software issues as they arise.

Business continuity planning

Outside the scope of the IT team is business continuity planning. This is the process by which an organization can continue to operate its business in the face of IT failures. And with IT risks such as ransomware and hacking on the rise, the IT team already has a full plate.

But unless the organization provides a technology service, IT is likely not the business. Instead, IT exists to support the business. As a result, IT cannot speak for how the business will maintain continuity during an outage.

The business continuity "plan" cannot be to get in front of the CIO and demand that certain systems be brought online immediately or moved to the front of the recovery queue. It takes time to restore everything after a total loss; systems can copy data from backup storage systems or tape only so quickly.

How the organization continues to maintain some level of business operation during a system outage is the responsibility of the business owner. This business continuity can take various forms, from routing paper documents to using "low-tech" solutions such as spreadsheets to track customers during the outage. After the IT systems are restored to normal function and become available again, the business must also re-enter the offline data into the applications.

Business owners need to work closely with IT to ensure that they understand the metrics surrounding the disaster recovery process. Most business owners will be concerned about how long it will take to restore data from backup and bring systems back online (RTO, or recovery time objective) and how "old" the data used to recover systems and applications will be (RPO, or recovery point objective).
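As a rough illustration of how these two metrics can be checked against a plan, the Python sketch below estimates a restore time from the amount of data and the backup system's throughput, and measures the age of the last backup at the moment of failure. The function names, targets, data sizes, and dates are all hypothetical; real restores involve many more steps than a single copy.

```python
from datetime import datetime, timedelta


def estimated_restore_time(data_gb: float, restore_gb_per_hour: float,
                           rebuild_hours: float = 0.0) -> timedelta:
    """Time to copy data back from backup plus any fixed rebuild work."""
    return timedelta(hours=data_gb / restore_gb_per_hour + rebuild_hours)


def data_loss_window(last_backup: datetime, outage_start: datetime) -> timedelta:
    """Age of the newest recoverable data at the moment of failure."""
    return outage_start - last_backup


# Hypothetical targets agreed between the business owner and IT.
rto_target = timedelta(hours=8)
rpo_target = timedelta(hours=24)

# Hypothetical outage: 2,000 GB to restore at 400 GB/hour, plus 2 hours
# of server rebuild work; last backup at 01:00, failure at 15:30.
restore = estimated_restore_time(data_gb=2000, restore_gb_per_hour=400, rebuild_hours=2)
loss = data_loss_window(datetime(2024, 1, 10, 1, 0), datetime(2024, 1, 10, 15, 30))

print(f"Estimated restore time: {restore} (meets RTO: {restore <= rto_target})")
print(f"Data loss window:       {loss} (meets RPO: {loss <= rpo_target})")
```

Numbers like these give the business owner something concrete to plan around: the restore estimate says how long manual procedures must last, and the data loss window says how much work will need to be re-entered afterward.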

What is your business continuity?

Take this opportunity to review processes and procedures in your own organization. What does your Disaster Recovery plan look like? Do you have one? Does everyone in the IT organization know how to bring systems back online in the face of failure?

At the same time, what does your Business Continuity plan look like? This needs to come from your business units, which must plan how they will keep operating if the technology becomes unavailable. Every business unit needs to create and manage its own plan for how it will continue to do business in the sudden absence of technology.