By Douglas Lantigua, Principal at MUSA Technology Partners
Effectively solving the problem of corporate Disaster Recovery and Business Continuity (DR/BC) starts with proper planning and networking. Whether a company owns only a few servers or a complete datacenter, it will need a failover location and a plan known as the Run Book. The failover location can be a colocation facility, another business location, a service provider, or some hybrid of these. The business needs to address key questions such as:
- How much data can we afford to lose in the event of a system failure? The amount of acceptable data loss, measured in either data or time, will help direct the DR/BC solution.
- What is our Recovery Point Objective (RPO)? The RPO is the acceptable level of data loss measured in time (e.g. 5 minutes or 4 hours). The RPO goes hand in hand with the Recovery Time Objective (RTO), which is the maximum acceptable time to get critical systems back into a functional state.
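To make RPO and RTO concrete, the Python sketch below checks one system against both objectives. It rests on a simplified model that is an assumption, not something this article prescribes: changes are copied to the failover site at a fixed interval, each copy takes a known time to finish, and the Run Book's recovery steps run one after another. The class and function names and all of the sample figures are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProtectedSystem:
    """Hypothetical description of one system covered by the Run Book."""
    name: str
    replication_interval_min: float  # how often changes are copied to the failover site
    replication_lag_min: float       # time for one copy to finish (assumed shorter than the interval)
    recovery_steps_min: list[float] = field(default_factory=list)  # Run Book steps, run in sequence


def worst_case_data_loss_min(s: ProtectedSystem) -> float:
    # Simplified model: if a failure strikes just before a copy finishes, the newest
    # complete copy at the failover site is one full interval plus one transfer old.
    return s.replication_interval_min + s.replication_lag_min


def estimated_recovery_time_min(s: ProtectedSystem) -> float:
    # Steps are assumed to run one after another, so their durations add up.
    return sum(s.recovery_steps_min)


def meets_objectives(s: ProtectedSystem, rpo_min: float, rto_min: float) -> tuple[bool, bool]:
    """Return (meets_rpo, meets_rto) for objectives given in minutes."""
    return (worst_case_data_loss_min(s) <= rpo_min,
            estimated_recovery_time_min(s) <= rto_min)


# Made-up example: copies every 15 min that take 2 min to transfer,
# and three recovery steps of 10, 5 and 20 minutes.
erp = ProtectedSystem("erp", 15, 2, [10, 5, 20])
print(meets_objectives(erp, rpo_min=5, rto_min=240))   # (False, True)
print(meets_objectives(erp, rpo_min=30, rto_min=240))  # (True, True)
```

In this example a 5 minute RPO fails because the worst case loses 17 minutes of changes, while a 4 hour (240 minute) RTO is comfortably met by 35 minutes of recovery work.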
How does networking help meet the RTO and RPO objectives? Larger companies can take advantage of Virtual Private LAN Service (VPLS) to extend the datacenter network to another physical location. Unlike an IP-based (layer 3) Multiprotocol Label Switching (MPLS) VPN, VPLS operates at a lower level of the network, even though VPLS itself is commonly delivered over an MPLS backbone. Users and computers usually reach systems by name, which DNS translates into an IP address, so any change to that IP address affects everyone who connects to the system. A layer 3 MPLS VPN routes at the IP level, so geographically dispersed locations connected this way need different networks (IP address ranges). With such a network, even if you fail a server over to another location using the latest application technology, you will still need to change the server's IP address, and probably dozens of other network settings, to bring the system back into an operational state. The failover procedure and plan are compiled and kept up to date in the Run Book.
The key advantage of VPLS is that it works below the IP address, at the machine address (MAC address) level. This makes an IP address portable anywhere on the extended network, so a system can fail over to another geographic location, keep its IP address, and remain reachable by users and computers alike. VPLS can be expensive, but there are alternatives for companies on a budget and for those that do not need the large bandwidth commitments most VPLS providers mandate. Companies with smaller requirements can use IP tunneling (carrying layer 2 frames inside IP packets) and/or channeling to achieve the same goals. By extending the network across geographically dispersed locations at the machine address level (layer 2), you allow server IP addresses to move freely. Combined with current virtualization and storage replication technology, this makes an aggressive RTO and RPO far more affordable.
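The difference between a routed (layer 3) design and a stretched layer 2 design can be illustrated with Python's standard `ipaddress` module. This is a minimal sketch under assumed addressing: the subnets are hypothetical, and the check only asks whether a server's current address falls inside the subnet available at the failover site.

```python
import ipaddress

# Hypothetical addressing plan, for illustration only.
PRIMARY_SUBNET = ipaddress.ip_network("10.10.0.0/24")
# Routed (layer 3) design: the failover site has its own, different address range.
DR_SUBNET_ROUTED = ipaddress.ip_network("10.20.0.0/24")
# Stretched layer 2 design (VPLS or a layer 2 tunnel): both sites share one range.
DR_SUBNET_STRETCHED = PRIMARY_SUBNET


def keeps_ip_after_failover(server_ip: str, dr_subnet: ipaddress.IPv4Network) -> bool:
    """True if the server's current address is valid on the failover site's subnet."""
    return ipaddress.ip_address(server_ip) in dr_subnet


server_ip = "10.10.0.25"
print(keeps_ip_after_failover(server_ip, DR_SUBNET_ROUTED))     # False: must be re-addressed
print(keeps_ip_after_failover(server_ip, DR_SUBNET_STRETCHED))  # True: address moves with it
```

In the routed case every `False` result represents a server whose address, and everything that depends on it, has to change during the failover.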
The Run Book is the set of instructions and procedures for handling DR/BC scenarios. In failovers like those described above, where a system's IP address must change, the dependency on that address can be far reaching. Not only would the server need a new IP address, but the name-to-IP (DNS) records, connections to data sources, internal application settings and, finally, the end-user network path to the server or service, which could include dozens of pieces of network gear, would all need to be updated. Such systems take days, weeks or even months to set up when they are first deployed, so an emergency change to even a single system under a tight deadline can be difficult, even with perfect preparation. Then assume people get busy and the Run Book is not updated when changes occur. The Run Book then becomes a massive paperweight and a budget nightmare to maintain effectively.
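A Run Book can only stay current if someone knows everywhere an address is recorded. As a rough illustration, assuming a hypothetical inventory that maps each configuration item to the text holding its value, the sketch below lists every item that references a server's IP address and would need changing in a re-addressing failover. The inventory format and entries are invented for the example; a real environment would need to gather this from DNS, application settings and network devices.

```python
from __future__ import annotations

import re

# Hypothetical inventory: configuration item -> text that holds its value.
inventory = {
    "dns:erp.corp.example": "A 10.10.0.25",
    "app:reporting/db_host": "10.10.0.25",
    "firewall:rule-114": "allow tcp any -> 10.10.0.25:1433",
    "loadbalancer:pool-erp": "10.10.0.25:443",
    "app:hr/db_host": "10.10.0.40",
}


def items_referencing(inventory: dict[str, str], ip: str) -> list[str]:
    """List inventory items whose value mentions exactly this IPv4 address."""
    # The lookarounds stop "10.10.0.2" from matching inside "10.10.0.25".
    pattern = re.compile(rf"(?<![\d.]){re.escape(ip)}(?!\d)")
    return sorted(key for key, value in inventory.items() if pattern.search(value))


for item in items_referencing(inventory, "10.10.0.25"):
    print("update:", item)
```

Even this toy inventory turns one server move into four separate changes across DNS, an application, a firewall and a load balancer.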
Leveraging a geographically dispersed layer 2 network, through either VPLS or IP tunneling/channeling, shrinks the DR/BC Run Book, lets staff concentrate on fixing the original problem and frees engineers to solve unforeseen issues. Any failover involving IP address changes is fraught with time-consuming work before missing critical systems are back online. Industries with heavy compliance requirements especially need simple solutions to meet regulatory standards. The network foundation does require an upfront investment in setup and enough bandwidth for failover. Managers must routinely check that enough bandwidth is available for a catastrophic failover of critical systems. If key users will need to do their jobs from outside your network for a prolonged period, consider providing secondary access points into the failover location. Routine failover testing should be part of the IT/IS department's standard operating procedure (SOP). The network is only one part of the overall picture, but a flexible, geographically dispersed network gives system and application failover tools fertile ground to work their magic with the fewest complications.
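A routine bandwidth check can start as simple arithmetic. This sketch assumes hypothetical per-system bandwidth estimates (in Mbps) for running critical systems from the failover site, plus a headroom margin chosen only for illustration, and compares total demand with the usable capacity of the link.

```python
from __future__ import annotations

# Hypothetical bandwidth each critical system needs (Mbps) when run from the failover site.
critical_systems_mbps = {"erp": 120, "email": 80, "file_shares": 200, "crm": 150}


def failover_capacity_check(demand_mbps: dict[str, float], link_mbps: float,
                            headroom: float = 0.2) -> tuple[float, float, bool]:
    """Return (required, usable, ok) for a failover link.

    headroom is the fraction of link capacity held back for overhead and bursts;
    0.2 is an assumed policy for this example, not a standard value.
    """
    required = sum(demand_mbps.values())
    usable = link_mbps * (1 - headroom)
    return required, usable, required <= usable


for link in (1000, 500):
    required, usable, ok = failover_capacity_check(critical_systems_mbps, link)
    print(f"{link} Mbps link: need {required} Mbps, usable {usable:.0f} Mbps, ok={ok}")
```

Rerunning a check like this whenever a system is added to the critical list is one way to keep the bandwidth review routine rather than a one-time exercise.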