High Availability & Disaster Recovery

Slide 1 of 8 | N10-009 Obj 3.3 | HA/DR

How long can you afford to be down? The answer drives every decision here.

Friday at 2 AM, the primary database server fails. The CEO calls Monday morning: "How much data did we lose? How long were we down?" Your answers depend on whether you planned for this moment or just hoped it would not happen.

8 Slides | N10-009 Obj 3.3 | Availability + Recovery | Concepts + Context

Slide 2 of 8

The Four DR Metrics

These numbers define your recovery strategy. Know them for the exam and for real life.

RPO

Recovery Point Objective -- maximum acceptable data loss measured in time. "We can lose up to 4 hours of data."

RTO

Recovery Time Objective -- maximum acceptable downtime. "We must be back online within 2 hours."

MTTR

Mean Time To Repair -- average time to fix a failed component and restore service.

MTBF

Mean Time Between Failures -- average time a system runs before failing. Higher = more reliable.
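
MTBF and MTTR are commonly combined into a steady-state availability figure, MTBF / (MTBF + MTTR). The sketch below shows that arithmetic. The function names and the sample figures (2,000 hours MTBF, 4 hours MTTR) are illustrative assumptions, not values from this deck.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime per year implied by an availability fraction."""
    return (1.0 - avail) * hours_per_year


if __name__ == "__main__":
    # Hypothetical component: fails every 2,000 hours on average, 4 hours to repair.
    a = availability(mtbf_hours=2000, mttr_hours=4)
    print(f"availability: {a:.4%}")
    print(f"expected downtime per year: {annual_downtime_hours(a):.1f} hours")
```

Raising MTBF or lowering MTTR both improve the result, which is why redundancy and practiced repair procedures both appear in HA planning.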

How They Relate

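RPO answers "how much data can we lose?" RTO answers "how long can we be down?" Both are targets you set. MTTR and MTBF are measurements: MTTR shows how fast your team actually restores service, and MTBF shows how often failures occur on average. MTBF is a statistical average, not a prediction of when the next failure will happen. Lower RPO/RTO targets cost more money -- hot sites, continuous replication, and redundancy at every layer.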

The CEO asks: "How much did we lose?" If your last backup was 4 hours ago and your RPO is 1 hour, you failed your RPO target by 3 hours. That is a process failure, not a technology failure.
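
The RPO arithmetic above can be sketched in a few lines. The function names and the calendar date are hypothetical. Only the 4-hour backup age and the 1-hour RPO come from the example.

```python
from datetime import datetime, timedelta


def rpo_miss(last_backup: datetime, failure: datetime, rpo: timedelta) -> timedelta:
    """How far the actual data loss exceeds the RPO (zero if within target)."""
    data_loss = failure - last_backup
    return max(data_loss - rpo, timedelta(0))


def rto_miss(failure: datetime, restored: datetime, rto: timedelta) -> timedelta:
    """How far the actual downtime exceeds the RTO (zero if within target)."""
    downtime = restored - failure
    return max(downtime - rto, timedelta(0))


if __name__ == "__main__":
    failure = datetime(2025, 1, 3, 2, 0)          # a Friday at 2 AM (date is hypothetical)
    last_backup = failure - timedelta(hours=4)    # last backup was 4 hours before the failure
    print(rpo_miss(last_backup, failure, rpo=timedelta(hours=1)))  # 3:00:00
```

A result of zero means the target was met. Anything larger is the size of the miss.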

Slide 3 of 8

DR Sites: Hot, Warm, Cold

Three tiers of disaster recovery readiness. Cost increases with readiness.

Hot Site

Fully operational mirror of production. Real-time data replication. Failover in minutes. Highest cost. Used when RTO must be near-zero (financial services, healthcare).

Warm Site

Hardware is installed and partially configured. Data is replicated periodically (not real-time). Failover in hours to a day. Moderate cost. Good balance for most organizations.

Cold Site

Empty facility with power, cooling, and connectivity. No hardware pre-installed. Failover takes days to weeks. Lowest cost. Acceptable only when extended downtime is tolerable.

Cloud DR

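Cloud providers have made near-hot-site DR far more accessible. Instead of leasing a second data center, you can replicate to another region. Compute capacity is available on demand, so in a standby design you pay mainly for storage and replication until you fail over. A true hot site in the cloud still means running (and paying for) compute continuously.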
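
One way to reason about tier selection is to pick the cheapest tier whose worst-case failover time still meets the RTO. The sketch below assumes planning figures loosely based on the ranges on this slide (30 minutes, 24 hours, 14 days) and a relative cost rank. All names and numbers are illustrative and should be replaced with figures from your own tests.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SiteTier:
    name: str
    worst_case_failover_hours: float  # assumed planning figure, not a standard
    relative_cost: int                # 1 = cheapest


# Illustrative values only.
TIERS = (
    SiteTier("cold", worst_case_failover_hours=14 * 24, relative_cost=1),
    SiteTier("warm", worst_case_failover_hours=24, relative_cost=2),
    SiteTier("hot", worst_case_failover_hours=0.5, relative_cost=3),
)


def cheapest_tier_meeting_rto(
    rto_hours: float, tiers: Sequence[SiteTier] = TIERS
) -> Optional[SiteTier]:
    """Return the lowest-cost tier that can fail over within the RTO, or None."""
    candidates = [t for t in tiers if t.worst_case_failover_hours <= rto_hours]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.relative_cost)


if __name__ == "__main__":
    for rto in (2, 48, 24 * 30):
        tier = cheapest_tier_meeting_rto(rto)
        print(rto, "hours ->", tier.name if tier else "no tier meets this RTO")
```

Returning `None` for an RTO that no tier can meet makes the gap explicit, rather than silently choosing the fastest option.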

Slide 4 of 8

Active-Active vs Active-Passive

Two architectures for keeping services running when hardware fails.

Active-Active

Both nodes handle traffic simultaneously. A load balancer distributes requests across all active members. If one fails, the other absorbs the full load -- no failover delay, provided each node is sized to carry the full load alone. Better resource utilization. Requires that both nodes stay synchronized (session state, data).

Active-Passive

One node handles all traffic. The standby monitors and waits. When the primary fails, the passive node takes over (failover). Simpler to implement. The standby is idle until needed -- wasted capacity during normal operations. Failover may take seconds to minutes depending on detection and promotion time.

Exam Tip

If the question mentions "both nodes handling traffic" or "load distribution" -- active-active. If it mentions "standby" or "failover" -- active-passive. Active-active provides higher availability but requires more complex state synchronization.

First Hop Redundancy: First Hop Redundancy Protocols (FHRPs) such as HSRP, VRRP, and GLBP provide default-gateway failover at Layer 3. HSRP and VRRP are active-passive: one router forwards while the others stand by. GLBP is active-active -- it load-balances across multiple gateways.
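
The difference between the two architectures comes down to how the next node is chosen. The sketch below is a toy model. The class names are hypothetical, the health flag stands in for a real health check, and detection and promotion delays are not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    healthy: bool = True


class ActiveActive:
    """Round-robin across every healthy node; all members serve traffic."""

    def __init__(self, nodes: List[Node]):
        self.nodes = nodes
        self._counter = 0

    def route(self) -> str:
        healthy = [n for n in self.nodes if n.healthy]
        if not healthy:
            raise RuntimeError("no healthy nodes")
        node = healthy[self._counter % len(healthy)]
        self._counter += 1
        return node.name


class ActivePassive:
    """Primary serves everything; the standby is used only after a failure."""

    def __init__(self, primary: Node, standby: Node):
        self.primary = primary
        self.standby = standby

    def route(self) -> str:
        if self.primary.healthy:
            return self.primary.name
        if self.standby.healthy:
            return self.standby.name
        raise RuntimeError("no healthy nodes")


if __name__ == "__main__":
    a, b = Node("node-a"), Node("node-b")
    pool = ActiveActive([a, b])
    print([pool.route() for _ in range(4)])  # traffic alternates between both nodes
    a.healthy = False
    print(pool.route())                      # the surviving node takes every request
```

In the active-active case the surviving node was already serving traffic, which is why no promotion step appears. In the active-passive case the standby only appears in the output after the primary is marked unhealthy.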

Slide 5 of 8

Testing Your DR Plan

A plan that has never been tested is just a document.

Tabletop Exercise

Discussion-based walkthrough. Team members talk through their roles in a disaster scenario without touching any systems. Identifies gaps in procedures and communication chains. Low risk, low cost.

Simulation / Walkthrough

More detailed than tabletop. Team follows the actual runbook step-by-step, verifying that documentation matches reality. May identify missing credentials, outdated contacts, or unclear escalation paths.

Full Validation / Failover Test

Actually fail over to the DR site. Production traffic runs on the backup infrastructure. Proves the plan works under real conditions. Highest risk -- if the DR site fails too, you are exposed.

Best Practice

Run tabletop exercises quarterly. Run simulation tests semi-annually. Run full failover tests annually. Document every finding and update the DR plan immediately after each test.

Slide 6 of 8

Backup Types: Full, Differential, Incremental

Different strategies trade off storage space, backup speed, and restore complexity.

Full Backup

Copies all data every time. Slowest to perform, uses the most storage. Fastest to restore -- just load the single backup. Clears the archive bit on all files.

Differential Backup

Copies all data changed since the last full backup. Grows larger each day. To restore: last full + latest differential (2 sets). Does NOT clear the archive bit.

Incremental Backup

Copies only data changed since the last backup of any type. Smallest, fastest backup. To restore: last full + every incremental since (multiple sets). Clears the archive bit.
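
The archive-bit behavior of the three types can be modeled with a small simulation. This is a sketch: the `run_backup` function and the file names are hypothetical, and a dictionary stands in for the file system's archive attribute (True means the file has changed since the bit was last cleared).

```python
from typing import Dict, List


def run_backup(kind: str, archive_bits: Dict[str, bool]) -> List[str]:
    """Return the files a backup of this kind would copy, updating archive bits."""
    if kind == "full":
        copied = sorted(archive_bits)  # everything, regardless of the bit
    elif kind in ("differential", "incremental"):
        copied = sorted(f for f, bit in archive_bits.items() if bit)
    else:
        raise ValueError(f"unknown backup kind: {kind}")

    # Full and incremental clear the bit; differential leaves it set.
    if kind in ("full", "incremental"):
        for f in copied:
            archive_bits[f] = False
    return copied


if __name__ == "__main__":
    bits = {"a.txt": True, "b.txt": True, "c.txt": True}
    run_backup("full", bits)                  # copies all three, clears every bit
    bits["a.txt"] = True                      # a.txt is modified on Monday
    print(run_backup("differential", bits))   # ['a.txt']
    bits["b.txt"] = True                      # b.txt is modified on Tuesday
    print(run_backup("differential", bits))   # ['a.txt', 'b.txt']
```

Because the differential never clears the bit, Monday's change is copied again on Tuesday, which is why differentials grow each day. Swapping in `"incremental"` would copy only `b.txt` on Tuesday.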

Exam Scenario

Full backup on Sunday. Differential daily. Failure on Thursday. Restore = Sunday full + Wednesday differential (2 sets). Same scenario with incremental: Restore = Sunday full + Monday incremental + Tuesday incremental + Wednesday incremental (4 sets). Differential is faster to restore. Incremental is faster to back up.
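
The restore sets in this scenario can be derived mechanically. The sketch below assumes a schedule that uses only one type (differential or incremental) after the last full backup. Mixed schedules are deliberately out of scope, and the function name is hypothetical.

```python
from typing import List, Tuple

Backup = Tuple[str, str]  # (label, kind)


def restore_chain(backups: List[Backup]) -> List[Backup]:
    """Backups needed to restore, given the history in chronological order."""
    full_indexes = [i for i, (_, kind) in enumerate(backups) if kind == "full"]
    if not full_indexes:
        raise ValueError("no full backup to restore from")
    last_full = full_indexes[-1]
    chain = [backups[last_full]]
    later = backups[last_full + 1:]
    kinds = {kind for _, kind in later}

    if kinds <= {"differential"}:
        return chain + later[-1:]   # only the latest differential is needed
    if kinds <= {"incremental"}:
        return chain + later        # every incremental, in order
    raise ValueError("mixed schedules are out of scope for this sketch")


if __name__ == "__main__":
    diff = [("Sun", "full"), ("Mon", "differential"),
            ("Tue", "differential"), ("Wed", "differential")]
    inc = [("Sun", "full"), ("Mon", "incremental"),
           ("Tue", "incremental"), ("Wed", "incremental")]
    print(len(restore_chain(diff)))  # 2
    print(len(restore_chain(inc)))   # 4
```

The incremental chain must be applied in order, and a single missing or corrupt incremental breaks every restore point after it.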

Slide 7 of 8

Redundancy: NICs, Clusters, Power

Eliminate single points of failure at every layer of the infrastructure.

NIC Teaming (Link Aggregation)

Combine multiple NICs into one logical interface. Provides bandwidth aggregation and failover. If one NIC fails, traffic continues on the remaining NICs. Switch-assisted teams typically negotiate with LACP (originally IEEE 802.3ad, now maintained as 802.1AX).

Clustering

Multiple servers act as a single system. If one node fails, the cluster redistributes workloads. Common for databases, application servers, and hypervisors. Requires shared storage or data replication.

Multipathing

Multiple physical paths to storage (SAN). If one path fails, I/O continues on an alternate path. Prevents storage connectivity from becoming a single point of failure.

UPS (Uninterruptible Power Supply)

Battery backup that provides immediate power during an outage. Bridges the gap until the generator starts (typically 10-30 seconds). Also conditions power to protect against surges and sags.

Generator

Provides long-term power during extended outages. Runs on diesel or natural gas. Takes seconds to start -- the UPS covers the gap. Regular testing and fuel management are critical.
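
The UPS-to-generator handoff can be sanity-checked with rough arithmetic. The sketch below uses a linear runtime estimate that ignores battery discharge curves, aging, and temperature. The function names, the 0.9 inverter efficiency, and the 2x safety factor are all assumptions for illustration, so vendor runtime charts should take precedence.

```python
def ups_runtime_seconds(
    battery_wh: float, load_w: float, inverter_efficiency: float = 0.9
) -> float:
    """Rough runtime estimate: usable energy divided by load (linear model)."""
    if load_w <= 0:
        raise ValueError("load must be positive")
    return battery_wh * inverter_efficiency / load_w * 3600.0


def ups_covers_generator_start(
    ups_runtime_s: float, generator_start_s: float, safety_factor: float = 2.0
) -> bool:
    """True if the UPS can carry the load until the generator is online, with margin."""
    return ups_runtime_s >= generator_start_s * safety_factor


if __name__ == "__main__":
    # Hypothetical rack: 500 Wh of battery feeding a 1,500 W load.
    runtime = ups_runtime_seconds(battery_wh=500, load_w=1500)
    print(f"estimated runtime: {runtime:.0f} s")
    print("covers a 30 s generator start:", ups_covers_generator_start(runtime, 30))
```

The safety factor allows for a generator that needs a second start attempt.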

Slide 8 of 8 | N10-009 Obj 3.3

HA/DR -- Key Takeaways

Monday morning, the CEO calls. You answer: "We lost 45 minutes of data -- within our 1-hour RPO target. The warm site was online in 3 hours -- within our 4-hour RTO. Sunday's full backup plus the most recent differential restored the database. NIC teaming with LACP kept the network path alive during the failover."

5 Facts to Carry Out of This Presentation

1 RPO = max data loss. RTO = max downtime. MTTR = avg repair time. MTBF = avg time between failures.

2 Hot site = minutes (real-time replication). Warm site = hours (periodic sync). Cold site = days (empty facility).

3 Active-active = both nodes serve traffic. Active-passive = standby waits for failover.

4 Full = everything. Differential = since last full. Incremental = since last backup of any type.

5 NIC teaming (LACP) aggregates links. UPS bridges to generator. Test DR plans regularly -- tabletop, simulation, full failover.
