Business Continuity and Disaster Recovery

Slide 1 of 14 | CSP-W2-BCP | Week 2

BCP vs DRP • BIA • Recovery Metrics • DR Sites • Backup Strategies • IR Lifecycle

When a ransomware attack encrypts your production servers at 2 AM, or a hurricane floods your data center, the question is not whether your organization will face a disaster -- it is whether you have a plan that survives first contact. Business continuity and disaster recovery are not aspirational -- they are operational requirements. This deck covers the planning, metrics, strategies, and testing that separate organizations that recover from those that do not.

14 Slides CSP-W2-BCP Week 2 CIS2208 -- Cybersecurity Policy

Slide 2 of 14

BCP vs DRP

Two related but distinct disciplines. BCP keeps the business running. DRP restores IT systems. Both are required.

Business Continuity Plan (BCP)

A comprehensive plan that ensures critical business functions can continue during and after a disaster. BCP is broader than IT -- it covers people, processes, facilities, supply chains, and communication. A hospital BCP addresses patient care with paper records if the EHR goes down. A bank BCP covers manual transaction processing. BCP asks: "How does the business keep operating?"

Disaster Recovery Plan (DRP)

A subset of BCP focused specifically on restoring IT infrastructure and data after a disruption. DRP covers server failover, database restoration, network recovery, and application restart procedures. DRP is technical and measurable -- it defines exact steps, sequences, and timelines for bringing systems back online. DRP asks: "How do we restore the technology?"

Critical Distinction

A DRP without a BCP means you can restore servers but have no plan for business operations during the outage. A BCP without a DRP means you have business workarounds but no path to restore technology. Organizations need both. NIST SP 800-34 provides the federal framework for integrating continuity and recovery planning.

Slide 3 of 14

Business Impact Analysis

The BIA is the foundation of every continuity plan. It identifies what matters most, how long you can survive without it, and what breaks first.

Identify Critical Functions

What processes generate revenue, serve customers, or meet legal obligations? Map every business function to its impact if unavailable for 1 hour, 4 hours, 24 hours, and 1 week. The answers drive every recovery priority.

Map Dependencies

Critical functions rarely stand alone. Payment processing depends on databases, networks, and third-party APIs. A dependency map reveals hidden single points of failure -- the one vendor, server, or person whose absence halts everything.
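A dependency map like the one described above can be walked mechanically to see which components take down the most critical functions. The sketch below is a minimal illustration; every function and component name in it is hypothetical, and it assumes each listed dependency is strictly required (no redundancy).

```python
from collections import defaultdict

# Hypothetical dependency map: each key needs everything in its list to work.
DEPENDS_ON = {
    "payment_processing": ["orders_db", "core_network", "card_gateway_api"],
    "customer_portal": ["web_cluster", "orders_db", "core_network"],
    "payroll": ["hr_db", "core_network"],
    "orders_db": ["storage_array"],
    "hr_db": ["storage_array"],
    "web_cluster": [],
    "core_network": [],
    "card_gateway_api": [],
    "storage_array": [],
}
CRITICAL_FUNCTIONS = ["payment_processing", "customer_portal", "payroll"]


def transitive_dependencies(node, graph):
    """Return every component that `node` relies on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def blast_radius(functions, graph):
    """Map each component to the critical functions that halt if it fails."""
    impact = defaultdict(set)
    for function in functions:
        for component in transitive_dependencies(function, graph):
            impact[component].add(function)
    return {component: sorted(funcs) for component, funcs in impact.items()}


radius = blast_radius(CRITICAL_FUNCTIONS, DEPENDS_ON)
# Components whose loss halts every critical function are the ones to fix first.
halts_everything = sorted(
    c for c, funcs in radius.items() if len(funcs) == len(CRITICAL_FUNCTIONS)
)
print(halts_everything)
```

In this made-up map the storage array never appears in any function's direct dependency list, yet it sits under every critical function. That is the kind of hidden single point of failure the mapping exercise is meant to surface.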

Quantify Financial Impact

Calculate the cost of downtime per hour for each function: lost revenue, regulatory fines, SLA penalties, reputational damage, and recovery labor. This financial data justifies the budget for continuity investments to executive leadership.
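The hourly cost calculation can be sketched in a few lines. All figures below are hypothetical, the model assumes cost grows linearly with outage length (real costs often escalate), and reputational damage is left out because it rarely reduces to an hourly number.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCost:
    """Hourly cost components for one business function (figures are hypothetical)."""
    name: str
    lost_revenue: float
    sla_penalties: float
    recovery_labor: float
    regulatory_fines: float = 0.0

    def per_hour(self) -> float:
        """Total cost of one hour of downtime."""
        return (self.lost_revenue + self.sla_penalties
                + self.recovery_labor + self.regulatory_fines)

    def for_outage(self, hours: float) -> float:
        """Cost of an outage of the given length, assuming linear growth."""
        return self.per_hour() * hours


# The four outage windows used when mapping impact: 1 hour, 4 hours, 24 hours, 1 week.
OUTAGE_WINDOWS_HOURS = [1, 4, 24, 168]

functions = [
    DowntimeCost("payroll", lost_revenue=0, sla_penalties=0,
                 recovery_labor=300, regulatory_fines=200),
    DowntimeCost("payment_processing", lost_revenue=12000,
                 sla_penalties=1500, recovery_labor=800),
]

# Rank functions by hourly cost so the most expensive outage is restored first.
ranked = sorted(functions, key=lambda f: f.per_hour(), reverse=True)
for function in ranked:
    costs = {hours: function.for_outage(hours) for hours in OUTAGE_WINDOWS_HOURS}
    print(function.name, costs)
```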

BIA is Not Optional

Without a BIA, recovery priorities are based on assumptions and politics rather than data. The department that shouts loudest gets restored first, not the function that matters most. A rigorous BIA transforms continuity planning from guesswork into engineering.

Slide 4 of 14

Recovery Metrics

Four numbers that define your recovery posture. Every DR contract, SLA, and backup policy references these metrics.

RPO

Recovery Point Objective -- the maximum acceptable data loss measured in time. An RPO of 1 hour means you can lose up to 1 hour of data. Drives backup frequency: backups or replication must run at least as often as the RPO, so an RPO of 15 minutes requires near-continuous replication.

RTO

Recovery Time Objective -- the maximum acceptable downtime before business impact becomes unacceptable. An RTO of 4 hours means systems must be restored within 4 hours of failure. Drives DR site selection and failover architecture.

MTD

Maximum Tolerable Downtime -- the absolute limit beyond which the business suffers irreversible harm: regulatory penalties, permanent customer loss, or insolvency. MTD must always be greater than RTO, because restored systems still need verification and backlog catch-up before the business is truly operating again. If MTD = RTO, there is zero margin for error.

MTBF

Mean Time Between Failures -- the average operational time between system failures. Calculated from historical data: total operating time divided by the number of failures. Higher MTBF indicates more reliable systems. Used to estimate how often failures are likely (not to predict the timing of any single failure) and to justify redundancy investments.
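As a worked example, MTBF can be computed directly from a failure log. The history below is hypothetical, and the yearly estimate ignores repair time for simplicity.

```python
def mean_time_between_failures(uptime_hours):
    """Average operating time between failures, given the uptime before each failure."""
    if not uptime_hours:
        raise ValueError("need at least one observed failure interval")
    return sum(uptime_hours) / len(uptime_hours)


# Hypothetical history: hours the system ran before each of four failures.
history = [700.0, 820.0, 640.0, 840.0]
mtbf = mean_time_between_failures(history)

# Rough planning figure: failures expected in a year of continuous operation.
HOURS_PER_YEAR = 8760
expected_failures_per_year = HOURS_PER_YEAR / mtbf

print(f"MTBF: {mtbf:.0f} hours")
print(f"Expected failures per year: {expected_failures_per_year:.2f}")
```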

Why These Numbers Matter

Every DR contract, cloud SLA, and backup policy references RTO and RPO. If your vendor promises a 4-hour RTO but your BIA requires 1 hour, you have a gap. If your backup runs nightly but your RPO is 1 hour, you can lose up to 24 hours of data -- 23 hours more than the business can accept. These metrics translate business requirements into technical specifications.
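The gap analysis described here is simple enough to automate. The sketch below compares BIA targets with what is actually in place; the class and function names are illustrative, and all values are in hours.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    """Targets taken from the BIA, all in hours."""
    rpo_hours: float
    rto_hours: float
    mtd_hours: float


def find_gaps(targets, backup_interval_hours, vendor_rto_hours):
    """Return readable descriptions of gaps between BIA targets and reality."""
    gaps = []
    if backup_interval_hours > targets.rpo_hours:
        excess = backup_interval_hours - targets.rpo_hours
        gaps.append(
            f"backups every {backup_interval_hours:g}h can lose up to "
            f"{backup_interval_hours:g}h of data, {excess:g}h beyond the RPO"
        )
    if vendor_rto_hours > targets.rto_hours:
        gaps.append(
            f"vendor RTO of {vendor_rto_hours:g}h exceeds the required "
            f"{targets.rto_hours:g}h"
        )
    if targets.rto_hours >= targets.mtd_hours:
        gaps.append("RTO leaves no margin before MTD")
    return gaps


# The two mismatches from the text: nightly backups against a 1-hour RPO,
# and a 4-hour vendor RTO against a 1-hour requirement. The MTD is made up.
targets = RecoveryTargets(rpo_hours=1, rto_hours=1, mtd_hours=8)
for gap in find_gaps(targets, backup_interval_hours=24, vendor_rto_hours=4):
    print(gap)
```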

Slide 5 of 14

Risk Assessment for Continuity

Continuity planning starts with understanding what can go wrong. Not all threats are equal -- and the most likely are rarely the most dramatic.

Natural Disasters

Hurricanes, earthquakes, floods, wildfires, tornadoes. Geographic risk varies dramatically -- a data center in Miami faces hurricane risk; one in San Francisco faces seismic risk. BCP must account for regional threat profiles. FEMA flood maps and USGS seismic data inform site selection.

Cyber Incidents

Ransomware, DDoS, data breaches, supply chain compromises. Cyber events are now among the most common triggers for DR activation. The 2021 Colonial Pipeline ransomware attack halted a pipeline that carries roughly 45% of the East Coast's fuel for about six days. Cyber-specific DRP is no longer optional.

Pandemic / Health Crisis

COVID-19 proved that workforce availability is a continuity risk. Organizations that could not support remote work within days faced existential threats. Pandemic BCP covers workforce health, remote access scaling, supply chain alternatives, and extended operational degradation.

Supply Chain Failure

Single-vendor dependency, chip shortages, cloud provider outages, SaaS vendor bankruptcy. The global chip shortage that began in 2020 disrupted manufacturing for 18+ months. BCP must identify vendor dependencies and establish alternate sourcing. The 2020 SolarWinds attack proved that trusted suppliers can become attack vectors.

Infrastructure Failure

Power grid outages, ISP failures, cooling system failures, hardware end-of-life. The 2003 Northeast blackout affected 55 million people across eight states and Canada. Redundant power (UPS + generator), diverse ISP connections, and hardware refresh cycles are baseline continuity controls.

Human Factors

Key person dependency, accidental deletion, insider threat, labor actions. If one engineer holds all the passwords, their absence is a continuity event. Cross-training, documentation, and separation of duties reduce human-factor risk. The "bus factor" is a real metric.

Risk Assessment Reality

Organizations consistently underestimate the probability of common events and overestimate rare ones. You are far more likely to face a ransomware attack or ISP outage than an earthquake. Risk assessment must be data-driven, updated annually, and tied to actual incident history -- not worst-case imagination.

Slide 6 of 14

Disaster Recovery Site Strategies

Where do you go when your primary site is unavailable? Four options, each with different cost, speed, and capability tradeoffs.

Hot Site

A fully operational duplicate of the primary site with real-time data replication. Staff can switch operations within minutes. Highest cost but lowest RTO. Used by financial institutions, hospitals, and critical infrastructure where any downtime is unacceptable. Typical cost: 6-10x the cold site equivalent.

Warm Site

Partial infrastructure is in place -- hardware exists but may need configuration, software loading, and data restoration from recent backups. RTO of hours to a day. The compromise option: significantly cheaper than hot, significantly faster than cold. Most mid-size organizations land here.

Cold Site

An empty facility with power, cooling, and network connectivity but no hardware or data. Everything must be procured, installed, and configured after activation. RTO of days to weeks. Lowest cost but longest recovery. Suitable only for non-critical functions with high MTD tolerances.

Cloud-Based DR

Infrastructure spun up on demand in AWS, Azure, or GCP. Eliminates physical site management. Pay-per-use model means you only pay full cost during an actual disaster. Multi-region replication can achieve hot-site RTOs at warm-site costs. The modern default for most organizations, but requires careful planning for egress costs, bandwidth, and vendor lock-in.
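The cost-versus-RTO tradeoff across the physical site types can be expressed as a simple selection rule: pick the cheapest option whose worst-case recovery time still meets the required RTO. The hour values below are assumptions chosen to match the ranges in this slide ("minutes", "hours to a day", "days to weeks"), not vendor figures. Cloud-based DR is omitted because its RTO depends on how it is configured.

```python
# (site type, assumed worst-case RTO in hours, relative cost rank: 1 = cheapest)
SITE_OPTIONS = [
    ("cold", 14 * 24, 1),   # days to weeks; two weeks assumed as worst case
    ("warm", 24, 2),        # hours to a day
    ("hot", 0.25, 3),       # minutes; fifteen minutes assumed as worst case
]


def cheapest_site_meeting(rto_hours):
    """Return the cheapest site type whose worst-case RTO fits, or None if none fits."""
    candidates = [site for site in SITE_OPTIONS if site[1] <= rto_hours]
    if not candidates:
        return None
    return min(candidates, key=lambda site: site[2])[0]


for required_rto in (0.1, 4, 48, 720):
    print(f"RTO {required_rto}h -> {cheapest_site_meeting(required_rto)}")
```

A result of None means no listed option meets the target, which points to an architecture change (for example active-active operation) rather than a different site type.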

Slide 7 of 14

Backup Strategies

Backups are the last line of defense. If everything else fails, backups determine whether you recover or start over.

Full Backup

Complete copy of all data every time. Simple to restore but consumes the most storage and takes the longest to run. Typically done weekly or monthly, with incremental or differential backups between cycles.

Incremental Backup

Only backs up data changed since the last backup of any type. Fastest and smallest. But restoration requires the last full backup plus every incremental since then, in order. Failure of any one incremental breaks the chain.

Differential Backup

Backs up all data changed since the last full backup. Grows larger each day but restoration only requires the last full plus the latest differential. Balances speed and restore simplicity between full and incremental.
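The difference in restore effort between incremental and differential schemes is easiest to see by computing which backup sets a restore needs. The sketch below is a minimal model using the definitions above; the backup labels are hypothetical.

```python
def restore_chain(backups):
    """Return the labels needed to restore to the latest backup, in restore order.

    `backups` is a chronological list of (label, kind) tuples, where kind is
    "full", "incremental", or "differential".
    """
    needed = []
    i = len(backups) - 1
    while i >= 0:
        label, kind = backups[i]
        needed.append(label)
        if kind == "full":
            return list(reversed(needed))
        if kind == "differential":
            # A differential only needs the most recent full backup before it.
            i -= 1
            while i >= 0 and backups[i][1] != "full":
                i -= 1
        elif kind == "incremental":
            # An incremental needs whatever backup came immediately before it.
            i -= 1
        else:
            raise ValueError(f"unknown backup kind: {kind}")
    raise ValueError("no full backup found to anchor the restore")


incremental_week = [
    ("Sun-full", "full"),
    ("Mon-inc", "incremental"),
    ("Tue-inc", "incremental"),
    ("Wed-inc", "incremental"),
]
differential_week = [
    ("Sun-full", "full"),
    ("Mon-diff", "differential"),
    ("Tue-diff", "differential"),
    ("Wed-diff", "differential"),
]

print(restore_chain(incremental_week))
print(restore_chain(differential_week))
```

The incremental schedule needs all four sets in order, so one damaged set breaks the restore. The differential schedule needs only the full backup and the latest differential.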

Immutable Backups

Ransomware operators specifically target backups for encryption or deletion. Immutable backups cannot be modified or deleted for a defined retention period -- even by administrators. Cloud providers offer object lock (AWS S3 Object Lock, Azure Immutable Blob Storage). Air-gapped tape remains the gold standard for immutability. If your backups can be encrypted by the same ransomware that hit production, they are not backups -- they are liabilities.
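Backup placement can be audited against the 3-2-1 rule named in this deck's takeaways, commonly stated as three copies of the data, on two different media types, with one copy offsite. The sketch below is illustrative: the inventory is hypothetical, and the immutability check is an extra test reflecting the paragraph above, not part of the classic rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data, counting the production copy itself."""
    location: str
    media: str
    offsite: bool
    immutable: bool = False


def check_3_2_1(copies):
    """Evaluate each part of the 3-2-1 rule separately so failures are visible."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media_types": len({copy.media for copy in copies}) >= 2,
        "one_offsite": any(copy.offsite for copy in copies),
    }


def has_immutable_copy(copies):
    """Extra check beyond 3-2-1: at least one copy ransomware cannot alter."""
    return any(copy.immutable for copy in copies)


# Hypothetical inventory.
inventory = [
    BackupCopy("production SAN", media="disk", offsite=False),
    BackupCopy("local backup server", media="disk", offsite=False),
    BackupCopy("cloud object storage", media="object", offsite=True, immutable=True),
]

print(check_3_2_1(inventory))
print("immutable copy present:", has_immutable_copy(inventory))
```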

Slide 8 of 14

Incident Response Lifecycle

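The six-phase lifecycle below is the widely taught breakdown, commonly attributed to SANS. NIST SP 800-61 Rev. 2 groups the same activities into four phases: Preparation; Detection and Analysis; Containment, Eradication, and Recovery; and Post-Incident Activity. Each phase feeds the next. Skipping phases invites repeat incidents.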

01 Prepare -- Policies, tools, team training, playbooks

02 Detect -- SIEM alerts, anomaly detection, user reports

03 Contain -- Isolate affected systems, limit blast radius

04 Eradicate -- Remove threat actor, patch vulnerabilities

05 Recover -- Restore systems, verify integrity, monitor

06 Learn -- Post-incident review, update playbooks

The Lessons Learned Gap

Many organizations complete phases 1-5 but skip phase 6. Without a formal post-incident review, the same root causes produce the same incidents. The Lessons Learned phase feeds back into Preparation -- it is not the end of the cycle but the beginning of the next one. NIST SP 800-61 builds this feedback loop into its lifecycle for a reason.

Slide 9 of 14

Crisis Communication Plans

When a disaster occurs, who talks to whom, when, and through what channel? Poorly managed communication turns a technical incident into a public relations disaster.

Internal Communication

Incident response team activation, executive notification chains, employee updates. Define primary and backup communication channels -- if email is down, what do you use? Pre-draft templates for common scenarios. Establish a cadence: initial alert within 15 minutes, situation updates every 2 hours.

Stakeholder Notification

Customers, partners, vendors, board of directors. Each stakeholder group needs different information at different times. Customers need to know if their data is affected. Vendors need to know if SLAs are at risk. The board needs to understand financial exposure. Over-communicating is always better than silence.

Regulatory Reporting

GDPR requires notifying the supervisory authority within 72 hours of becoming aware of a personal data breach. HIPAA requires notifying affected individuals without unreasonable delay and no later than 60 days after discovery. State breach notification laws vary -- some require notification within 30 days. SEC rules require material cybersecurity incidents to be disclosed on Form 8-K within 4 business days of determining that the incident is material. Missing a deadline creates legal liability on top of the original incident.
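Because these clocks start at different moments and run in different units, it helps to compute the deadlines as soon as an incident is declared. The sketch below is a simplified illustration, not legal guidance. It assumes the GDPR and HIPAA clocks both start when the organization becomes aware of the breach, that the SEC clock starts when the incident is determined to be material, and that business days exclude only weekends (holidays are ignored). The function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def add_business_days(start, days):
    """Advance by `days` weekdays, skipping Saturdays and Sundays (holidays ignored)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def notification_deadlines(aware_at, material_at):
    """Return the latest notification time for each regime covered above."""
    return {
        "GDPR (72 hours)": aware_at + timedelta(hours=72),
        "HIPAA (60 days)": aware_at + timedelta(days=60),
        "SEC 8-K (4 business days)": add_business_days(material_at, 4),
    }


# Hypothetical incident: discovered Thursday at 2 AM, judged material that afternoon.
aware_at = datetime(2024, 3, 7, 2, 0)
material_at = datetime(2024, 3, 7, 14, 0)

deadlines = notification_deadlines(aware_at, material_at)
for regime, deadline in deadlines.items():
    print(f"{regime}: {deadline:%Y-%m-%d %H:%M}")
```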

Media / Public Relations

Designate a single spokesperson. Pre-approve messaging templates with legal counsel. Never speculate about root cause, scope, or timeline in public statements. Acknowledge the incident, describe actions being taken, and provide a timeline for updates. Equifax's widely criticized breach disclosure in 2017 compounded the reputational damage of the breach itself.

Law Enforcement

FBI (IC3), CISA, and local law enforcement for criminal cyber incidents. Preserve evidence before contacting -- chain of custody matters. File reports through official channels: FBI's Internet Crime Complaint Center for cyber incidents, local police for physical incidents. Cooperation with law enforcement may be legally required for certain incident types.

The Communication Paradox

During a crisis, communication bandwidth is lowest precisely when communication needs are highest. Pre-built contact lists, message templates, escalation trees, and channel redundancy must exist before the incident. If you are figuring out who to call during the disaster, you have already failed.

Slide 10 of 14

Testing and Exercises

An untested plan is not a plan -- it is a document. Testing reveals gaps, builds muscle memory, and validates assumptions before a real disaster does.

Tabletop Exercise

A facilitated discussion where team members walk through a hypothetical scenario. No systems are touched. Low cost, easy to schedule, and effective at identifying communication gaps, role confusion, and plan deficiencies. Should be conducted at least quarterly. Common scenarios: ransomware attack, cloud provider outage, insider data theft.

Structured Walkthrough

Team members step through the actual plan document, procedure by procedure, verifying that each step is accurate, contact information is current, and dependencies are documented. More structured than a tabletop but still discussion-based. Often reveals outdated procedures and missing runbooks.

Functional Simulation

A realistic drill where teams execute recovery procedures in a test environment. Systems are actually failed over, backups are restored, communication trees are activated. Tests both the plan and the people. Reveals timing gaps -- the plan says "restore from backup" but how long does that actually take?
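One way to answer "how long does that actually take?" is to time each recovery step during the drill and compare the total with the RTO. The sketch below is a minimal, hypothetical timer. A scripted clock stands in for real elapsed time so the example runs instantly; in a real drill the default `time.monotonic` clock would be used and each `with` block would wrap the actual procedure.

```python
import time
from contextlib import contextmanager


class DrillTimer:
    """Records how long each recovery step takes during an exercise."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.steps = []  # list of (step name, elapsed seconds)

    @contextmanager
    def step(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised an error.
            self.steps.append((name, self._clock() - start))

    def total_seconds(self):
        return sum(elapsed for _, elapsed in self.steps)

    def meets_rto(self, rto_seconds):
        return self.total_seconds() <= rto_seconds


# Scripted clock readings in seconds: step 1 takes 90 minutes, step 2 takes 60.
scripted = iter([0, 5400, 5400, 9000])
timer = DrillTimer(clock=lambda: next(scripted))

with timer.step("restore database from backup"):
    pass  # the real restore procedure would run here
with timer.step("restart application tier"):
    pass  # the real restart procedure would run here

RTO_SECONDS = 4 * 3600
print(timer.steps)
print("meets 4-hour RTO:", timer.meets_rto(RTO_SECONDS))
```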

Full Interruption Test

The production system is actually taken offline and recovery procedures are executed for real. The most realistic test but the highest risk -- if recovery fails, you have a real outage. Reserved for mature organizations with high confidence in their plans. Financial regulators (FFIEC) may require full interruption tests annually.

Slide 11 of 14

Pandemic Lessons

COVID-19 exposed flawed assumptions at the core of business continuity planning. The organizations that survived had plans that accounted for workforce disruption at scale.

VPN and Remote Access

Organizations designed for 10% remote work suddenly needed 100%. VPN concentrators saturated. Split-tunnel vs full-tunnel decisions had security implications. Zero Trust architectures proved more resilient than VPN-dependent models. Lesson: design remote access for surge capacity, not steady state.

Supply Chain Resilience

Just-in-time inventory models failed when global shipping stopped. Hardware refresh cycles extended by 6-12 months. Single-source vendors became single points of failure. Lesson: BCP must include vendor diversification, safety stock for critical components, and alternate sourcing agreements.

The Assumption That Broke

Pre-COVID BCPs assumed disasters were localized and temporary -- a hurricane hits one site, you fail over to another. COVID was global and sustained. Every site was affected simultaneously. Every vendor was affected simultaneously. The lesson: BCP must account for scenarios where there is no unaffected site to fail over to.

Slide 12 of 14

Regulatory Requirements

Business continuity is not optional in regulated industries. Non-compliance carries penalties that can exceed the cost of the disaster itself.

Framework	Sector	BCP/DR Requirements
FFIEC	Financial	Comprehensive BCP required for all FDIC-insured institutions. Must include BIA, risk assessment, testing, and board-approved plans. Annual testing with full interruption tests expected. Examiners review BCP during every safety and soundness exam.
HIPAA	Healthcare	Contingency Plan standard (45 CFR 164.308(a)(7)) requires data backup, disaster recovery, and emergency mode operation plans. Must maintain retrievable exact copies of ePHI and procedures for restoring lost data. Testing and revision procedures are an addressable (not required) implementation specification.
SOX	Public Companies	Section 404 requires internal controls over financial reporting, which includes IT continuity. Auditors assess whether financial systems can survive disruption. Failure to demonstrate adequate DR controls can result in material weakness findings.
GDPR Art 32	EU Data	Requires "the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident." Also requires regular testing of security measures. Infringements of Article 32 can draw fines of up to 10 million euros or 2% of global annual turnover, whichever is higher; the 20 million euro / 4% tier applies to other GDPR violations.
NIST SP 800-34	Federal	Contingency Planning Guide for Federal Information Systems. Defines the seven-step process: policy, BIA, preventive controls, recovery strategies, plan development, testing, and maintenance. Required for all federal systems under FISMA.
ISO 22301	International	The international standard for business continuity management systems (BCMS). Certifiable standard that provides a framework for planning, establishing, implementing, operating, monitoring, reviewing, and maintaining a BCMS.

Compliance is the Floor, Not the Ceiling

Meeting regulatory minimums does not mean your organization will survive a disaster. Regulations define the baseline -- what you must do to avoid penalties. Effective BCP/DR goes beyond compliance to address your specific risk profile, operational dependencies, and recovery capabilities. The goal is resilience, not checkboxes.

Slide 13 of 14

Key Takeaways

The principles that separate organizations that recover from those that do not.

1 BCP keeps the business running during a disaster. DRP restores the technology. Both are required -- neither is sufficient alone. DRP is a subset of BCP.

2 The Business Impact Analysis (BIA) is the foundation. Without it, recovery priorities are driven by politics instead of data. Every continuity decision flows from the BIA.

3 RPO defines acceptable data loss. RTO defines acceptable downtime. MTD is the absolute limit. These metrics translate business requirements into technical specifications for every backup and DR contract.

4 DR site selection (hot, warm, cold, cloud) is a cost-vs-RTO tradeoff. Cloud-based DR is the modern default, offering hot-site capabilities at warm-site costs -- but requires careful planning for egress and vendor lock-in.

5 The 3-2-1 backup rule (three copies of your data, on two different media types, with one copy offsite) is the minimum standard. Immutable backups are essential in the ransomware era -- if your backups can be encrypted by the same attack, they are not backups.

6 The incident response lifecycle taught here has six phases (NIST SP 800-61 Rev. 2 groups the same work into four). Many organizations skip Lessons Learned, which invites repeat incidents. The cycle is a loop, not a line.

7 An untested plan is not a plan. Testing ranges from tabletop exercises (low cost) to full interruption tests (high realism). Start with tabletops quarterly and work up.

8 COVID-19 broke the assumption that disasters are localized and temporary. Modern BCP must account for global, sustained disruption with no unaffected failover site.

What Comes Next

These concepts are not theoretical -- they are operational requirements tested by every major incident. When you write cybersecurity policy, every control you recommend either supports or undermines your organization's ability to continue operating and recover from disruption. BCP and DRP are the bridge between policy on paper and survival in practice.
