Business Continuity and Disaster Recovery | Cybersecurity Policy

Slide 1 of 14  |  CSP-W2-BCP  |  Week 2
Business Continuity and Disaster Recovery
BCP vs DRP  •  BIA  •  Recovery Metrics  •  DR Sites  •  Backup Strategies  •  IR Lifecycle
When a ransomware attack encrypts your production servers at 2 AM, or a hurricane floods your data center, the question is not whether your organization will face a disaster -- it is whether you have a plan that survives first contact. Business continuity and disaster recovery are not aspirational -- they are operational requirements. This deck covers the planning, metrics, strategies, and testing that separate organizations that recover from those that do not.
14 Slides CSP-W2-BCP Week 2 CIS2208 -- Cybersecurity Policy
Slide 2 of 14
BCP vs DRP
Two related but distinct disciplines. BCP keeps the business running. DRP restores IT systems. Both are required.
Diagram: BCP covers business operations, alternate processes, stakeholder comms, supply chain, and manual workarounds. DRP covers IT infrastructure, data restoration, system failover, network recovery, and technical playbooks. Shared goals: minimize downtime, protect assets.
Business Continuity Plan (BCP)
A comprehensive plan that ensures critical business functions can continue during and after a disaster. BCP is broader than IT -- it covers people, processes, facilities, supply chains, and communication. A hospital BCP addresses patient care with paper records if the EHR goes down. A bank BCP covers manual transaction processing. BCP asks: "How does the business keep operating?"
Disaster Recovery Plan (DRP)
A subset of BCP focused specifically on restoring IT infrastructure and data after a disruption. DRP covers server failover, database restoration, network recovery, and application restart procedures. DRP is technical and measurable -- it defines exact steps, sequences, and timelines for bringing systems back online. DRP asks: "How do we restore the technology?"
Critical Distinction
A DRP without a BCP means you can restore servers but have no plan for business operations during the outage. A BCP without a DRP means you have business workarounds but no path to restore technology. Organizations need both. NIST SP 800-34 provides the federal framework for integrating continuity and recovery planning.
Slide 3 of 14
Business Impact Analysis
The BIA is the foundation of every continuity plan. It identifies what matters most, how long you can survive without it, and what breaks first.
BIA priority matrix (axes: business impact and recovery urgency, each high to low):
CRITICAL | Payment processing, patient care systems, auth / identity services | RTO: 0-4 hours
IMPORTANT | Email systems, ERP / CRM platforms, reporting dashboards | RTO: 4-24 hours
TIME-SENSITIVE | VoIP / phone systems, badge access systems | RTO: 24-72 hours
DEFERRABLE | Training portals, internal wikis | RTO: 72+ hours
Identify Critical Functions
What processes generate revenue, serve customers, or meet legal obligations? Map every business function to its impact if unavailable for 1 hour, 4 hours, 24 hours, and 1 week. The answers drive every recovery priority.
Map Dependencies
Critical functions rarely stand alone. Payment processing depends on databases, networks, and third-party APIs. A dependency map reveals hidden single points of failure -- the one vendor, server, or person whose absence halts everything.
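One way to make a dependency map concrete is a small graph walk. The sketch below is illustrative only: the component names, the DEPENDS_ON map, and both helper functions are hypothetical, and it assumes no component has a redundant standby.

```python
from collections import defaultdict

# Hypothetical dependency map: each function or component lists what it
# directly depends on. All names are illustrative.
DEPENDS_ON = {
    "payment_processing": ["orders_db", "core_network", "card_gateway_api"],
    "patient_portal": ["records_db", "core_network", "identity_service"],
    "reporting": ["orders_db", "identity_service"],
    "orders_db": ["storage_array"],
    "records_db": ["storage_array"],
    "identity_service": ["core_network"],
}

CRITICAL_FUNCTIONS = ["payment_processing", "patient_portal", "reporting"]


def transitive_dependencies(node, graph):
    """Return every component reachable from node, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, []))
    return seen


def blast_radius(functions, graph):
    """Map each component to the set of functions that stop if it fails."""
    impacted = defaultdict(set)
    for function in functions:
        for component in transitive_dependencies(function, graph):
            impacted[component].add(function)
    return impacted


impact = blast_radius(CRITICAL_FUNCTIONS, DEPENDS_ON)

# Components whose loss halts every critical function are the hidden
# single points of failure the map is meant to expose.
single_points = sorted(
    component
    for component, functions in impact.items()
    if len(functions) == len(CRITICAL_FUNCTIONS)
)
print(single_points)
```

In this made-up map, the network and the shared storage array sit underneath every critical function even though no function lists the storage array directly. That indirect dependency is the kind of finding a dependency map exists to surface.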
Quantify Financial Impact
Calculate the cost of downtime per hour for each function: lost revenue, regulatory fines, SLA penalties, reputational damage, and recovery labor. This financial data justifies the budget for continuity investments to executive leadership.
BIA is Not Optional
Without a BIA, recovery priorities are based on assumptions and politics rather than data. The department that shouts loudest gets restored first, not the function that matters most. A rigorous BIA transforms continuity planning from guesswork into engineering.
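A minimal sketch of the downtime-cost calculation, assuming cost grows linearly with outage length. The BusinessFunction class and every dollar figure are placeholders for illustration; one-off costs such as regulatory fines and reputational damage are left out because they do not accrue hourly.

```python
from dataclasses import dataclass


@dataclass
class BusinessFunction:
    name: str
    lost_revenue_per_hour: int
    sla_penalty_per_hour: int
    recovery_labor_per_hour: int

    def downtime_cost(self, hours: int) -> int:
        """Linear estimate: hourly cost components times outage length."""
        hourly = (
            self.lost_revenue_per_hour
            + self.sla_penalty_per_hour
            + self.recovery_labor_per_hour
        )
        return hourly * hours


# Placeholder figures for illustration only, not benchmarks.
functions = [
    BusinessFunction("training_portal", 0, 0, 50),
    BusinessFunction("payment_processing", 20_000, 5_000, 1_000),
    BusinessFunction("email", 500, 0, 200),
]

# The outage windows the BIA asks about: 1 hour, 4 hours, 24 hours, 1 week.
WINDOWS_HOURS = [1, 4, 24, 168]

# Rank by hourly cost so the most expensive outage is restored first.
ranked = sorted(functions, key=lambda f: f.downtime_cost(1), reverse=True)
for function in ranked:
    costs = [function.downtime_cost(hours) for hours in WINDOWS_HOURS]
    print(function.name, costs)
```

Ranking on numbers like these, rather than on who asks loudest, is what turns the BIA into a defensible restoration order.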
Slide 4 of 14
Recovery Metrics
Four numbers that define your recovery posture. Every DR contract, SLA, and backup policy references these metrics.
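Timeline diagram (left to right along TIME): last backup, DISASTER, systems restored, business fails. RPO spans last backup to disaster; RTO spans disaster to systems restored; MTD spans disaster to business failure; MTBF is the average time between failures.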
RPO
Recovery Point Objective -- the maximum acceptable data loss measured in time. An RPO of 1 hour means you can lose up to 1 hour of data. Drives backup frequency: RPO of 15 minutes requires near-continuous replication.
RTO
Recovery Time Objective -- the maximum acceptable downtime before business impact becomes unacceptable. An RTO of 4 hours means systems must be restored within 4 hours of failure. Drives DR site selection and failover architecture.
MTD
Maximum Tolerable Downtime -- the absolute limit beyond which the business suffers irreversible harm: regulatory penalties, permanent customer loss, or insolvency. MTD must always be greater than RTO. If MTD = RTO, there is zero margin for error.
MTBF
Mean Time Between Failures -- the average operational time between system failures. Calculated from historical data. Higher MTBF indicates more reliable systems. Used to estimate how often failures are likely to occur (not to predict any specific failure) and to justify redundancy investments.
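As a worked example, MTBF can be computed as total operating time divided by the number of failures. The function name and the observed hours below are hypothetical.

```python
def mtbf_hours(uptime_hours_between_failures):
    """Average operating hours per failure, from historical uptime intervals."""
    if not uptime_hours_between_failures:
        raise ValueError("need at least one observed failure interval")
    total_uptime = sum(uptime_hours_between_failures)
    return total_uptime / len(uptime_hours_between_failures)


# Hypothetical history: hours of operation before each of four failures.
observed = [700, 1100, 900, 1300]
print(mtbf_hours(observed))  # 1000.0
```

The result is an average, so it supports statements like "expect roughly one failure per 1,000 operating hours" rather than a date for the next outage.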
Why These Numbers Matter
Every DR contract, cloud SLA, and backup policy references RTO and RPO. If your vendor promises a 4-hour RTO but your BIA requires 1 hour, you have a gap. If your backup runs nightly but your RPO is 1 hour, you can lose up to 24 hours of data, 23 hours more than the objective allows. These metrics translate business requirements into technical specifications.
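The gap checks described here can be sketched in a few lines. RecoveryRequirement and find_gaps are hypothetical names, and the inputs mirror the slide's example of a nightly backup and a 4-hour vendor RTO measured against 1-hour objectives.

```python
from dataclasses import dataclass


@dataclass
class RecoveryRequirement:
    rpo_hours: float
    rto_hours: float
    mtd_hours: float


def find_gaps(required, backup_interval_hours, vendor_rto_hours):
    """Compare what the BIA requires against what is actually in place."""
    gaps = []
    # Worst case, the disaster strikes just before the next backup runs.
    if backup_interval_hours > required.rpo_hours:
        excess = backup_interval_hours - required.rpo_hours
        gaps.append(
            f"RPO gap: worst-case data loss is {backup_interval_hours:g}h, "
            f"{excess:g}h beyond the {required.rpo_hours:g}h objective"
        )
    if vendor_rto_hours > required.rto_hours:
        gaps.append(
            f"RTO gap: vendor commits to {vendor_rto_hours:g}h, "
            f"BIA requires {required.rto_hours:g}h"
        )
    if required.mtd_hours <= required.rto_hours:
        gaps.append("MTD gap: MTD does not exceed RTO, so there is no margin for error")
    return gaps


required = RecoveryRequirement(rpo_hours=1, rto_hours=1, mtd_hours=8)
gaps = find_gaps(required, backup_interval_hours=24, vendor_rto_hours=4)
for gap in gaps:
    print(gap)
```

Running the same check against every critical function in the BIA gives a quick list of where contracts and backup schedules fall short of the stated objectives.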
Slide 5 of 14
Risk Assessment for Continuity
Continuity planning starts with understanding what can go wrong. Not all threats are equal -- and the most likely are rarely the most dramatic.
Natural Disasters
Hurricanes, earthquakes, floods, wildfires, tornadoes. Geographic risk varies dramatically -- a data center in Miami faces hurricane risk; one in San Francisco faces seismic risk. BCP must account for regional threat profiles. FEMA flood maps and USGS seismic data inform site selection.
Cyber Incidents
Ransomware, DDoS, data breaches, supply chain compromises. Cyber events are now the most common trigger for DR activation. The 2021 Colonial Pipeline ransomware attack halted a pipeline that carries roughly 45% of East Coast fuel for about six days. Cyber-specific DRP is no longer optional.
Pandemic / Health Crisis
COVID-19 proved that workforce availability is a continuity risk. Organizations that could not support remote work within days faced existential threats. Pandemic BCP covers workforce health, remote access scaling, supply chain alternatives, and extended operational degradation.
Supply Chain Failure
Single-vendor dependency, chip shortages, cloud provider outages, SaaS vendor bankruptcy. The 2021 global chip shortage disrupted manufacturing for 18+ months. BCP must identify vendor dependencies and establish alternate sourcing. The SolarWinds attack proved that trusted suppliers can become attack vectors.
Infrastructure Failure
Power grid outages, ISP failures, cooling system failures, hardware end-of-life. The 2003 Northeast blackout affected 55 million people across eight states and Canada. Redundant power (UPS + generator), diverse ISP connections, and hardware refresh cycles are baseline continuity controls.
Human Factors
Key person dependency, accidental deletion, insider threat, labor actions. If one engineer holds all the passwords, their absence is a continuity event. Cross-training, documentation, and separation of duties reduce human-factor risk. The "bus factor" is a real metric.
Risk Assessment Reality
Organizations consistently underestimate the probability of common events and overestimate rare ones. You are far more likely to face a ransomware attack or ISP outage than an earthquake. Risk assessment must be data-driven, updated annually, and tied to actual incident history -- not worst-case imagination.
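One simple way to tie risk ranking to incident history is an expected annual loss per threat: observed events times cost per event, divided by years of history. This is a sketch under stated assumptions; the ThreatRecord class, the event counts, and the cost figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreatRecord:
    name: str
    events_observed: int   # count in the organization's incident history
    cost_per_event: int    # estimated loss per occurrence


def expected_annual_loss(record, years_of_history):
    """Observed frequency per year multiplied by cost per event."""
    return record.events_observed * record.cost_per_event / years_of_history


YEARS = 5
history = [
    ThreatRecord("earthquake", 0, 5_000_000),
    ThreatRecord("isp_outage", 10, 20_000),
    ThreatRecord("ransomware", 1, 500_000),
]

ranked = sorted(history, key=lambda r: expected_annual_loss(r, YEARS), reverse=True)
for record in ranked:
    print(record.name, expected_annual_loss(record, YEARS))
```

A threat with zero observed events scores zero here, which understates rare regional hazards. In practice the internal history would be blended with external data such as the FEMA and USGS sources mentioned earlier.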
Slide 6 of 14
Disaster Recovery Site Strategies
Where do you go when your primary site is unavailable? Four options, each with different cost, speed, and capability tradeoffs.
DR site comparison (axes: cost and RTO):
HOT SITE | Fully operational, real-time replication | RTO: minutes | $$$$
WARM SITE | Partial hardware, periodic backups | RTO: hours-days | $$$
COLD SITE | Empty facility, power + connectivity | RTO: days-weeks | $$
CLOUD DR | On-demand infra, pay-per-use, multi-region | RTO: min-hours | $-$$$$ (scales with config)
Hot Site
A fully operational duplicate of the primary site with real-time data replication. Staff can switch operations within minutes. Highest cost but lowest RTO. Used by financial institutions, hospitals, and critical infrastructure where any downtime is unacceptable. Typical cost: 6-10x the cold site equivalent.
Warm Site
Partial infrastructure is in place -- hardware exists but may need configuration, software loading, and data restoration from recent backups. RTO ranges from hours to days. The compromise option: significantly cheaper than hot, significantly faster than cold. Most mid-size organizations land here.
Cold Site
An empty facility with power, cooling, and network connectivity but no hardware or data. Everything must be procured, installed, and configured after activation. RTO of days to weeks. Lowest cost but longest recovery. Suitable only for non-critical functions with high MTD tolerances.
Cloud-Based DR
Infrastructure spun up on demand in AWS, Azure, or GCP. Eliminates physical site management. Pay-per-use model means you only pay full cost during an actual disaster. Multi-region replication can achieve hot-site RTOs at warm-site costs. The modern default for most organizations, but requires careful planning for egress costs, bandwidth, and vendor lock-in.
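The cost-versus-RTO tradeoff can be sketched as "pick the cheapest tier that still meets the required RTO". The worst-case hours and cost ranks below are assumptions loosely derived from the ranges and dollar-sign scale on this slide. Cloud DR is omitted because its cost and RTO depend on configuration.

```python
# Worst-case RTO in hours and a relative cost rank for each physical tier.
# Both columns are illustrative assumptions, not vendor figures.
SITE_TIERS = [
    {"name": "cold", "worst_case_rto_hours": 336, "cost_rank": 2},
    {"name": "warm", "worst_case_rto_hours": 48, "cost_rank": 3},
    {"name": "hot", "worst_case_rto_hours": 1, "cost_rank": 4},
]


def cheapest_site_for(required_rto_hours):
    """Return the lowest-cost tier whose worst-case RTO meets the requirement."""
    candidates = [
        tier for tier in SITE_TIERS
        if tier["worst_case_rto_hours"] <= required_rto_hours
    ]
    if not candidates:
        return None  # no tier is fast enough; the requirement needs rethinking
    return min(candidates, key=lambda tier: tier["cost_rank"])["name"]


print(cheapest_site_for(4))     # hot
print(cheapest_site_for(72))    # warm
print(cheapest_site_for(1000))  # cold
```

The RTO fed into a function like this should come from the BIA, so the site decision follows from the business requirement rather than from budget alone.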
Slide 7 of 14
Backup Strategies
Backups are the last line of defense. If everything else fails, backups determine whether you recover or start over.
The 3-2-1 rule: 3 copies of your data, on 2 different media types (for example SSD and tape), with 1 copy stored offsite (for example cloud).
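A 3-2-1 check is easy to automate against a backup inventory. In this sketch the BackupCopy class and the inventory entries are hypothetical, and every listed copy counts toward the three, including the production copy if the caller includes it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    label: str
    media: str      # e.g. "ssd", "tape", "object_storage"
    offsite: bool


def satisfies_3_2_1(copies):
    """True when there are 3+ copies, 2+ media types, and 1+ offsite copy."""
    enough_copies = len(copies) >= 3
    enough_media = len({copy.media for copy in copies}) >= 2
    has_offsite = any(copy.offsite for copy in copies)
    return enough_copies and enough_media and has_offsite


copies = [
    BackupCopy("production volume", "ssd", offsite=False),
    BackupCopy("weekly tape", "tape", offsite=False),
    BackupCopy("cloud replica", "object_storage", offsite=True),
]
print(satisfies_3_2_1(copies))      # True
print(satisfies_3_2_1(copies[:2]))  # False: two copies, none offsite
```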
Full Backup
Complete copy of all data every time. Simple to restore but consumes the most storage and takes the longest to run. Typically done weekly or monthly, with incremental or differential backups between cycles.
Incremental Backup
Only backs up data changed since the last backup of any type. Fastest and smallest. But restoration requires the last full backup plus every incremental since then, in order. Failure of any one incremental breaks the chain.
Differential Backup
Backs up all data changed since the last full backup. Grows larger each day but restoration only requires the last full plus the latest differential. Balances speed and restore simplicity between full and incremental.
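The restore rules for the three backup types can be sketched as a walk backward from the newest backup. The restore_chain function and the backup labels are hypothetical. The input is a list ordered oldest to newest.

```python
VALID_KINDS = {"full", "incremental", "differential"}


def restore_chain(backups):
    """Return the labels needed to restore the newest backup, oldest first.

    backups: list of (label, kind) tuples ordered oldest to newest.
    """
    chain = []
    index = len(backups) - 1
    while index >= 0:
        label, kind = backups[index]
        if kind not in VALID_KINDS:
            raise ValueError(f"unknown backup kind: {kind}")
        chain.append(label)
        if kind == "full":
            return list(reversed(chain))
        if kind == "differential":
            # A differential only needs the most recent full before it.
            index -= 1
            while index >= 0 and backups[index][1] != "full":
                index -= 1
        else:
            # An incremental needs whatever backup came immediately before it.
            index -= 1
    raise ValueError("no full backup found; the chain cannot be restored")


incremental_week = [
    ("sun-full", "full"),
    ("mon-inc", "incremental"),
    ("tue-inc", "incremental"),
    ("wed-inc", "incremental"),
]
differential_week = [
    ("sun-full", "full"),
    ("mon-diff", "differential"),
    ("tue-diff", "differential"),
    ("wed-diff", "differential"),
]

print(restore_chain(incremental_week))
print(restore_chain(differential_week))
```

The incremental week needs all four backups to restore Wednesday, while the differential week needs only two. That is the tradeoff between backup size and restore simplicity in concrete form.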
Immutable Backups
Ransomware operators specifically target backups for encryption or deletion. Immutable backups cannot be modified or deleted for a defined retention period -- even by administrators. Cloud providers offer object lock (AWS S3 Object Lock, immutable storage for Azure Blob Storage). Air-gapped tape remains the gold standard because offline media cannot be reached from a compromised network. If your backups can be encrypted by the same ransomware that hit production, they are not backups -- they are liabilities.
Slide 8 of 14
Incident Response Lifecycle
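Six phases, each feeding the next. This six-step breakdown follows the widely taught SANS-style model; NIST SP 800-61 Rev. 2 groups the same activities into four phases (Preparation; Detection and Analysis; Containment, Eradication, and Recovery; Post-Incident Activity). Skipping a phase invites repeat incidents.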
01 Preparation: Policies, tools, team training, playbooks
02 Detection: SIEM alerts, anomaly detection, user reports
03 Containment: Isolate affected systems, limit blast radius
04 Eradication: Remove threat actor, patch vulnerabilities
05 Recovery: Restore systems, verify integrity, monitor
06 Lessons Learned: Post-incident review, update playbooks
The Lessons Learned Gap
Many organizations complete phases 1-5 but skip phase 6. Without a formal post-incident review, the same root causes produce the same incidents. The Lessons Learned phase feeds back into Preparation -- it is not the end of the cycle but the beginning of the next one. NIST SP 800-61 builds this feedback loop into its lifecycle for a reason.
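The "loop, not a line" idea can be shown with a tiny state model of the six phases on this slide. The Phase enum and next_phase helper are hypothetical names used only for illustration.

```python
from enum import Enum


class Phase(Enum):
    PREPARATION = 1
    DETECTION = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    LESSONS_LEARNED = 6


def next_phase(phase):
    """Advance one phase, wrapping Lessons Learned back to Preparation."""
    phases = list(Phase)
    return phases[(phases.index(phase) + 1) % len(phases)]


print(next_phase(Phase.RECOVERY))         # Phase.LESSONS_LEARNED
print(next_phase(Phase.LESSONS_LEARNED))  # Phase.PREPARATION
```

An incident tracker that closes tickets at Recovery never reaches the wraparound step, so the findings never make it back into Preparation.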
Slide 9 of 14
Crisis Communication Plans
When a disaster occurs, who talks to whom, when, and through what channel? Poorly managed communication turns a technical incident into a public relations disaster.
Internal Communication
Incident response team activation, executive notification chains, employee updates. Define primary and backup communication channels -- if email is down, what do you use? Pre-draft templates for common scenarios. Establish a cadence: initial alert within 15 minutes, situation updates every 2 hours.
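The cadence above (initial alert within 15 minutes, updates every 2 hours) can be turned into a concrete schedule the moment an incident is declared. This is a sketch. The function name is hypothetical, and it assumes the update clock starts at the initial alert deadline.

```python
from datetime import datetime, timedelta

INITIAL_ALERT_WITHIN = timedelta(minutes=15)
UPDATE_EVERY = timedelta(hours=2)


def communication_schedule(detected_at, update_count):
    """Return the alert deadline and the times of the following updates."""
    alert_deadline = detected_at + INITIAL_ALERT_WITHIN
    updates = [
        alert_deadline + UPDATE_EVERY * n
        for n in range(1, update_count + 1)
    ]
    return alert_deadline, updates


# Hypothetical incident detected at 2:00 AM.
alert_by, updates = communication_schedule(datetime(2024, 1, 1, 2, 0), 3)
print("Initial alert by:", alert_by)
for update in updates:
    print("Situation update at:", update)
```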
Stakeholder Notification
Customers, partners, vendors, board of directors. Each stakeholder group needs different information at different times. Customers need to know if their data is affected. Vendors need to know if SLAs are at risk. The board needs to understand financial exposure. Over-communicating is always better than silence.
Regulatory Reporting
GDPR requires notifying the supervisory authority within 72 hours of becoming aware of a personal data breach. HIPAA requires notifying affected individuals without unreasonable delay and no later than 60 days after discovery. State breach notification laws vary -- some require notification within 30 days. SEC rules require material cybersecurity incidents to be disclosed on Form 8-K within 4 business days of determining that the incident is material. Missing a deadline creates legal liability on top of the original incident.
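Because each clock runs differently, it helps to compute the deadlines as soon as an incident is confirmed. The sketch below makes several assumptions: the function names are hypothetical, business days skip only weekends (no holiday calendar), the HIPAA clock uses calendar days from discovery, and the SEC clock starts when materiality is determined. It is an illustration, not legal advice.

```python
from datetime import date, datetime, timedelta


def add_business_days(start, days):
    """Step forward skipping Saturdays and Sundays (holidays are ignored)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def notification_deadlines(aware_at, materiality_determined_on):
    """Latest notification times under the three regimes named on the slide."""
    return {
        "GDPR (72 hours)": aware_at + timedelta(hours=72),
        "HIPAA (60 days)": aware_at.date() + timedelta(days=60),
        "SEC 8-K (4 business days)": add_business_days(materiality_determined_on, 4),
    }


# Hypothetical incident: discovered and judged material on Friday 2024-03-01.
deadlines = notification_deadlines(datetime(2024, 3, 1, 9, 0), date(2024, 3, 1))
for regime, deadline in deadlines.items():
    print(regime, deadline)
```

With a Friday discovery, the 72-hour GDPR window runs straight through the weekend, while the SEC window does not start counting until Monday.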
Media / Public Relations
Designate a single spokesperson. Pre-approve messaging templates with legal counsel. Never speculate about root cause, scope, or timeline in public statements. Acknowledge the incident, describe actions being taken, and provide a timeline for updates. Equifax's widely criticized breach disclosure in 2017 compounded the reputational damage of the breach itself.
Law Enforcement
FBI (IC3), CISA, and local law enforcement for criminal cyber incidents. Preserve evidence before contacting -- chain of custody matters. File reports through official channels: FBI's Internet Crime Complaint Center for cyber incidents, local police for physical incidents. Cooperation with law enforcement may be legally required for certain incident types.
The Communication Paradox
During a crisis, communication bandwidth is lowest precisely when communication needs are highest. Pre-built contact lists, message templates, escalation trees, and channel redundancy must exist before the incident. If you are figuring out who to call during the disaster, you have already failed.
Slide 10 of 14
Testing and Exercises
An untested plan is not a plan -- it is a document. Testing reveals gaps, builds muscle memory, and validates assumptions before a real disaster does.
Test types, from lowest to highest cost and realism:
TABLETOP EXERCISE | Discussion-based scenario walkthrough
STRUCTURED WALKTHROUGH | Step-by-step plan review with teams
FUNCTIONAL SIMULATION | Simulated disaster, real procedures
FULL INTERRUPTION | Actual failover
Tabletop Exercise
A facilitated discussion where team members walk through a hypothetical scenario. No systems are touched. Low cost, easy to schedule, and effective at identifying communication gaps, role confusion, and plan deficiencies. Should be conducted at least quarterly. Common scenarios: ransomware attack, cloud provider outage, insider data theft.
Structured Walkthrough
Team members step through the actual plan document, procedure by procedure, verifying that each step is accurate, contact information is current, and dependencies are documented. More structured than a tabletop but still discussion-based. Often reveals outdated procedures and missing runbooks.
Functional Simulation
A realistic drill where teams execute recovery procedures in a test environment. Systems are actually failed over, backups are restored, communication trees are activated. Tests both the plan and the people. Reveals timing gaps -- the plan says "restore from backup" but how long does that actually take?
Full Interruption Test
The production system is actually taken offline and recovery procedures are executed for real. The most realistic test but the highest risk -- if recovery fails, you have a real outage. Reserved for mature organizations with high confidence in their plans. Financial regulators (FFIEC) may require full interruption tests annually.
Slide 11 of 14
Pandemic Lessons
COVID-19 exposed flawed assumptions at the heart of business continuity planning. The organizations that survived had plans that accounted for workforce disruption at scale.
Cascading impact, March 2020 to 2021+:
WAVE 1 | Workforce Disruption | Lockdowns, illness, childcare
WAVE 2 | Remote Access Demands | VPN overload, laptop shortage, SaaS scaling
WAVE 3 | Supply Chain Breakdown | Vendor failures, chip shortage, shipping delays
SUSTAINED | Degraded Ops | 18+ months, new normal
VPN and Remote Access
Organizations designed for 10% remote work suddenly needed 100%. VPN concentrators saturated. Split-tunnel vs full-tunnel decisions had security implications. Zero Trust architectures proved more resilient than VPN-dependent models. Lesson: design remote access for surge capacity, not steady state.
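The surge arithmetic is simple but worth writing down. The headcount and function names below are hypothetical; the 10% and 100% figures come from the slide.

```python
import math


def sessions_needed(headcount, remote_percent):
    """Concurrent VPN sessions required for a given share of remote staff."""
    return math.ceil(headcount * remote_percent / 100)


def shortfall(headcount, remote_percent, concentrator_capacity):
    """Sessions that cannot be served at the given remote share."""
    return max(0, sessions_needed(headcount, remote_percent) - concentrator_capacity)


HEADCOUNT = 2000
capacity = sessions_needed(HEADCOUNT, 10)    # sized for 10% remote work

print(shortfall(HEADCOUNT, 10, capacity))    # 0 at steady state
print(shortfall(HEADCOUNT, 100, capacity))   # 1800 when everyone is remote
```

A concentrator sized for steady state leaves nine out of ten employees unable to connect in a full-remote surge, which is why capacity planning has to start from the surge figure.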
Supply Chain Resilience
Just-in-time inventory models failed when global shipping stopped. Hardware refresh cycles extended by 6-12 months. Single-source vendors became single points of failure. Lesson: BCP must include vendor diversification, safety stock for critical components, and alternate sourcing agreements.
The Assumption That Broke
Pre-COVID BCPs assumed disasters were localized and temporary -- a hurricane hits one site, you fail over to another. COVID was global and sustained. Every site was affected simultaneously. Every vendor was affected simultaneously. The lesson: BCP must account for scenarios where there is no unaffected site to fail over to.
Slide 12 of 14
Regulatory Requirements
Business continuity is not optional in regulated industries. Non-compliance carries penalties that can exceed the cost of the disaster itself.
Framework | Sector | BCP/DR Requirements
FFIEC | Financial | Comprehensive BCP required for all FDIC-insured institutions. Must include BIA, risk assessment, testing, and board-approved plans. Annual testing expected, which may include full interruption tests. Examiners review BCP during every safety and soundness exam.
HIPAA | Healthcare | Contingency Plan standard (45 CFR 164.308(a)(7)) requires data backup, disaster recovery, and emergency mode operation plans. Must maintain retrievable exact copies of ePHI and procedures for restoring lost data. Testing and revision procedures are an addressable (not required) implementation specification.
SOX | Public Companies | Section 404 requires internal controls over financial reporting, which includes IT continuity. Auditors assess whether financial systems can survive disruption. Failure to demonstrate adequate DR controls can result in material weakness findings.
GDPR Art 32 | EU Data | Requires "the ability to restore the availability and access to personal data in a timely manner in the event of a physical or technical incident." Also requires regular testing of security measures. Infringements of Article 32 can draw fines of up to 10 million euros or 2% of global annual turnover, whichever is higher; GDPR's top tier for other violations reaches 20 million euros or 4%.
NIST SP 800-34 | Federal | Contingency Planning Guide for Federal Information Systems. Defines the seven-step process: policy, BIA, preventive controls, recovery strategies, plan development, testing, and maintenance. Required for all federal systems under FISMA.
ISO 22301 | International | The international standard for business continuity management systems (BCMS). Certifiable standard that provides a framework for planning, establishing, implementing, operating, monitoring, reviewing, and maintaining a BCMS.
Compliance is the Floor, Not the Ceiling
Meeting regulatory minimums does not mean your organization will survive a disaster. Regulations define the baseline -- what you must do to avoid penalties. Effective BCP/DR goes beyond compliance to address your specific risk profile, operational dependencies, and recovery capabilities. The goal is resilience, not checkboxes.
Slide 13 of 14
Key Takeaways
The principles that separate organizations that recover from those that do not.
1 BCP keeps the business running during a disaster. DRP restores the technology. Both are required -- neither is sufficient alone. DRP is a subset of BCP.
2 The Business Impact Analysis (BIA) is the foundation. Without it, recovery priorities are driven by politics instead of data. Every continuity decision flows from the BIA.
3 RPO defines acceptable data loss. RTO defines acceptable downtime. MTD is the absolute limit. These metrics translate business requirements into technical specifications for every backup and DR contract.
4 DR site selection (hot, warm, cold, cloud) is a cost-vs-RTO tradeoff. Cloud-based DR is the modern default, offering hot-site capabilities at warm-site costs -- but requires careful planning for egress and vendor lock-in.
5 The 3-2-1 backup rule is the minimum standard. Immutable backups are essential in the ransomware era -- if your backups can be encrypted by the same attack, they are not backups.
6 The incident response lifecycle presented here has six phases (NIST SP 800-61 Rev. 2 groups the same work into four). Skipping Lessons Learned, as many organizations do, invites repeat incidents. The cycle is a loop, not a line.
7 An untested plan is not a plan. Testing ranges from tabletop exercises (low cost) to full interruption tests (high realism). Start with tabletops quarterly and work up.
8 COVID-19 broke the assumption that disasters are localized and temporary. Modern BCP must account for global, sustained disruption with no unaffected failover site.
What Comes Next
These concepts are not theoretical -- they are operational requirements tested by every major incident. When you write cybersecurity policy, every control you recommend either supports or undermines your organization's ability to continue operating and recover from disruption. BCP and DRP are the bridge between policy on paper and survival in practice.