FINAL CAPSTONE PROJECT

FAILSAFE

Disaster Recovery Operation

A production domain controller has crashed. Critical services are down. Users cannot authenticate. The clock is ticking. Your mission: execute a complete disaster recovery operation and restore full functionality.

CRITICAL INCIDENT: Production DC failure detected. Immediate response required.

Incident Report

Incident ID: INC-2026-0131-001
Priority: P1 - Critical
Status: Active - Awaiting Recovery

Summary: At 02:47 AM, monitoring detected that DC02.hexworth.local became unresponsive. Initial diagnosis indicates storage subsystem failure resulting in OS corruption. The server is currently offline. DC01 is operational but showing replication warnings. User authentication is degraded, and several services report connectivity issues.

02:47 AM  DC02 stopped responding to ping and authentication requests
02:48 AM  Monitoring alerts triggered - Multiple service failures detected
02:52 AM  On-call engineer notified - Initial assessment: Storage failure
03:15 AM  Hardware team confirms: RAID controller failure, OS partition corrupted
03:30 AM  Replacement server provisioned - You are now activated for recovery
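
The incident documentation later asks for a timeline of events, so it may help to express the entries above as offsets from the first alert. The following is a minimal sketch, assuming all five entries fall on the same night; the function name `minutes_since_start` and the shortened event texts are illustrative, not part of the course material.

```python
from datetime import datetime

# Timeline entries copied from the incident report (event text shortened).
TIMELINE = [
    ("02:47 AM", "DC02 stopped responding"),
    ("02:48 AM", "Monitoring alerts triggered"),
    ("02:52 AM", "On-call engineer notified"),
    ("03:15 AM", "Hardware team confirms RAID controller failure"),
    ("03:30 AM", "Replacement server provisioned"),
]


def minutes_since_start(timeline):
    """Return (minutes elapsed since the first entry, event) pairs."""
    # Assumption: every entry is on the same night, so no date rollover is handled.
    times = [datetime.strptime(stamp, "%I:%M %p") for stamp, _ in timeline]
    start = times[0]
    return [
        (int((t - start).total_seconds() // 60), event)
        for t, (_, event) in zip(times, timeline)
    ]


offsets = minutes_since_start(TIMELINE)
for minutes, event in offsets:
    print(f"T+{minutes:>2} min  {event}")
```

Read this way, recovery begins 43 minutes after DC02 first stopped responding.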

Mission Objectives

Phase 0: On-Call Prologue

Respond to a critical alert from home

  • Acknowledge the incoming alert
  • Ping DC01 and DC02 to assess connectivity
  • Attempt a remote session to DC02
  • Determine that on-site response is required
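
The Phase 0 steps end in a decision, which can be sketched as a small rule over the three observations. This is an illustrative rule only, not the course's grading logic: the function name `needs_onsite_response` and the return labels are hypothetical, and the inputs are assumed to come from whatever ping and remote-session checks you run (for example `Test-Connection` and `Enter-PSSession` in PowerShell).

```python
def needs_onsite_response(dc01_reachable: bool,
                          dc02_reachable: bool,
                          dc02_remote_session_ok: bool) -> str:
    """Map Phase 0 observations to a next step (illustrative rule, not course policy)."""
    if dc02_reachable and dc02_remote_session_ok:
        # DC02 can still be managed remotely.
        return "remote"
    if not dc01_reachable and not dc02_reachable:
        # Both DCs silent: suspect the path from home (VPN, WAN) before the servers.
        return "check-network"
    # DC02 is unreachable or unmanageable, so someone has to go on site.
    return "on-site"


# Observations matching the incident report: DC01 answers, DC02 does not.
decision = needs_onsite_response(
    dc01_reachable=True,
    dc02_reachable=False,
    dc02_remote_session_ok=False,
)
print(decision)
```

Separating the "both unreachable" case is a design choice: it avoids driving to the site when the real problem is the responder's own connection.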

Phase 1: Assessment

Evaluate current environment state

  • Verify DC01 is healthy and operational
  • Check AD replication status
  • Identify affected services and users
  • Locate and verify backup availability
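
Verifying backup availability usually comes down to a selection rule with a rationale. The sketch below picks the newest backup that finished before the failure and is not older than the tombstone lifetime. Everything in it is an assumption for illustration: the backup catalog is hypothetical, the failure date is inferred from the incident ID INC-2026-0131-001, and 180 days is a common default tombstone lifetime that should be confirmed for the actual forest.

```python
from datetime import datetime, timedelta

# Assumed from the incident ID (2026-01-31) and the 02:47 AM alert.
FAILURE_TIME = datetime(2026, 1, 31, 2, 47)
# Common default; confirm the real value in your forest before relying on it.
TOMBSTONE_LIFETIME = timedelta(days=180)

# Hypothetical catalog of (label, completion time). Not real course data.
BACKUPS = [
    ("weekly-full", datetime(2026, 1, 25, 1, 0)),
    ("nightly", datetime(2026, 1, 30, 1, 0)),
    ("running-at-failure", datetime(2026, 1, 31, 3, 0)),
    ("archive", datetime(2025, 6, 1, 1, 0)),
]


def pick_backup(backups, failure_time, max_age):
    """Return the newest backup completed before the failure and within max_age."""
    usable = [
        (label, done)
        for label, done in backups
        if done < failure_time and failure_time - done <= max_age
    ]
    # max() on completion time selects the most recent usable backup.
    return max(usable, key=lambda item: item[1]) if usable else None


chosen = pick_backup(BACKUPS, FAILURE_TIME, TOMBSTONE_LIFETIME)
print(chosen[0] if chosen else "no usable backup")
```

With this sample catalog, the backup that completed after the failure and the archive older than the tombstone lifetime are both rejected, which is the kind of rationale the grading criteria ask for.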

Phase 2: Recovery

Execute disaster recovery procedures

  • Deploy replacement server (DC02-NEW)
  • Restore from Windows Server Backup
  • Perform authoritative restore if needed
  • Seize FSMO roles if DC02 held them
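
Whether a seizure is needed depends on which of the five FSMO roles the failed DC held. As a minimal sketch, the function below takes a role-to-holder mapping (such as one transcribed from `netdom query fsmo` output on DC01) and lists the roles stranded on the failed server. The holder assignments shown are hypothetical; the scenario does not state which roles DC02 actually held.

```python
# The five FSMO roles: two forest-wide, three per domain.
FSMO_ROLES = (
    "Schema Master",
    "Domain Naming Master",
    "PDC Emulator",
    "RID Master",
    "Infrastructure Master",
)


def roles_to_seize(holders: dict, failed_dc: str) -> list:
    """Return the FSMO roles currently assigned to the failed DC."""
    missing = set(FSMO_ROLES) - set(holders)
    if missing:
        raise ValueError(f"no holder recorded for: {sorted(missing)}")
    # Host names are compared case-insensitively.
    return [role for role in FSMO_ROLES if holders[role].lower() == failed_dc.lower()]


# Hypothetical assignment for illustration only.
holders = {
    "Schema Master": "DC01.hexworth.local",
    "Domain Naming Master": "DC01.hexworth.local",
    "PDC Emulator": "DC02.hexworth.local",
    "RID Master": "DC02.hexworth.local",
    "Infrastructure Master": "DC02.hexworth.local",
}

stranded = roles_to_seize(holders, "dc02.hexworth.local")
print(stranded)
```

An empty result means no seizure is required; a non-empty one is the list to move to a surviving DC.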

Phase 3: Verification

Validate recovery success

  • Test AD replication between DCs
  • Verify DNS resolution
  • Confirm user authentication
  • Test dependent services (DHCP, etc.)
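
DNS verification for a domain controller is mostly about the SRV records that clients use to locate it. The sketch below builds the record names to query for a given domain; `hexworth.local` comes from the incident report, while the function name and the choice of these four records are illustrative rather than an exhaustive checklist.

```python
def dc_locator_srv_names(domain: str) -> list:
    """Build the SRV record names commonly checked when validating DC registration."""
    domain = domain.strip(".").lower()
    return [
        f"_ldap._tcp.{domain}",               # LDAP servers in the domain
        f"_kerberos._tcp.{domain}",           # Kerberos KDCs in the domain
        f"_ldap._tcp.dc._msdcs.{domain}",     # domain controllers specifically
        f"_kerberos._tcp.dc._msdcs.{domain}", # KDCs that are domain controllers
    ]


srv_names = dc_locator_srv_names("hexworth.local")
for name in srv_names:
    print(name)
```

Each name can then be queried with a tool such as `nslookup -type=SRV` to confirm that both DC01 and the replacement server appear in the answers.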

Phase 4: Documentation

Complete incident documentation

  • Document timeline of events
  • Record all recovery actions taken
  • Identify root cause
  • Recommend preventive measures

Phase 5: Cleanup

Remove orphaned objects and finalize

  • Remove old DC02 from AD Sites
  • Clean up DNS records
  • Perform metadata cleanup if needed
  • Update monitoring systems

Phase 6: Prevention

Implement preventive measures

  • Configure improved backup schedule
  • Set up storage health monitoring
  • Create runbook for future incidents
  • Schedule DR drill

Skills Assessment

This capstone tests your mastery of the complete WSA curriculum:

  • M02: Active Directory
  • M03: Storage Management
  • M07: Monitoring
  • M08: DNS
  • M10: Group Policy
  • M18: PowerShell Automation
  • M19: Troubleshooting
  • Disaster Recovery
  • FSMO Roles
  • AD Replication

Deliverables

Grading Criteria

Phase                    | Objectives | Requirements
-------------------------+------------+-------------
On-Call Prologue         | 3          | Acknowledge alert, verify DC01 connectivity, confirm DC02 is down, attempt remote session
Assessment               | 6          | Run diagnostics, identify FSMO holders, assess all affected services
Backup & Deployment      | 5          | Select correct backup with valid rationale, configure replacement server with correct IP/DNS
AD / DNS / DHCP Recovery | 17         | Seize FSMO roles, promote new DC, restore DNS zones, configure DHCP failover
Sites & Verification     | 11         | Update AD Sites, verify GPO/SYSVOL replication, confirm all services healthy
Documentation            | 5          | Incident timeline, recovery actions, root cause with failure type identified, preventive measures

Critical Success Factors

All 47 objectives across 9 phases must be completed to pass. There is no partial credit: this simulates real-world expectations, where an incomplete recovery is not acceptable. Your elapsed time is recorded but not scored.
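
The all-or-nothing rule can be checked against the grading table with a few lines of arithmetic. This is a sketch of the stated rule only; the dictionary mirrors the objective counts in the table, and the function name `passed` is illustrative.

```python
# Objective counts per row of the grading criteria table.
OBJECTIVES = {
    "On-Call Prologue": 3,
    "Assessment": 6,
    "Backup & Deployment": 5,
    "AD / DNS / DHCP Recovery": 17,
    "Sites & Verification": 11,
    "Documentation": 5,
}


def passed(completed: dict) -> bool:
    """Pass only if every objective in every row is complete (no partial credit)."""
    return all(completed.get(phase, 0) >= count for phase, count in OBJECTIVES.items())


total = sum(OBJECTIVES.values())
print(total)  # 3 + 6 + 5 + 17 + 11 + 5 = 47

# Missing a single objective anywhere fails the whole capstone.
almost = dict(OBJECTIVES, Documentation=4)
print(passed(OBJECTIVES), passed(almost))
```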

Time Consideration

There is no strict time limit, but in a real disaster recovery scenario every minute of downtime impacts the business. Work efficiently but carefully: a hasty recovery that causes additional issues is worse than a slower, methodical one.

Ready to Execute Recovery?

The incident is active. Users are waiting. Systems are down. Your expertise is needed now.