SOC Operations

Security Operations Center Analyst Workflows

Eye House - Detection & Visibility

What is a Security Operations Center (SOC)?

A Security Operations Center (SOC) is a centralized unit that deals with security issues on an organizational and technical level. The SOC is the first line of defense against cyber threats, providing continuous monitoring, detection, analysis, and response to security incidents.

Core SOC Mission

Protect the organization's assets, data, and reputation through proactive monitoring and rapid incident response.

Primary Functions

Key SOC Technologies

SIEM

Security Information & Event Management

Central log aggregation, correlation, and alerting (Splunk, QRadar, Sentinel)

EDR/XDR

Endpoint Detection & Response

Advanced endpoint monitoring and threat hunting (CrowdStrike, SentinelOne)

IDS/IPS

Intrusion Detection/Prevention

Network-based threat detection (Snort, Suricata, Palo Alto)

SOAR

Security Orchestration & Automation

Automated playbooks and response workflows (Splunk SOAR, formerly Phantom; Cortex XSOAR)

SOC Value Proposition

A well-functioning SOC reduces Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR), minimizing the impact of security incidents and protecting business operations.

SOC Organizational Models

Organizations can structure their SOC in various ways depending on resources, expertise, and business requirements. Each model has distinct advantages and challenges.

In-House SOC

Fully owned and operated by the organization with dedicated staff and infrastructure.

✓ Pros:

  • Full control and customization
  • Deep organizational knowledge
  • Immediate access to systems
  • Better data privacy

✗ Cons:

  • High operational costs
  • Recruitment challenges
  • 24/7 staffing requirements
  • Technology investment

Managed SOC (MSSP)

Outsourced to a Managed Security Service Provider who provides monitoring and response services.

✓ Pros:

  • Lower upfront costs
  • Access to expertise
  • 24/7 coverage included
  • Faster deployment

✗ Cons:

  • Less control
  • Limited customization
  • Data sharing concerns
  • Dependency on vendor

Hybrid SOC

Combination of in-house and outsourced capabilities, leveraging strengths of both approaches.

✓ Pros:

  • Balanced control and cost
  • Flexible scalability
  • Shared expertise
  • Risk distribution

✗ Cons:

  • Complex coordination
  • Integration challenges
  • Unclear responsibilities
  • Communication overhead

Virtual SOC

Distributed team working remotely with cloud-based tools and infrastructure.

✓ Pros:

  • Global talent access
  • Lower facility costs
  • Flexible workforce
  • Cloud-native tools

✗ Cons:

  • Communication challenges
  • Time zone coordination
  • Remote security risks
  • Team cohesion

Choosing the Right Model

Factors to Consider:

  • Budget: Available resources for staff, tools, and infrastructure
  • Expertise: Internal security talent and hiring capacity
  • Compliance: Regulatory requirements for data handling
  • Scale: Organization size and complexity
  • Risk Tolerance: Acceptable levels of outsourcing and control

Common Pitfall

Many organizations underestimate the total cost of ownership for an in-house SOC. Beyond tools and salaries, consider training, retention, facility costs, and the challenge of maintaining 24/7 coverage.

SOC Roles and Tier Structure

Most SOCs operate with a tiered structure where analysts are organized by skill level and responsibility. This creates clear escalation paths and ensures appropriate expertise is applied to each incident.

Tier 1 - Alert Analyst

Front Line Defense

Primary Responsibilities:

  • Monitor SIEM and security tool alerts
  • Perform initial alert triage
  • Classify alerts (TP/FP/BTP)
  • Document findings in tickets
  • Escalate confirmed threats
  • Follow established runbooks
  • Basic log analysis

Skills Required: Basic security concepts, log analysis, ticketing systems, communication

Typical Experience: Entry-level to 2 years

Tier 2 - Incident Responder

Deep Investigation

Primary Responsibilities:

  • In-depth incident investigation
  • Threat correlation and analysis
  • Malware analysis (basic)
  • Incident containment actions
  • Create incident reports
  • Develop detection rules
  • Mentor Tier 1 analysts

Skills Required: Network forensics, threat analysis, scripting, incident handling

Typical Experience: 2-5 years

Tier 3 - Threat Hunter

Advanced Operations

Primary Responsibilities:

  • Proactive threat hunting
  • Advanced malware analysis
  • Security tool engineering
  • Detection engineering
  • Complex incident response
  • Threat intelligence research
  • Architecture recommendations

Skills Required: Advanced forensics, reverse engineering, threat intelligence, automation

Typical Experience: 5+ years

SOC Manager

Leadership & Strategy

Primary Responsibilities:

  • SOC team management
  • Shift scheduling and coverage
  • Metrics and KPI reporting
  • Process improvement
  • Budget and resource planning
  • Stakeholder communication
  • Training and development

Skills Required: Leadership, communication, metrics analysis, project management

Typical Experience: 7+ years

Threat Intelligence Analyst

Intel Operations

Primary Responsibilities:

  • Collect and analyze threat data
  • Monitor threat actor activity
  • Produce intelligence reports
  • Feed IOCs to detection systems
  • Track emerging threats
  • Industry information sharing

Skills Required: Threat landscape knowledge, research, analysis, communication

Typical Experience: 3+ years

Incident Commander

Crisis Management

Primary Responsibilities:

  • Lead major incident response
  • Coordinate response teams
  • Communicate with executives
  • Make critical decisions
  • Post-incident review
  • Lessons learned documentation

Skills Required: Incident response, leadership, decision-making, communication

Typical Experience: 5+ years

Career Progression in the SOC

A typical career path: Tier 1 Analyst → Tier 2 Incident Responder → Tier 3 Threat Hunter → SOC Manager / Architect / CISO

Lateral moves are also common: Threat Intelligence, Security Engineering, Penetration Testing, GRC

Alert Lifecycle and Workflow

Understanding the complete lifecycle of a security alert is fundamental to SOC operations. Every alert follows a structured path from initial detection through final resolution.

Standard Alert Lifecycle

1. Detection

Alert generated by SIEM, EDR, IDS, or other security tool

2. Triage

Initial assessment: Is this real? How severe?

3. Investigation

Gather context, analyze logs, determine scope

4. Escalation

Route to appropriate tier or external team

5. Response

Contain, eradicate, recover

6. Resolution

Close ticket, document lessons learned

Detailed Phase Breakdown

Phase 1: Detection

Alert Sources:

  • SIEM correlation rules (Splunk, QRadar, Sentinel)
  • EDR/XDR behavioral detections (CrowdStrike, Carbon Black)
  • Network IDS/IPS signatures (Snort, Suricata)
  • Email security gateways (Proofpoint, Mimecast)
  • Vulnerability scanners (Qualys, Nessus)
  • Threat intelligence feeds (MISP, ThreatConnect)
  • User reports (phishing, suspicious activity)

Alert Metadata: Timestamp, source IP, destination IP, user, device, signature/rule, severity, confidence score
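
The metadata fields listed above can travel together as one small record. The following is a minimal sketch in Python, not a vendor schema: the field names, the 0-100 confidence scale, and the destination address are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Alert:
    """One security alert; field names are illustrative, not a vendor schema."""
    timestamp: datetime      # when the tool raised the alert (UTC)
    source_ip: str
    destination_ip: str
    user: str
    device: str
    rule: str                # signature or correlation rule that fired
    severity: str            # e.g. "LOW", "MEDIUM", "HIGH", "CRITICAL"
    confidence: int          # assumed 0-100 scale

    def __post_init__(self) -> None:
        # Reject values outside the assumed confidence scale.
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")


example_alert = Alert(
    timestamp=datetime(2025, 12, 21, 14, 32, tzinfo=timezone.utc),
    source_ip="10.20.30.42",
    destination_ip="203.0.113.10",   # documentation-range address, made up
    user="jdoe",
    device="FIN-WS-042",
    rule="Suspicious PowerShell Activity",
    severity="HIGH",
    confidence=90,                   # made-up value
)
```

Keeping the record immutable (frozen) is a design choice here: triage notes can reference the alert as it was raised, while enrichment is stored separately.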

Phase 2: Triage (Critical Tier 1 Function)

Classification             | Definition                                | Action
True Positive (TP)         | Real security incident requiring response | Escalate immediately, initiate IR
False Positive (FP)        | Benign activity incorrectly flagged       | Close, tune detection rule
Benign True Positive (BTP) | Suspicious but authorized activity        | Document, add to whitelist

Phase 3: Investigation

Key Investigation Questions:

  • What happened? Describe the event in plain language
  • When? Establish timeline of activity
  • Where? Which systems, networks, or users are affected?
  • Who? User accounts, threat actors, or processes involved
  • How? Attack vector and techniques used (map to MITRE ATT&CK)
  • Why? Motivation or business impact

Data Sources: SIEM logs, EDR telemetry, firewall logs, proxy logs, authentication logs, DNS logs, email headers

Phase 4: Escalation

Escalation Criteria (Tier 1 → Tier 2):

  • Confirmed malware or intrusion
  • Data exfiltration indicators
  • Lateral movement detected
  • Privilege escalation attempts
  • Multiple affected systems
  • Executive or critical system involvement
  • Complexity beyond Tier 1 scope

Escalation to Management/Legal: Data breach, ransomware, regulatory incident, major business impact

Phase 5: Response & Phase 6: Resolution

Response Actions: Isolate affected systems, disable accounts, block IPs/domains, remove malware, patch vulnerabilities, reset credentials

Documentation Requirements: Incident timeline, actions taken, systems affected, root cause, remediation steps, lessons learned

Closure Checklist: Threat eradicated, systems restored, monitoring in place, stakeholders notified, documentation complete

Alert Triage Methodology

Triage is the most critical skill for Tier 1 analysts. Effective triage reduces noise, prevents alert fatigue, and ensures real threats are escalated promptly. Poor triage leads to missed incidents or wasted resources.

The Triage Decision Tree

START: New Alert Received
  ↓
Question 1: Is this activity malicious or suspicious?
  ↓ NO → FALSE POSITIVE
      → Document reason
      → Tune detection rule if needed
      → Close ticket
  ↓ YES → Continue
  ↓
Question 2: Is this activity authorized or expected?
  ↓ YES → BENIGN TRUE POSITIVE
      → Verify authorization
      → Document exception
      → Add to whitelist if recurring
      → Close ticket
  ↓ NO → TRUE POSITIVE
      → Assess severity and urgency
      → Escalate to Tier 2
      → Begin containment if time-critical
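
The two questions in the decision tree reduce to a small function. This is a sketch only: the names `Disposition` and `triage` are hypothetical, and in practice the two boolean answers come from analyst investigation, not from code.

```python
from enum import Enum


class Disposition(Enum):
    TRUE_POSITIVE = "TP"
    FALSE_POSITIVE = "FP"
    BENIGN_TRUE_POSITIVE = "BTP"


def triage(is_suspicious: bool, is_authorized: bool) -> Disposition:
    """Apply the two triage questions in order."""
    # Question 1: not malicious or suspicious at all -> false positive.
    if not is_suspicious:
        return Disposition.FALSE_POSITIVE
    # Question 2: suspicious-looking but authorized or expected -> benign true positive.
    if is_authorized:
        return Disposition.BENIGN_TRUE_POSITIVE
    # Suspicious and not authorized -> real threat.
    return Disposition.TRUE_POSITIVE
```

For example, `triage(True, False)` yields `Disposition.TRUE_POSITIVE`, the branch that leads to severity assessment and escalation to Tier 2.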

Detailed Triage Categories

1. True Positive (TP) - Real Threat

Indicators:

  • Known malicious IP/domain communication
  • Malware file hash match on VirusTotal
  • Exploitation of known vulnerability
  • Credential theft or brute force success
  • Unauthorized data access or exfiltration
  • Command-and-control (C2) beaconing

Example: An EDR alert fires for PowerShell executing encoded commands. Investigation shows the download of a Cobalt Strike beacon from a known malicious domain.

Action: Escalate immediately with HIGH severity. Include all context: affected user/host, IOCs, initial containment actions.

2. False Positive (FP) - Benign Activity

Common Causes:

  • Overly broad detection signatures
  • Legitimate tools flagged as malicious (Admin tools, pentesting software)
  • Normal business processes triggering behavioral rules
  • Outdated threat intelligence (old IOCs)
  • Misconfigured security tools

Example: An IDS alert fires for SQL injection. Investigation shows a legitimate application request whose parameters contained SQL keywords and matched an overly broad signature.

Action: Close ticket as FP. Document the reason. Submit rule tuning request to reduce future FPs. Consider whitelisting source.

3. Benign True Positive (BTP) - Authorized But Flagged

Common Scenarios:

  • IT admin using remote access tools outside business hours
  • Developer accessing production database per change control
  • Security team running penetration tests
  • Authorized third-party vendor access
  • Unusual but legitimate user travel (VPN from foreign country)

Example: Alert for abnormal login time and location. User confirms they are traveling internationally for business.

Action: Verify authorization through ticketing system, email, or manager confirmation. Document justification. Add exception if recurring.

Triage Best Practices

Speed vs. Accuracy

Balance is critical. Rapid triage prevents alert backlog, but rushing leads to missed threats. Aim for 5-15 minutes per alert depending on complexity.

Context is Everything

Never make decisions based solely on the alert. Check user role, system criticality, time of day, geolocation, recent changes.

Document Everything

Your notes may be reviewed during audits or legal proceedings. Include what you checked, why you made your decision, and next steps.

When in Doubt, Escalate

It's better to escalate a questionable alert than to close a real incident. Tier 2 can always de-escalate if needed.

Common Triage Mistakes

  • Confirmation Bias: Seeing what you expect rather than what's there (assuming all alerts are FPs)
  • Alert Fatigue: Closing alerts without investigation due to high volume
  • Insufficient Context: Making decisions without checking related logs
  • Over-Reliance on Severity: Dismissing low-severity alerts that are part of a larger attack
  • Poor Documentation: Not recording triage logic for future reference

Escalation Procedures and Criteria

Effective escalation ensures that the right expertise is applied to each incident while preventing bottlenecks. Understanding when, how, and to whom to escalate is essential for SOC efficiency.

Escalation Decision Matrix

LOW Severity
Tier 1 Handles

Single user affected, no data loss, known FP pattern, standard remediation available

Example: Phishing email blocked by gateway

MEDIUM Severity
Escalate to Tier 2

Multiple users, confirmed malware, lateral movement indicators, requires investigation

Example: Trojan detected on workstation, contained but needs analysis

HIGH Severity
Escalate to Tier 3

Critical systems, data exfiltration, advanced techniques, zero-day exploit

Example: Ransomware encryption across file servers

CRITICAL Severity
Incident Commander

Active breach, widespread impact, executive involvement, regulatory reporting needed

Example: Nation-state APT with confirmed data exfiltration
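
The matrix above is effectively a lookup from severity to owner. A minimal sketch, assuming the four severity labels are stored as plain strings; `ESCALATION_TARGET` and `route` are hypothetical names.

```python
# Severity -> who handles it, copied from the escalation decision matrix.
ESCALATION_TARGET = {
    "LOW": "Tier 1",
    "MEDIUM": "Tier 2",
    "HIGH": "Tier 3",
    "CRITICAL": "Incident Commander",
}


def route(severity: str) -> str:
    """Return the owner for a severity label, ignoring case and padding."""
    key = severity.strip().upper()
    if key not in ESCALATION_TARGET:
        # Fail loudly instead of silently defaulting to the lowest tier.
        raise ValueError(f"unknown severity: {severity!r}")
    return ESCALATION_TARGET[key]
```

Raising on an unknown label is deliberate: a typo in a severity field should surface as an error, not as an alert quietly left with Tier 1.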

Escalation Criteria by Category

Technical Escalation (Tier 1 → Tier 2)

Escalate when:

  • Confirmed malware that bypassed preventive controls
  • Successful exploitation of a vulnerability
  • Evidence of credential compromise or privilege escalation
  • Lateral movement between systems detected
  • Data exfiltration indicators (large uploads, unusual protocols)
  • Multiple related alerts suggesting coordinated attack
  • Investigation requires forensic tools or deep analysis
  • Runbook doesn't cover the scenario

Management Escalation (SOC → Leadership)

Escalate when:

  • Incident affects executive systems or data
  • Business-critical systems are compromised or unavailable
  • Suspected data breach requiring regulatory notification
  • Media or public attention likely
  • Attack suggests targeted campaign or APT
  • Financial fraud or wire transfer compromise
  • Response requires significant business decisions (shutdown systems, notify customers)

External Escalation

Legal / Compliance

  • PII/PHI data breach
  • Regulatory reporting required (GDPR, HIPAA, PCI)
  • Law enforcement involvement needed
  • Contractual breach notification

Executive Management

  • Business continuity impact
  • Reputational risk
  • Strategic decision required
  • Major financial impact

IT Operations

  • System patching required
  • Network changes needed
  • Service restoration
  • Configuration changes

External IR / Law Enforcement

  • Capabilities exceeded
  • Criminal investigation
  • Advanced forensics needed
  • Nation-state actor

Effective Escalation Communication

GOOD Escalation Example:

SUBJECT: [HIGH] Confirmed Malware - User jdoe - Finance Workstation

SUMMARY: Tier 1 confirmed malware on Finance user workstation. System isolated. Escalating for malware analysis and scope determination.

DETAILS:
- Alert: EDR behavioral detection "Suspicious PowerShell Activity"
- User: jdoe (Finance Department - Payroll Access)
- Host: FIN-WS-042 (10.20.30.42)
- Time: 2025-12-21 14:32 UTC
- Initial Triage: User clicked email link, PowerShell downloaded and executed file from hxxp://malicious-domain[.]com/payload.exe
- VirusTotal: 42/70 engines detect as Emotet variant
- Actions Taken: Host isolated via EDR, user notified, manager informed
- Urgency: User has access to payroll systems and bank account information

ESCALATION REASON: Requires malware analysis, lateral movement check, and credential reset scope determination.

ATTACHMENTS: Screenshot of EDR alert, VirusTotal report, initial timeline

BAD Escalation Example (Don't Do This):

SUBJECT: Alert

There's an alert on some computer. Looks bad. Can someone check it out?

Problems: No context, no urgency, no details, no actions taken, unprofessional
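
The good example writes the malicious URL in defanged form (hxxp and [.]) so that nobody can click it from the ticket or email. A minimal sketch of that convention follows; `defang` is a hypothetical helper, and it assumes only the scheme and host need rewriting, as in the example.

```python
def defang(indicator: str) -> str:
    """Rewrite a URL, domain, or IP so it cannot be clicked or auto-linked."""
    scheme, sep, rest = indicator.partition("://")
    if not sep:
        # No scheme present: treat the whole string as host[/path].
        scheme, rest = "", indicator
    host, slash, path = rest.partition("/")
    # Bracket the dots in the host only; the path is left readable.
    host = host.replace(".", "[.]")
    # http -> hxxp, https -> hxxps.
    if scheme.lower().startswith("http"):
        scheme = "hxxp" + scheme[4:]
    return f"{scheme}{sep}{host}{slash}{path}"


print(defang("http://malicious-domain.com/payload.exe"))
# hxxp://malicious-domain[.]com/payload.exe
```

Defanging is a presentation step. Keep the original indicator in a structured field so it can still be fed to detection systems unchanged.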

Escalation Best Practices

  • Include all relevant context (who, what, when, where, why)
  • State what you've already checked and ruled out
  • Clearly articulate why escalation is needed
  • Attach supporting evidence (logs, screenshots, reports)
  • Set appropriate urgency/severity level
  • Provide your initial assessment and recommendations
  • Follow your SOC's escalation SLA (typically 15-30 min for high severity)

Shift Handoff and Communication

Effective shift handoff is critical in a 24/7 SOC. Poor handoffs lead to missed incidents, duplicated work, and delayed escalations. Treat handoff as a formal process, not an afterthought.

Why Shift Handoff Matters

Consequences of Poor Handoff:

  • Ongoing incidents fall through the cracks
  • Next shift re-investigates already triaged alerts (wasted effort)
  • Context is lost, delaying incident response
  • Escalations are delayed or forgotten
  • Alert fatigue increases from duplicate work

Shift Handoff Components

1. Ongoing Incidents

For each active incident, document:

  • Incident ID and Summary: Ticket number and one-line description
  • Current Status: Investigation, containment, waiting for external input
  • Actions Taken: What has been done so far
  • Next Steps: What needs to happen next and by when
  • Waiting On: Any blockers or dependencies (IT team, vendor response)
  • Severity/Urgency: How critical is immediate action

2. Pending Alerts

Alert queue status:

  • Backlog Count: How many alerts are still untriaged
  • Priority Alerts: Any high-severity alerts that need immediate attention
  • Trends: Spike in specific alert types (may indicate ongoing attack)
  • Known Issues: Tool malfunctions causing alert storms

3. Escalations and Follow-ups

Track escalated items:

  • Tickets escalated to Tier 2/3 awaiting response
  • Items escalated to IT/management needing follow-up
  • Expected callback times from vendors or external teams
  • Scheduled maintenance or changes that may cause alerts

4. Environmental Notes

Situational awareness:

  • Scheduled maintenance windows (patching, upgrades)
  • Known false positive sources being investigated
  • New detection rules deployed (may cause alert increase)
  • Tool outages or degraded performance
  • Ongoing threat campaigns (phishing wave, ransomware targeting industry)

Handoff Format Example

SOC SHIFT HANDOFF REPORT
Shift: Day Shift (08:00 - 16:00 UTC)
Date: 2025-12-21
Analyst: Sarah Chen
Next Shift: Evening (16:00 - 00:00 UTC) - Mike Johnson
═══════════════════════════════════════════════════════════════
ONGOING INCIDENTS (Action Required):

[INC-12456] HIGH - Suspected Credential Stuffing Attack
Status: Investigation in progress
Summary: Multiple failed logins from distributed IPs targeting VPN portal
Actions Taken:
  - Identified 47 targeted accounts
  - Blocked 23 malicious IPs via firewall
  - Notified affected users to reset passwords
Next Steps:
  - Tier 2 performing log analysis for successful logins (ETA 17:00)
  - Monitor for additional login attempts
Urgency: HIGH - Active attack

[INC-12461] MEDIUM - Malware Quarantined on HR Workstation
Status: Containment complete, awaiting final verification
Summary: Emotet trojan quarantined by EDR, no execution occurred
Actions Taken:
  - EDR quarantined file automatically
  - Verified no C2 communication
  - User educated on phishing
Next Steps:
  - IT patching system tonight
  - Close ticket after patch verification tomorrow
Urgency: LOW - Contained
═══════════════════════════════════════════════════════════════
ALERT QUEUE STATUS:
Total Pending: 12 alerts
  - HIGH: 0
  - MEDIUM: 3 (prioritize IDS alerts from DMZ)
  - LOW: 9
Trend: Increase in blocked phishing emails (34 today vs. 12 avg)
  - Appears to be targeting Finance department
  - Threat intel notified, investigating campaign
═══════════════════════════════════════════════════════════════
ESCALATIONS PENDING RESPONSE:
  - Ticket #12450: Escalated to IT Ops for patch deployment (waiting since 12:00)
  - Ticket #12455: Escalated to Tier 2 for C2 beacon analysis (under investigation)
═══════════════════════════════════════════════════════════════
ENVIRONMENTAL NOTES:
  - Scheduled firewall maintenance 20:00-22:00 (may lose connectivity alerts)
  - New SIEM rule deployed for detecting Kerberoasting (may see initial FPs)
  - CISO requested daily summary of phishing metrics (send by EOD)
═══════════════════════════════════════════════════════════════
METRICS (This Shift):
Alerts Triaged: 87
  - True Positives: 3
  - False Positives: 76
  - Benign True Positives: 8
Incidents Created: 4
Escalations: 2
MTTD: 8 minutes
MTTR: 32 minutes
═══════════════════════════════════════════════════════════════
Questions? Contact me: sarah.chen@company.com / ext. 5423

Handoff Best Practices

Document During Shift

Don't wait until handoff time to document. Update your handoff notes throughout the shift so you don't forget critical details.

Verbal + Written

Overlap shifts by 15-30 minutes where possible for a verbal handoff, and walk through the critical items. The written document backs up the conversation; it does not replace it.

Flag the Critical

Clearly mark items needing immediate attention. Use severity tags, bold text, or highlighting so they stand out.

Encourage Questions

Make sure incoming shift understands and has your contact info. Ambiguity leads to mistakes.

Common Handoff Failures

  • "Everything's fine": Even if quiet, document what you checked and current queue status
  • Incomplete Context: "Some server has an issue" - be specific about what, where, severity
  • No Next Steps: Leaving incoming shift to figure out what to do next
  • Lost Escalations: Forgetting to mention items sent to Tier 2 or management
  • Tribal Knowledge: Assuming next shift knows about ongoing situations or tool quirks

SOC Metrics and Key Performance Indicators

Metrics provide visibility into SOC performance, identify areas for improvement, and demonstrate value to stakeholders. However, metrics must be meaningful and actionable—avoid "vanity metrics" that look good but don't drive improvement.

Core SOC Metrics

1. Mean Time to Detect (MTTD)

MTTD
12 minutes

Definition: Average time from when an attack begins to when it's detected

Calculation: Sum of (Detection Time - Incident Start Time) / Number of Incidents

Target: <15 minutes for most organizations, <5 minutes for high-security environments

Why It Matters: Faster detection limits attacker dwell time and reduces damage

Improvement Strategies: Better detection rules, threat intelligence integration, proactive hunting

2. Mean Time to Respond (MTTR)

MTTR
45 minutes

Definition: Average time from detection to containment/remediation

Calculation: Sum of (Resolution Time - Detection Time) / Number of Incidents

Target: <60 minutes for high-severity incidents, varies by severity

Why It Matters: Speed of response directly correlates with limiting blast radius

Improvement Strategies: Automated response playbooks, runbooks, training, clear escalation paths
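
The MTTD and MTTR calculations above are both averages of time differences. A minimal sketch, assuming each incident record carries three timestamps; the `Incident` class and the sample times are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Incident:
    started: datetime    # when the attack began (often known only after investigation)
    detected: datetime   # when the SOC detected it
    resolved: datetime   # when it was contained or remediated


def _mean(deltas: Iterable[timedelta]) -> timedelta:
    items: List[timedelta] = list(deltas)
    if not items:
        raise ValueError("need at least one incident")
    return sum(items, timedelta()) / len(items)


def mttd(incidents: List[Incident]) -> timedelta:
    """Sum of (detection time - incident start time) / number of incidents."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents: List[Incident]) -> timedelta:
    """Sum of (resolution time - detection time) / number of incidents."""
    return _mean(i.resolved - i.detected for i in incidents)


t0 = datetime(2025, 12, 21, 14, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=40)),
    Incident(t0, t0 + timedelta(minutes=14), t0 + timedelta(minutes=74)),
]
print(mttd(sample))  # 0:12:00
print(mttr(sample))  # 0:45:00
```

A plain mean is sensitive to outliers: one long-running incident can dominate the figure, so tracking the median alongside it is a reasonable addition.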

3. Alert Volume and Triage Metrics

Alerts per Day
1,247

Total security alerts generated. High volume indicates potential tuning needs.

True Positive Rate
4.2%

Percentage of alerts that are real incidents. Typical: 2-10%

False Positive Rate
87.3%

Percentage of alerts that are benign. Goal: <90%, ideally <80%

Benign TP Rate
8.5%

Authorized activity flagged as suspicious. Whitelist candidates.

Alert Volume Health Check:

  • <500 alerts/day: Possibly insufficient monitoring coverage or over-tuned rules
  • 500-2000 alerts/day: Healthy for medium-sized organizations
  • 2000-5000 alerts/day: Manageable with automation and proper staffing
  • >5000 alerts/day: Risk of alert fatigue, aggressive tuning needed
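
The triage rates and the volume health check above can be computed from simple counts. A minimal sketch with hypothetical function names; because the listed bands share their boundary values, it assumes 2000 and 5000 belong to the lower band.

```python
from typing import Dict


def triage_rates(tp: int, fp: int, btp: int) -> Dict[str, float]:
    """Return TP/FP/BTP shares of all triaged alerts, in percent."""
    total = tp + fp + btp
    if total == 0:
        raise ValueError("no triaged alerts")
    counts = {"TP": tp, "FP": fp, "BTP": btp}
    return {name: round(100 * count / total, 1) for name, count in counts.items()}


def volume_band(alerts_per_day: int) -> str:
    """Map daily alert volume to the health-check bands listed above."""
    if alerts_per_day < 500:
        return "possible coverage gap or over-tuned rules"
    if alerts_per_day <= 2000:      # assumption: 2000 falls in this band
        return "healthy for a medium-sized organization"
    if alerts_per_day <= 5000:      # assumption: 5000 falls in this band
        return "manageable with automation and proper staffing"
    return "alert-fatigue risk, aggressive tuning needed"


print(triage_rates(tp=3, fp=76, btp=8))  # {'TP': 3.4, 'FP': 87.4, 'BTP': 9.2}
print(volume_band(1247))                 # healthy for a medium-sized organization
```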

4. Incident Metrics

Incidents per Month
23

Total confirmed security incidents requiring response

Critical Incidents
2

High-impact incidents requiring management notification

Incident Backlog
5

Open incidents not yet resolved. Should trend toward zero.

Recurring Incidents
18%

Percentage of repeat incidents. Indicates root cause not addressed.

5. Coverage and Visibility Metrics

Log Source Coverage

94% of critical assets sending logs to SIEM

Target: >95% for critical systems, >80% for all systems

EDR Deployment

98% of endpoints with EDR agent installed

Target: >95% for workstations, 100% for servers

MITRE ATT&CK Coverage

78% of techniques have detection coverage

Target: >70% overall, >90% for high-priority techniques

Detection Rule Health

87% of rules firing in last 30 days

Dead rules should be reviewed and retired/improved
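
Detection rule health can be derived from each rule's last-fired date. A minimal sketch, assuming the SIEM can export that date per rule; the rule names and dates below are invented for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def dead_rules(last_fired: Dict[str, Optional[date]], today: date,
               window_days: int = 30) -> List[str]:
    """Rules that never fired or last fired before the window started."""
    cutoff = today - timedelta(days=window_days)
    return sorted(
        name for name, fired in last_fired.items()
        if fired is None or fired < cutoff
    )


def rule_health_pct(last_fired: Dict[str, Optional[date]], today: date,
                    window_days: int = 30) -> float:
    """Percentage of rules that fired within the window."""
    if not last_fired:
        raise ValueError("no rules supplied")
    dead = dead_rules(last_fired, today, window_days)
    return round(100 * (len(last_fired) - len(dead)) / len(last_fired), 1)


rules = {
    "kerberoasting": date(2025, 12, 20),
    "brute_force_vpn": date(2025, 12, 1),
    "legacy_smb_scan": date(2025, 9, 14),
    "dns_tunnel": None,              # never fired
}
today = date(2025, 12, 21)
print(dead_rules(rules, today))       # ['dns_tunnel', 'legacy_smb_scan']
print(rule_health_pct(rules, today))  # 50.0
```

A silent rule is not necessarily broken. It may cover a rare technique, so the output is a review list, not a deletion list.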

6. Analyst Performance Metrics

Use with Caution: Analyst metrics can be helpful for training but should never be weaponized. Focusing too heavily on individual metrics creates perverse incentives (e.g., closing tickets quickly without proper investigation).

  • Alerts Triaged per Shift: Productivity indicator (typical: 50-100 depending on complexity)
  • Triage Accuracy: Percentage of triage decisions upheld by Tier 2 review
  • Escalation Rate: Percentage of alerts escalated (typical: 5-15%)
  • Documentation Quality: Completeness of ticket notes (subjective, peer-reviewed)
  • SLA Compliance: Percentage of alerts triaged within SLA time (e.g., 15 min)
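
As an illustration of the last item, SLA compliance is the share of alerts triaged within the target time. A minimal sketch, assuming triage durations are already available per alert; the function name and the sample durations are hypothetical, and the 15-minute default mirrors the example above.

```python
from datetime import timedelta
from typing import Iterable, List


def sla_compliance(triage_times: Iterable[timedelta],
                   sla: timedelta = timedelta(minutes=15)) -> float:
    """Percentage of alerts whose triage time is at or under the SLA."""
    times: List[timedelta] = list(triage_times)
    if not times:
        raise ValueError("no triage times supplied")
    within = sum(1 for t in times if t <= sla)
    return round(100 * within / len(times), 1)


durations = [timedelta(minutes=m) for m in (5, 7, 12, 15, 22)]
print(sla_compliance(durations))  # 80.0
```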

How to Use Metrics Effectively

  • Trend Over Time: Don't obsess over single data points. Look for trends over weeks/months.
  • Context Matters: Spike in alerts may be due to new detection rule, not worsening security.
  • Drive Action: Metrics should lead to concrete improvements (tuning, training, tool changes).
  • Communicate Value: Use metrics to show leadership the SOC's impact and justify resources.
  • Balance Leading and Lagging: MTTD/MTTR are lagging (measure past). Coverage is leading (predict future).
  • Avoid Vanity Metrics: "Blocked 1 million threats" sounds impressive but lacks context. What's the trend? The impact?

Dashboard Example

SOC WEEKLY DASHBOARD - Week of 2025-12-15

┌─────────────────────────────────────────────────────────────┐
│ DETECTION & RESPONSE                                        │
├─────────────────────────────────────────────────────────────┤
│ Mean Time to Detect (MTTD):   11 min   [↓ -2 min] ✓         │
│ Mean Time to Respond (MTTR):  38 min   [↓ -7 min] ✓         │
│ Incidents Created:            18       [↑ +3]               │
│ Critical Incidents:           1        [→ same]             │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ ALERT METRICS                                               │
├─────────────────────────────────────────────────────────────┤
│ Total Alerts:                 8,734    [↑ +12%]             │
│ True Positive Rate:           3.8%     [→ same]             │
│ False Positive Rate:          88.1%    [↑ +3%]              │
│ Avg Triage Time:              7 min    [→ same]             │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ COVERAGE                                                    │
├─────────────────────────────────────────────────────────────┤
│ Log Source Coverage:          94%      [→ same]             │
│ EDR Agent Deployment:         97%      [↑ +1%] ✓            │
│ Detection Rules Active:       247      [↑ +5]               │
└─────────────────────────────────────────────────────────────┘

KEY FINDINGS:
✓ Response times improved due to new SOAR playbooks
- Alert volume spike from new cloud monitoring (tuning in progress)
- FP rate increase due to new rule deployment (tuning scheduled)

SOC Runbooks and Playbooks

Runbooks and playbooks are essential SOC documentation that standardizes incident response, reduces decision fatigue, and ensures consistent handling of common scenarios. They're especially critical for Tier 1 analysts who may encounter unfamiliar situations.

Runbook vs. Playbook

Runbook

Step-by-Step Investigation Guide

Purpose: Provide detailed instructions for investigating a specific alert type or scenario

Scope: Single alert type or specific detection rule

Audience: Primarily Tier 1 analysts

Example: "Runbook for Investigating Brute Force Login Alerts"

Contents: Triage steps, data to collect, decision tree, escalation criteria

Playbook

End-to-End Response Workflow

Purpose: Orchestrate complete incident response from detection through recovery

Scope: Entire incident category or attack type

Audience: All SOC tiers, IR team, IT, management

Example: "Ransomware Incident Response Playbook"

Contents: Response phases, roles & responsibilities, communication plan, containment/eradication steps

Sample Runbook: Malware Detection Alert

RUNBOOK: EDR Malware Detection Alert
Version: 2.1 | Last Updated: 2025-12-01 | Owner: SOC Team
═══════════════════════════════════════════════════════════════
1. INITIAL TRIAGE (5 minutes)

□ Verify alert details in EDR console (CrowdStrike/SentinelOne)
  - Alert timestamp, severity, detection method
  - Affected hostname and IP address
  - Username and user department
  - Malware family/type identified
  - Current host status (online, isolated, offline)

□ Check VirusTotal/hybrid-analysis.com for file hash
  - Search by hash (do NOT upload the file itself)
  - Document detection ratio (e.g., 45/70)
  - Note malware family classification

□ Determine if system is already isolated
  - If YES: Proceed to investigation
  - If NO: Consider immediate isolation if high severity

DECISION POINT:
  - 0-9 vendors detect: Likely FP → Verify with vendor, document, close
  - 10-29 vendors detect: Investigate further (proceed to step 2)
  - 30+ vendors detect: Confirmed malware → Isolate immediately, escalate
═══════════════════════════════════════════════════════════════
2. CONTEXT GATHERING (10 minutes)

□ User Information
  - Check Active Directory: User role, department, privileges
  - Contact user: Ask about recent downloads, email clicks
  - Recent access: Check if user accessed sensitive systems

□ Host Information
  - Asset criticality: Production server? Executive workstation?
  - Check CMDB for system purpose and data classification
  - Recent changes: Software installs, patches applied?

□ Malware Execution Status
  - Was file executed or just downloaded?
  - Check EDR process tree for indicators of execution
  - Look for persistence mechanisms (registry, scheduled tasks)

□ Network Activity
  - Check firewall/proxy logs for C2 communication
  - Look for data exfiltration (large uploads, unusual protocols)
  - Identify any lateral movement attempts

DECISION POINT:
  - File quarantined before execution: Lower priority, likely can handle at Tier 1
  - File executed with C2 communication: ESCALATE TO TIER 2 IMMEDIATELY
  - Multiple hosts affected: ESCALATE TO TIER 2 IMMEDIATELY
═══════════════════════════════════════════════════════════════
3. CONTAINMENT (If Not Already Isolated)

□ For confirmed malware with execution:
  - Isolate host via EDR (prevents network communication)
  - Disable user account in AD
  - Block C2 IPs/domains at firewall
  - Alert other teams (IT, IR) via Slack/email

□ Document containment actions with timestamps
═══════════════════════════════════════════════════════════════
4. TRIAGE CLASSIFICATION

FALSE POSITIVE - Close ticket if:
  ✓ Low VirusTotal detection (<10)
  ✓ File is known legitimate software
  ✓ Vendor confirms as FP
  → Action: Document reason, submit FP to vendor, close

TRUE POSITIVE (Tier 1 Handles) - Continue if:
  ✓ Malware quarantined before execution
  ✓ No C2 communication detected
  ✓ Single host affected
  ✓ Not a critical system or privileged user
  → Action: Proceed to step 5

TRUE POSITIVE (Escalate to Tier 2) - Escalate if:
  ✗ Malware executed successfully
  ✗ C2 communication detected
  ✗ Multiple hosts affected
  ✗ Critical system or executive user
  ✗ Ransomware indicators
  → Action: Create high-priority escalation ticket with all context
═══════════════════════════════════════════════════════════════
5. REMEDIATION (Tier 1 - Simple Cases Only)

□ Verify malware quarantined by EDR
□ Run full endpoint scan to ensure no other malware
□ Check for scheduled tasks, registry run keys (persistence)
□ User education: Send phishing awareness reminder
□ Document incident in ticket with full timeline
□ Un-isolate host once verified clean
□ Re-enable user account
□ Monitor for 24 hours for re-infection
═══════════════════════════════════════════════════════════════
6. DOCUMENTATION

Required fields in ticket:
  - Malware family and hash
  - VirusTotal detection ratio
  - Execution status (yes/no)
  - C2 communication (yes/no)
  - User notification (yes/no)
  - Containment actions taken
  - Final disposition (TP/FP/BTP)
═══════════════════════════════════════════════════════════════
ESCALATION CRITERIA:
  → Malware executed + C2 communication
  → Ransomware indicators
  → Multiple hosts infected
  → Executive or critical system
  → Unusual/unknown malware family
  → Analyst unsure of next steps

CONTACTS:
  Tier 2 Escalation: tier2@company.com / Slack #soc-tier2
  EDR Support: edr-support@company.com
  User Support: helpdesk@company.com
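
The runbook's decision points can be written as two small checks. This is a sketch under stated assumptions, not part of the runbook: the function names are hypothetical, and the vendor-count bands are read as fewer than 10, 10 to 29, and 30 or more, in line with the "<10" threshold in step 4.

```python
def vt_decision(detections: int) -> str:
    """Step 1 decision point: map a VirusTotal vendor-detection count to an action."""
    if detections < 0:
        raise ValueError("detections cannot be negative")
    if detections < 10:
        return "likely FP: verify with vendor, document, close"
    if detections < 30:
        return "investigate further (proceed to step 2)"
    return "confirmed malware: isolate immediately, escalate"


def needs_tier2(executed: bool, c2_seen: bool, hosts_affected: int,
                critical_asset: bool, ransomware: bool) -> bool:
    """Step 4: any one of these findings means escalation to Tier 2."""
    return (
        executed
        or c2_seen
        or hosts_affected > 1
        or critical_asset
        or ransomware
    )


print(vt_decision(45))  # confirmed malware: isolate immediately, escalate
print(needs_tier2(executed=False, c2_seen=False, hosts_affected=1,
                  critical_asset=False, ransomware=False))  # False
```

The detection count is one input among several. A low count on a freshly compiled sample does not prove the file is benign, which is why the runbook pairs it with vendor verification and context gathering.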

Key Elements of Effective Runbooks

Clear Steps

Use checkboxes, numbered steps, and action verbs. Avoid ambiguity like "check for suspicious activity"—specify what to check and where.

Time Estimates

Per-step time estimates help analysts manage their time and recognize when they're going down a rabbit hole.

Decision Points

Clearly defined "if this, then that" logic helps analysts make confident triage decisions.

Examples

Include screenshots, sample logs, and example scenarios to illustrate concepts.

Contact Info

Who to escalate to, who to call for help, and where to find additional resources.

Version Control

Date, version number, and owner. Outdated runbooks are worse than no runbooks.

Common Runbook Topics

  • Brute Force / Password Spray Attacks
  • Phishing Email Investigations
  • Malware Detection (EDR/AV alerts)
  • Suspicious PowerShell Activity
  • Unusual Login Location / Impossible Travel
  • DDoS Attack Response
  • Data Exfiltration Indicators
  • Privilege Escalation Attempts
  • Web Application Attacks (SQLi, XSS)
  • Insider Threat Indicators

Runbook Maintenance Best Practices

  • Review and update runbooks quarterly or after major incidents
  • Incorporate lessons learned from post-incident reviews
  • Get feedback from analysts who actually use them
  • Test runbooks during tabletop exercises
  • Retire outdated runbooks rather than letting them accumulate
  • Make runbooks easily searchable (wiki, Confluence, SharePoint)
  • Include runbooks in new analyst onboarding training
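
The quarterly review and ownership practices above lend themselves to a simple automated check. This is a minimal sketch, assuming runbook metadata (title, version, owner, last review date) is available in some structured form. The `Runbook` class, the `needs_attention` helper, and the 90-day approximation of "quarterly" are all illustrative assumptions, not part of any particular wiki or tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# "Quarterly" approximated as 90 days (an assumption for this sketch).
REVIEW_INTERVAL = timedelta(days=90)


@dataclass
class Runbook:
    title: str
    version: str
    owner: str  # empty string means nobody is accountable
    last_reviewed: date


def needs_attention(runbook: Runbook, today: date) -> list[str]:
    """Return the reasons a runbook should be reviewed, if any."""
    reasons = []
    if not runbook.owner.strip():
        reasons.append("no owner")
    if today - runbook.last_reviewed > REVIEW_INTERVAL:
        reasons.append("review overdue")
    return reasons


if __name__ == "__main__":
    today = date(2024, 6, 1)
    runbooks = [
        Runbook("Malware Detection", "1.2", "", date(2024, 1, 1)),
        Runbook("Phishing Email Investigations", "2.0", "alice", date(2024, 5, 1)),
    ]
    for rb in runbooks:
        reasons = needs_attention(rb, today)
        if reasons:
            print(f"{rb.title} (v{rb.version}): {', '.join(reasons)}")
```

Running a check like this on a schedule gives the runbook owner a short list to act on instead of relying on someone remembering the review date.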

Runbook Anti-Patterns

  • Too Vague: "Investigate the alert" without specifics
  • Too Rigid: No room for analyst judgment or unusual scenarios
  • Outdated: References tools or processes no longer in use
  • Overly Complex: 50-page document when 2 pages would suffice
  • No Ownership: No one accountable for keeping it current

Analyst Well-Being and Burnout Prevention

SOC analyst burnout is a critical issue in cybersecurity. The combination of high stress, shift work, alert fatigue, and constant exposure to threats takes a toll. Sustainable SOC operations require proactive attention to analyst well-being.

The SOC Burnout Crisis

Commonly cited industry figures (exact numbers vary by survey):

  • Average SOC analyst tenure: 18-24 months (high turnover)
  • 70% of SOC analysts report high stress levels
  • Alert fatigue cited as top reason for leaving SOC roles
  • 24/7 shift work disrupts sleep and personal life
  • Constant exposure to threats can lead to anxiety and cynicism

Primary Burnout Contributors

1. Alert Fatigue

The Problem: Hundreds or thousands of alerts daily, most of which are false positives. Analysts become desensitized and may miss real threats.

Solutions:

  • Aggressive false positive tuning and alert reduction programs
  • Automation of low-value triage tasks via SOAR
  • Alert prioritization and risk-based routing
  • Regular "alert health" reviews to retire noisy, low-value rules
  • Give analysts permission to question and challenge unhelpful alerts
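
An "alert health" review can start from closed-ticket dispositions. The sketch below assumes each closed alert is available as a `(rule_name, disposition)` pair where the disposition is the string `"FP"` for false positives. The `noisy_rules` function and its default thresholds (at least 20 alerts, at least 95% false positives) are hypothetical starting points, not recommended values.

```python
from collections import Counter


def noisy_rules(closed_alerts, min_alerts=20, fp_threshold=0.95):
    """Return (rule, total_alerts, fp_rate) for rules that look like tuning candidates.

    closed_alerts: iterable of (rule_name, disposition) pairs,
    where disposition is "FP" for a false positive (assumed format).
    """
    totals = Counter()
    false_positives = Counter()
    for rule, disposition in closed_alerts:
        totals[rule] += 1
        if disposition == "FP":
            false_positives[rule] += 1

    candidates = []
    for rule, total in totals.items():
        fp_rate = false_positives[rule] / total
        # Require a minimum volume so one-off alerts are not flagged.
        if total >= min_alerts and fp_rate >= fp_threshold:
            candidates.append((rule, total, fp_rate))

    # Highest-volume offenders first: they cost analysts the most time.
    return sorted(candidates, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    history = (
        [("dns-tunnel-heuristic", "FP")] * 30
        + [("edr-malware-detected", "TP")] * 5
        + [("edr-malware-detected", "FP")] * 20
    )
    for rule, total, fp_rate in noisy_rules(history):
        print(f"{rule}: {total} alerts, {fp_rate:.0%} false positives")
```

The output is a list of candidates for review, not an automatic retirement list. A rule with a high false-positive rate may still be worth keeping if its rare true positives are severe.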

2. Shift Work Challenges

The Problem: 24/7 coverage requires night shifts, weekend work, and rotating schedules that disrupt circadian rhythms and personal life.

Solutions:

  • Limit consecutive night shifts (no more than 3-4 in a row)
  • Allow analyst input on scheduling preferences when possible
  • Provide adequate shift differential pay for nights/weekends
  • Consider "follow-the-sun" model with geographically distributed teams
  • Ensure minimum time off between shifts (8-12 hours)
  • Provide quiet break rooms and encourage regular breaks
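
Two of the scheduling limits above, consecutive night shifts and minimum rest between shifts, can be checked mechanically against a draft roster. This is a sketch under stated assumptions: the `Shift` class and `schedule_violations` function are hypothetical, the defaults (4 nights, 8 hours rest) are the most lenient values from the ranges in the list, and a night streak is assumed to reset after a gap of 24 hours or more.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Shift:
    start: datetime
    end: datetime
    is_night: bool


def schedule_violations(
    shifts,
    max_consecutive_nights=4,
    min_rest=timedelta(hours=8),
    streak_break=timedelta(hours=24),  # assumed: a day off resets the streak
):
    """Check one analyst's shifts and return human-readable violations."""
    violations = []
    night_streak = 0
    previous = None
    for shift in sorted(shifts, key=lambda s: s.start):
        if previous is not None:
            rest = shift.start - previous.end
            if rest < min_rest:
                violations.append(
                    f"short rest: {rest} before shift starting {shift.start:%Y-%m-%d %H:%M}"
                )
            if rest >= streak_break:
                night_streak = 0
        night_streak = night_streak + 1 if shift.is_night else 0
        if night_streak > max_consecutive_nights:
            violations.append(
                f"night streak: {night_streak} nights in a row at {shift.start:%Y-%m-%d}"
            )
        previous = shift
    return violations


if __name__ == "__main__":
    first_night = datetime(2024, 3, 4, 22, 0)
    roster = [
        Shift(first_night + timedelta(days=i),
              first_night + timedelta(days=i, hours=8),
              is_night=True)
        for i in range(5)
    ]
    for problem in schedule_violations(roster):
        print(problem)
```

Tightening `min_rest` to 12 hours or `max_consecutive_nights` to 3 applies the stricter end of the ranges without changing the logic.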

3. Lack of Career Growth

The Problem: Analysts feel stuck in reactive triage work with no clear path forward, leading to frustration and attrition.

Solutions:

  • Clear career progression paths (Tier 1 → Tier 2 → Tier 3 → Management/Architecture)
  • Training budgets for certifications (CySA+, GCIH, GCIA)
  • Rotation opportunities (threat intel, detection engineering, incident response)
  • Mentorship programs pairing junior analysts with senior staff
  • Recognition programs for excellent work and continuous improvement
  • Time allocated for skill development during work hours

4. Repetitive, Low-Value Work

The Problem: Analysts can spend as much as 80% of their time on simple triage and closing obvious false positives, which leaves them feeling like "alert button-clickers."

Solutions:

  • Automate obvious false positives via SOAR playbooks
  • Allocate time for analysts to work on improvement projects
  • Encourage detection engineering and rule creation
  • Threat hunting rotations to break up monotony
  • Involve analysts in tool selection and process improvement

Organizational Strategies for Preventing Burnout

Realistic Workload

Don't expect analysts to triage 200 alerts per shift. Quality over quantity. Allow time for thorough investigation.

Right Tools

Invest in SIEM, SOAR, and automation. Analysts burn out faster when forced to use clunky, ineffective tools.

Continuous Learning

Provide training, conference attendance, and time to research new attack techniques and defenses.

Psychological Safety

Create a culture where analysts can ask questions, admit mistakes, and escalate without fear of blame.

Recognition

Celebrate wins: incident detection, process improvements, and continuous learning. Analysts need to feel valued.

Work-Life Balance

Enforce PTO usage, respect off-hours, and don't glorify overwork. Burnout helps no one.

Individual Analyst Self-Care

What you can do as an analyst:

  • Take Breaks: Step away from screens regularly, especially during high-stress incidents
  • Maintain Sleep Hygiene: Especially critical for shift workers—blackout curtains, consistent sleep schedule
  • Exercise: Physical activity is proven to reduce stress and improve focus
  • Build Community: Connect with other SOC analysts, share experiences, learn from each other
  • Set Boundaries: Don't check work email/Slack on days off unless on-call
  • Pursue Interests: Hobbies and activities outside cybersecurity provide mental recovery
  • Seek Support: Don't hesitate to use EAP (Employee Assistance Programs) or talk to a counselor
  • Know When to Move On: If a role is consistently damaging your health, it's okay to find a different position

Signs of Burnout to Watch For

  • Emotional Exhaustion: Feeling drained, cynical, detached from work
  • Reduced Performance: Difficulty concentrating, making mistakes, missing details
  • Physical Symptoms: Headaches, insomnia, digestive issues, frequent illness
  • Cynicism: "Nothing matters," "all alerts are false positives," detachment from impact of work
  • Irritability: Short temper, conflicts with colleagues, negative attitude
  • Absenteeism: Calling in sick more often, dreading going to work

If you notice these signs in yourself or colleagues, speak up and seek support.

Remember: You're Protecting People

SOC work can feel thankless—most of what you do prevents incidents that never happen. But your work matters enormously. You protect:

  • Customer data and privacy
  • Employee personal information
  • Business operations and revenue
  • Your organization's reputation
  • Jobs and livelihoods of your colleagues

Your vigilance, even when handling the 100th false positive of the day, keeps the organization safe. That is valuable work, and it deserves both respect and a sustainable pace.

Module Complete!

You've finished the SOC Operations presentation. You now understand the structure, workflows, and daily realities of Security Operations Center analyst work.

