SOC Operations - Eye House

What is a Security Operations Center (SOC)?

A Security Operations Center (SOC) is a centralized unit that deals with security issues on an organizational and technical level. The SOC is the first line of defense against cyber threats, providing continuous monitoring, detection, analysis, and response to security incidents.

Core SOC Mission

Protect the organization's assets, data, and reputation through proactive monitoring and rapid incident response.

Primary Functions

Continuous Monitoring: 24/7/365 surveillance of networks, systems, and applications
Threat Detection: Identify potential security incidents through SIEM, IDS/IPS, EDR, and other tools
Incident Response: Rapid triage, investigation, containment, and remediation of threats
Threat Intelligence: Collection and analysis of threat data to anticipate attacks
Vulnerability Management: Identify and prioritize security weaknesses
Compliance: Ensure adherence to security policies and regulatory requirements
Security Tool Management: Maintain and optimize security infrastructure

Key SOC Technologies

SIEM

Security Information & Event Management

Central log aggregation, correlation, and alerting (Splunk, QRadar, Sentinel)

EDR/XDR

Endpoint Detection & Response

Advanced endpoint monitoring and threat hunting (CrowdStrike, SentinelOne)

IDS/IPS

Intrusion Detection/Prevention

Network-based threat detection (Snort, Suricata, Palo Alto)

SOAR

Security Orchestration & Automation

Automated playbooks and response workflows (Splunk SOAR, formerly Phantom; Cortex XSOAR)

SOC Value Proposition

A well-functioning SOC reduces Mean Time to Detect (MTTD) and Mean Time to Respond (MTTR), minimizing the impact of security incidents and protecting business operations.

SOC Organizational Models

Organizations can structure their SOC in various ways depending on resources, expertise, and business requirements. Each model has distinct advantages and challenges.

In-House SOC

Fully owned and operated by the organization with dedicated staff and infrastructure.

✓ Pros:

Full control and customization
Deep organizational knowledge
Immediate access to systems
Better data privacy

✗ Cons:

High operational costs
Recruitment challenges
24/7 staffing requirements
Technology investment

Managed SOC (MSSP)

Outsourced to a Managed Security Service Provider who provides monitoring and response services.

✓ Pros:

Lower upfront costs
Access to expertise
24/7 coverage included
Faster deployment

✗ Cons:

Less control
Limited customization
Data sharing concerns
Dependency on vendor

Hybrid SOC

Combination of in-house and outsourced capabilities, leveraging strengths of both approaches.

✓ Pros:

Balanced control and cost
Flexible scalability
Shared expertise
Risk distribution

✗ Cons:

Complex coordination
Integration challenges
Unclear responsibilities
Communication overhead

Virtual SOC

Distributed team working remotely with cloud-based tools and infrastructure.

✓ Pros:

Global talent access
Lower facility costs
Flexible workforce
Cloud-native tools

✗ Cons:

Communication challenges
Time zone coordination
Remote security risks
Team cohesion

Choosing the Right Model

Factors to Consider:

Budget: Available resources for staff, tools, and infrastructure
Expertise: Internal security talent and hiring capacity
Compliance: Regulatory requirements for data handling
Scale: Organization size and complexity
Risk Tolerance: Acceptable levels of outsourcing and control

Common Pitfall

Many organizations underestimate the total cost of ownership for an in-house SOC. Beyond tools and salaries, consider training, retention, facility costs, and the challenge of maintaining 24/7 coverage.

SOC Roles and Tier Structure

Most SOCs operate with a tiered structure where analysts are organized by skill level and responsibility. This creates clear escalation paths and ensures appropriate expertise is applied to each incident.

Tier 1 - Alert Analyst

Front Line Defense

Primary Responsibilities:

Monitor SIEM and security tool alerts
Perform initial alert triage
Classify alerts (TP/FP/BTP)
Document findings in tickets
Escalate confirmed threats
Follow established runbooks
Basic log analysis

Skills Required: Basic security concepts, log analysis, ticketing systems, communication

Typical Experience: Entry-level to 2 years

Tier 2 - Incident Responder

Deep Investigation

Primary Responsibilities:

In-depth incident investigation
Threat correlation and analysis
Malware analysis (basic)
Incident containment actions
Create incident reports
Develop detection rules
Mentor Tier 1 analysts

Skills Required: Network forensics, threat analysis, scripting, incident handling

Typical Experience: 2-5 years

Tier 3 - Threat Hunter

Advanced Operations

Primary Responsibilities:

Proactive threat hunting
Advanced malware analysis
Security tool engineering
Detection engineering
Complex incident response
Threat intelligence research
Architecture recommendations

Skills Required: Advanced forensics, reverse engineering, threat intelligence, automation

Typical Experience: 5+ years

SOC Manager

Leadership & Strategy

Primary Responsibilities:

SOC team management
Shift scheduling and coverage
Metrics and KPI reporting
Process improvement
Budget and resource planning
Stakeholder communication
Training and development

Skills Required: Leadership, communication, metrics analysis, project management

Typical Experience: 7+ years

Threat Intelligence Analyst

Intel Operations

Primary Responsibilities:

Collect and analyze threat data
Monitor threat actor activity
Produce intelligence reports
Feed IOCs to detection systems
Track emerging threats
Industry information sharing

Skills Required: Threat landscape knowledge, research, analysis, communication

Typical Experience: 3+ years

Incident Commander

Crisis Management

Primary Responsibilities:

Lead major incident response
Coordinate response teams
Communicate with executives
Make critical decisions
Post-incident review
Lessons learned documentation

Skills Required: Incident response, leadership, decision-making, communication

Typical Experience: 5+ years

Career Progression in the SOC

A typical career path: Tier 1 Analyst → Tier 2 Incident Responder → Tier 3 Threat Hunter → SOC Manager / Architect / CISO

Lateral moves are also common: Threat Intelligence, Security Engineering, Penetration Testing, GRC

Alert Lifecycle and Workflow

Understanding the complete lifecycle of a security alert is fundamental to SOC operations. Every alert follows a structured path from initial detection through final resolution.

Standard Alert Lifecycle

1. Detection

Alert generated by SIEM, EDR, IDS, or other security tool

2. Triage

Initial assessment: Is this real? How severe?

3. Investigation

Gather context, analyze logs, determine scope

4. Escalation

Route to appropriate tier or external team

5. Response

Contain, eradicate, recover

6. Resolution

Close ticket, document lessons learned

Detailed Phase Breakdown

Phase 1: Detection

Alert Sources:

SIEM correlation rules (Splunk, QRadar, Sentinel)
EDR/XDR behavioral detections (CrowdStrike, Carbon Black)
Network IDS/IPS signatures (Snort, Suricata)
Email security gateways (Proofpoint, Mimecast)
Vulnerability scanners (Qualys, Nessus)
Threat intelligence feeds (MISP, ThreatConnect)
User reports (phishing, suspicious activity)

Alert Metadata: Timestamp, source IP, destination IP, user, device, signature/rule, severity, confidence score
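The metadata fields listed above can be carried through the pipeline as one normalized record. The following is a minimal sketch, assuming Python dataclasses; the class name, field names, the 0.0-1.0 confidence scale, and the sample values are illustrative assumptions, not any vendor's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Alert:
    """Normalized alert record (hypothetical field names, not a vendor schema)."""
    timestamp: datetime
    source_ip: str
    destination_ip: str
    user: str
    device: str
    rule: str          # signature or correlation rule that fired
    severity: str      # e.g. "LOW", "MEDIUM", "HIGH", "CRITICAL"
    confidence: float  # assumed to be a 0.0-1.0 score from the detecting tool

    def __post_init__(self):
        # Reject scores outside the assumed range so bad parsing is caught early.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


# Sample values are made up for illustration (203.0.113.0/24 is a documentation range).
alert = Alert(
    timestamp=datetime(2025, 12, 21, 14, 32, tzinfo=timezone.utc),
    source_ip="10.20.30.42",
    destination_ip="203.0.113.10",
    user="jdoe",
    device="FIN-WS-042",
    rule="Suspicious PowerShell Activity",
    severity="HIGH",
    confidence=0.9,
)
print(alert.device, alert.severity)
```

Storing timestamps as timezone-aware UTC values keeps timelines consistent when logs from several tools are compared.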

Phase 2: Triage (Critical Tier 1 Function)

True Positive (TP)

Real security incident requiring response

Escalate immediately, initiate IR

False Positive (FP)

Benign activity incorrectly flagged

Close, tune detection rule

Benign True Positive (BTP)

Suspicious but authorized activity

Document, add to whitelist

Phase 3: Investigation

Key Investigation Questions:

What happened? Describe the event in plain language
When? Establish timeline of activity
Where? Which systems, networks, or users are affected?
Who? User accounts, threat actors, or processes involved
How? Attack vector and techniques used (map to MITRE ATT&CK)
Why? Motivation or business impact

Data Sources: SIEM logs, EDR telemetry, firewall logs, proxy logs, authentication logs, DNS logs, email headers

Phase 4: Escalation

Escalation Criteria (Tier 1 → Tier 2):

Confirmed malware or intrusion
Data exfiltration indicators
Lateral movement detected
Privilege escalation attempts
Multiple affected systems
Executive or critical system involvement
Complexity beyond Tier 1 scope

Escalation to Management/Legal: Data breach, ransomware, regulatory incident, major business impact

Phase 5: Response & Phase 6: Resolution

Response Actions: Isolate affected systems, disable accounts, block IPs/domains, remove malware, patch vulnerabilities, reset credentials

Documentation Requirements: Incident timeline, actions taken, systems affected, root cause, remediation steps, lessons learned

Closure Checklist: Threat eradicated, systems restored, monitoring in place, stakeholders notified, documentation complete

Alert Triage Methodology

Triage is the most critical skill for Tier 1 analysts. Effective triage reduces noise, prevents alert fatigue, and ensures real threats are escalated promptly. Poor triage leads to missed incidents or wasted resources.

The Triage Decision Tree

START: New Alert Received
    ↓
Question 1: Is this activity malicious or suspicious?
    ↓
    NO → FALSE POSITIVE
        → Document reason
        → Tune detection rule if needed
        → Close ticket
    ↓
    YES → Continue
    ↓
Question 2: Is this activity authorized or expected?
    ↓
    YES → BENIGN TRUE POSITIVE
        → Verify authorization
        → Document exception
        → Add to whitelist if recurring
        → Close ticket
    ↓
    NO → TRUE POSITIVE
        → Assess severity and urgency
        → Escalate to Tier 2
        → Begin containment if time-critical
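
The decision tree above reduces to two yes/no questions, which can be sketched in Python as follows. The function and enum names are hypothetical; the follow-up actions are copied from the tree.

```python
from enum import Enum


class Disposition(Enum):
    FALSE_POSITIVE = "FP"
    BENIGN_TRUE_POSITIVE = "BTP"
    TRUE_POSITIVE = "TP"


# Follow-up actions taken from the decision tree.
NEXT_STEPS = {
    Disposition.FALSE_POSITIVE: [
        "Document reason", "Tune detection rule if needed", "Close ticket"],
    Disposition.BENIGN_TRUE_POSITIVE: [
        "Verify authorization", "Document exception",
        "Add to whitelist if recurring", "Close ticket"],
    Disposition.TRUE_POSITIVE: [
        "Assess severity and urgency", "Escalate to Tier 2",
        "Begin containment if time-critical"],
}


def triage(is_suspicious: bool, is_authorized: bool) -> Disposition:
    """Apply Question 1, then Question 2, in the same order as the tree."""
    if not is_suspicious:
        return Disposition.FALSE_POSITIVE
    if is_authorized:
        return Disposition.BENIGN_TRUE_POSITIVE
    return Disposition.TRUE_POSITIVE


disposition = triage(is_suspicious=True, is_authorized=False)
print(disposition.value, NEXT_STEPS[disposition])
```

The two booleans are the analyst's judgment after checking context; the code only records the outcome and the actions that follow from it.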
            

Detailed Triage Categories

1. True Positive (TP) - Real Threat

Indicators:

Known malicious IP/domain communication
Malware file hash match on VirusTotal
Exploitation of known vulnerability
Credential theft or brute force success
Unauthorized data access or exfiltration
Command-and-control (C2) beaconing

Example: An EDR alert fires for PowerShell executing encoded commands; investigation shows the download of a Cobalt Strike beacon from a known malicious domain.

Action: Escalate immediately with HIGH severity. Include all context: affected user/host, IOCs, initial containment actions.

2. False Positive (FP) - Benign Activity

Common Causes:

Overly broad detection signatures
Legitimate tools flagged as malicious (Admin tools, pentesting software)
Normal business processes triggering behavioral rules
Outdated threat intelligence (old IOCs)
Misconfigured security tools

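Example: An IDS alert fires for SQL injection; investigation shows the signature matched SQL keywords in a legitimate application query, with no attack traffic involved. (A scan run by the authorized security team is real attack traffic that happens to be approved, so it belongs under Benign True Positive instead.)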

Action: Close ticket as FP. Document the reason. Submit rule tuning request to reduce future FPs. Consider whitelisting source.

3. Benign True Positive (BTP) - Authorized But Flagged

Common Scenarios:

IT admin using remote access tools outside business hours
Developer accessing production database per change control
Security team running penetration tests
Authorized third-party vendor access
Unusual but legitimate user travel (VPN from foreign country)

Example: Alert for abnormal login time and location. User confirms they are traveling internationally for business.

Action: Verify authorization through ticketing system, email, or manager confirmation. Document justification. Add exception if recurring.

Triage Best Practices

Speed vs. Accuracy

Balance is critical. Rapid triage prevents alert backlog, but rushing leads to missed threats. Aim for 5-15 minutes per alert depending on complexity.

Context is Everything

Never make decisions based solely on the alert. Check user role, system criticality, time of day, geolocation, recent changes.

Document Everything

Your notes may be reviewed during audits or legal proceedings. Include what you checked, why you made your decision, and next steps.

When in Doubt, Escalate

It's better to escalate a questionable alert than to close a real incident. Tier 2 can always de-escalate if needed.

Common Triage Mistakes

Confirmation Bias: Seeing what you expect rather than what's there (assuming all alerts are FPs)
Alert Fatigue: Closing alerts without investigation due to high volume
Insufficient Context: Making decisions without checking related logs
Over-Reliance on Severity: Dismissing low-severity alerts that are part of a larger attack
Poor Documentation: Not recording triage logic for future reference

Escalation Procedures and Criteria

Effective escalation ensures that the right expertise is applied to each incident while preventing bottlenecks. Understanding when, how, and to whom to escalate is essential for SOC efficiency.

Escalation Decision Matrix

LOW Severity

Tier 1 Handles

Single user affected, no data loss, known FP pattern, standard remediation available

Example: Phishing email blocked by gateway

MEDIUM Severity

Escalate to Tier 2

Multiple users, confirmed malware, lateral movement indicators, requires investigation

Example: Trojan detected on workstation, contained but needs analysis

HIGH Severity

Escalate to Tier 3

Critical systems, data exfiltration, advanced techniques, zero-day exploit

Example: Ransomware encryption across file servers

CRITICAL Severity

Incident Commander

Active breach, widespread impact, executive involvement, regulatory reporting needed

Example: Nation-state APT with confirmed data exfiltration
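
The matrix above amounts to a lookup from severity to owner. Here is a minimal sketch, assuming severities arrive as free-text strings; the function name is hypothetical, and a real SOC would likely add per-severity SLAs and on-call contacts.

```python
# Severity-to-owner mapping copied from the escalation decision matrix.
ESCALATION_TARGET = {
    "LOW": "Tier 1",
    "MEDIUM": "Tier 2",
    "HIGH": "Tier 3",
    "CRITICAL": "Incident Commander",
}


def escalation_target(severity: str) -> str:
    """Return who owns an alert of the given severity."""
    try:
        return ESCALATION_TARGET[severity.strip().upper()]
    except KeyError:
        # Fail loudly: silently defaulting an unknown severity could hide a real incident.
        raise ValueError(f"unknown severity: {severity!r}") from None


print(escalation_target("high"))
```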

Escalation Criteria by Category

Technical Escalation (Tier 1 → Tier 2)

Escalate when:

Confirmed malware that bypassed preventive controls
Successful exploitation of a vulnerability
Evidence of credential compromise or privilege escalation
Lateral movement between systems detected
Data exfiltration indicators (large uploads, unusual protocols)
Multiple related alerts suggesting coordinated attack
Investigation requires forensic tools or deep analysis
Runbook doesn't cover the scenario

Management Escalation (SOC → Leadership)

Escalate when:

Incident affects executive systems or data
Business-critical systems are compromised or unavailable
Suspected data breach requiring regulatory notification
Media or public attention likely
Attack suggests targeted campaign or APT
Financial fraud or wire transfer compromise
Response requires significant business decisions (shutdown systems, notify customers)

External Escalation

Legal / Compliance

PII/PHI data breach
Regulatory reporting required (GDPR, HIPAA, PCI)
Law enforcement involvement needed
Contractual breach notification

Executive Management

Business continuity impact
Reputational risk
Strategic decision required
Major financial impact

IT Operations

System patching required
Network changes needed
Service restoration
Configuration changes

External IR / Law Enforcement

Capabilities exceeded
Criminal investigation
Advanced forensics needed
Nation-state actor

Effective Escalation Communication

GOOD Escalation Example:

SUBJECT: [HIGH] Confirmed Malware - User jdoe - Finance Workstation

SUMMARY: Tier 1 confirmed malware on Finance user workstation. System isolated.
Escalating for malware analysis and scope determination.

DETAILS:
- Alert: EDR behavioral detection "Suspicious PowerShell Activity"
- User: jdoe (Finance Department - Payroll Access)
- Host: FIN-WS-042 (10.20.30.42)
- Time: 2025-12-21 14:32 UTC
- Initial Triage: User clicked email link, PowerShell downloaded and executed file from
  hxxp://malicious-domain[.]com/payload.exe
- VirusTotal: 42/70 engines detect as Emotet variant
- Actions Taken: Host isolated via EDR, user notified, manager informed
- Urgency: User has access to payroll systems and bank account information

ESCALATION REASON: Requires malware analysis, lateral movement check, and
credential reset scope determination.

ATTACHMENTS: Screenshot of EDR alert, VirusTotal report, initial timeline
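
The URL in the example above is written in defanged form (hxxp and [.]) so that nobody can click it by accident. A minimal sketch of that convention follows, assuming only the scheme and host need rewriting; the function name is hypothetical and other defanging styles exist.

```python
from urllib.parse import urlsplit


def defang_url(url: str) -> str:
    """Rewrite a URL so mail clients and ticketing tools will not auto-link it."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    scheme = parts.scheme.replace("http", "hxxp", 1)  # http -> hxxp, https -> hxxps
    host = parts.netloc.replace(".", "[.]")           # break the domain or IP address
    defanged = f"{scheme}://{host}{parts.path}"
    if parts.query:
        defanged += f"?{parts.query}"
    if parts.fragment:
        defanged += f"#{parts.fragment}"
    return defanged


print(defang_url("http://malicious-domain.com/payload.exe"))
# prints hxxp://malicious-domain[.]com/payload.exe
```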
            

BAD Escalation Example (Don't Do This):

SUBJECT: Alert

There's an alert on some computer. Looks bad. Can someone check it out?

Problems: No context, no urgency, no details, no actions taken, unprofessional

Escalation Best Practices

Include all relevant context (who, what, when, where, why)
State what you've already checked and ruled out
Clearly articulate why escalation is needed
Attach supporting evidence (logs, screenshots, reports)
Set appropriate urgency/severity level
Provide your initial assessment and recommendations
Follow your SOC's escalation SLA (typically 15-30 min for high severity)

Shift Handoff and Communication

Effective shift handoff is critical in a 24/7 SOC. Poor handoffs lead to missed incidents, duplicated work, and delayed escalations. Treat handoff as a formal process, not an afterthought.

Why Shift Handoff Matters

Consequences of Poor Handoff:

Ongoing incidents fall through the cracks
Next shift re-investigates already triaged alerts (wasted effort)
Context is lost, delaying incident response
Escalations are delayed or forgotten
Alert fatigue increases from duplicate work

Shift Handoff Components

1. Ongoing Incidents

For each active incident, document:

Incident ID and Summary: Ticket number and one-line description
Current Status: Investigation, containment, waiting for external input
Actions Taken: What has been done so far
Next Steps: What needs to happen next and by when
Waiting On: Any blockers or dependencies (IT team, vendor response)
Severity/Urgency: How critical is immediate action

2. Pending Alerts

Alert queue status:

Backlog Count: How many alerts are still untriaged
Priority Alerts: Any high-severity alerts that need immediate attention
Trends: Spike in specific alert types (may indicate ongoing attack)
Known Issues: Tool malfunctions causing alert storms

3. Escalations and Follow-ups

Track escalated items:

Tickets escalated to Tier 2/3 awaiting response
Items escalated to IT/management needing follow-up
Expected callback times from vendors or external teams
Scheduled maintenance or changes that may cause alerts

4. Environmental Notes

Situational awareness:

Scheduled maintenance windows (patching, upgrades)
Known false positive sources being investigated
New detection rules deployed (may cause alert increase)
Tool outages or degraded performance
Ongoing threat campaigns (phishing wave, ransomware targeting industry)

Handoff Format Example

SOC SHIFT HANDOFF REPORT
Shift: Day Shift (08:00 - 16:00 UTC)
Date: 2025-12-21
Analyst: Sarah Chen
Next Shift: Evening (16:00 - 00:00 UTC) - Mike Johnson

═══════════════════════════════════════════════════════════════

ONGOING INCIDENTS (Action Required):

[INC-12456] HIGH - Suspected Credential Stuffing Attack
  Status: Investigation in progress
  Summary: Multiple failed logins from distributed IPs targeting VPN portal
  Actions Taken:
    - Identified 47 targeted accounts
    - Blocked 23 malicious IPs via firewall
    - Notified affected users to reset passwords
  Next Steps:
    - Tier 2 performing log analysis for successful logins (ETA 17:00)
    - Monitor for additional login attempts
  Urgency: HIGH - Active attack

[INC-12461] MEDIUM - Malware Quarantined on HR Workstation
  Status: Containment complete, awaiting final verification
  Summary: Emotet trojan quarantined by EDR, no execution occurred
  Actions Taken:
    - EDR quarantined file automatically
    - Verified no C2 communication
    - User educated on phishing
  Next Steps:
    - IT patching system tonight
    - Close ticket after patch verification tomorrow
  Urgency: LOW - Contained

═══════════════════════════════════════════════════════════════

ALERT QUEUE STATUS:

Total Pending: 12 alerts
  - HIGH: 0
  - MEDIUM: 3 (prioritize IDS alerts from DMZ)
  - LOW: 9

Trend: Increase in blocked phishing emails (34 today vs. 12 avg)
  - Appears to be targeting Finance department
  - Threat intel notified, investigating campaign

═══════════════════════════════════════════════════════════════

ESCALATIONS PENDING RESPONSE:

- Ticket #12450: Escalated to IT Ops for patch deployment (waiting since 12:00)
- Ticket #12455: Escalated to Tier 2 for C2 beacon analysis (under investigation)

═══════════════════════════════════════════════════════════════

ENVIRONMENTAL NOTES:

- Scheduled firewall maintenance 20:00-22:00 (may lose connectivity alerts)
- New SIEM rule deployed for detecting Kerberoasting (may see initial FPs)
- CISO requested daily summary of phishing metrics (send by EOD)

═══════════════════════════════════════════════════════════════

METRICS (This Shift):

Alerts Triaged: 87
  - True Positives: 3
  - False Positives: 76
  - Benign True Positives: 8

Incidents Created: 4
Escalations: 2
MTTD: 8 minutes
MTTR: 32 minutes

═══════════════════════════════════════════════════════════════

Questions? Contact me: sarah.chen@company.com / ext. 5423
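
An incident entry like the ones in the report above can be generated from structured data, which also makes it easy to flag fields left empty before handoff. This is a minimal sketch; the class, its field names, and the choice of required fields are assumptions modeled on the sample report, not a standard format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandoffIncident:
    ticket: str
    severity: str
    title: str
    status: str = ""
    summary: str = ""
    actions_taken: List[str] = field(default_factory=list)
    next_steps: List[str] = field(default_factory=list)
    urgency: str = ""

    def gaps(self) -> List[str]:
        """Return the names of handoff fields that were left empty."""
        required = ("status", "summary", "actions_taken", "next_steps", "urgency")
        return [name for name in required if not getattr(self, name)]

    def render(self) -> str:
        """Format the incident the way the sample report lays it out."""
        lines = [
            f"[{self.ticket}] {self.severity} - {self.title}",
            f"  Status: {self.status}",
            f"  Summary: {self.summary}",
            "  Actions Taken:",
            *[f"    - {item}" for item in self.actions_taken],
            "  Next Steps:",
            *[f"    - {item}" for item in self.next_steps],
            f"  Urgency: {self.urgency}",
        ]
        return "\n".join(lines)


incident = HandoffIncident(
    ticket="INC-12461",
    severity="MEDIUM",
    title="Malware Quarantined on HR Workstation",
    status="Containment complete, awaiting final verification",
    summary="Emotet trojan quarantined by EDR, no execution occurred",
    actions_taken=["EDR quarantined file automatically", "Verified no C2 communication"],
    urgency="LOW - Contained",
)
print(incident.gaps())  # next_steps was left empty on purpose
print(incident.render())
```

Checking `gaps()` before sending the report catches an entry that leaves the incoming shift without next steps.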
            

Handoff Best Practices

Document During Shift

Don't wait until handoff time to document. Update your handoff notes throughout the shift so you don't forget critical details.

Verbal + Written

Overlap shifts by 15-30 minutes where possible to allow a verbal handoff. Walk through the critical items. The written document is a backup, not a replacement.

Flag the Critical

Clearly mark items needing immediate attention. Use severity tags, bold text, or highlighting so they stand out.

Encourage Questions

Make sure the incoming shift understands the handoff and has your contact info. Ambiguity leads to mistakes.

Common Handoff Failures

"Everything's fine": Even if quiet, document what you checked and current queue status
Incomplete Context: "Some server has an issue" - be specific about what, where, severity
No Next Steps: Leaving incoming shift to figure out what to do next
Lost Escalations: Forgetting to mention items sent to Tier 2 or management
Tribal Knowledge: Assuming next shift knows about ongoing situations or tool quirks

SOC Metrics and Key Performance Indicators

Metrics provide visibility into SOC performance, identify areas for improvement, and demonstrate value to stakeholders. However, metrics must be meaningful and actionable—avoid "vanity metrics" that look good but don't drive improvement.

Core SOC Metrics

1. Mean Time to Detect (MTTD)

MTTD

12 minutes

Definition: Average time from when an attack begins to when it's detected

Calculation: Sum of (Detection Time - Incident Start Time) / Number of Incidents

Target: <15 minutes for most organizations, <5 minutes for high-security environments

Why It Matters: Faster detection limits attacker dwell time and reduces damage

Improvement Strategies: Better detection rules, threat intelligence integration, proactive hunting

2. Mean Time to Respond (MTTR)

MTTR

45 minutes

Definition: Average time from detection to containment/remediation

Calculation: Sum of (Resolution Time - Detection Time) / Number of Incidents

Target: <60 minutes for high-severity incidents, varies by severity

Why It Matters: Speed of response directly correlates with limiting blast radius

Improvement Strategies: Automated response playbooks, runbooks, training, clear escalation paths
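
The MTTD and MTTR calculations can be sketched in a few lines of Python. This assumes each incident is recorded as three timestamps (attack start, detection, resolution); the function names and sample incidents are made up for illustration.

```python
from datetime import datetime
from statistics import mean


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def mttd_mttr(incidents):
    """incidents: iterable of (started, detected, resolved) datetimes."""
    incidents = list(incidents)
    if not incidents:
        raise ValueError("need at least one incident")
    # MTTD: mean of (Detection Time - Incident Start Time)
    mttd = mean(minutes_between(started, detected) for started, detected, _ in incidents)
    # MTTR: mean of (Resolution Time - Detection Time)
    mttr = mean(minutes_between(detected, resolved) for _, detected, resolved in incidents)
    return mttd, mttr


# Two hypothetical incidents: detected after 10 and 14 minutes, resolved 40 and 50 minutes later.
sample = [
    (datetime(2025, 12, 21, 10, 0), datetime(2025, 12, 21, 10, 10), datetime(2025, 12, 21, 10, 50)),
    (datetime(2025, 12, 21, 12, 0), datetime(2025, 12, 21, 12, 14), datetime(2025, 12, 21, 13, 4)),
]
print(mttd_mttr(sample))  # (12.0, 45.0)
```

In practice the incident start time is usually an estimate reconstructed during investigation, so MTTD is only as reliable as that estimate.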

3. Alert Volume and Triage Metrics

Alerts per Day

1,247

Total security alerts generated. High volume indicates potential tuning needs.

True Positive Rate

4.2%

Percentage of alerts that are real incidents. Typical: 2-10%

False Positive Rate

87.3%

Percentage of alerts that are benign. Goal: <90%, ideally <80%

Benign TP Rate

8.5%

Authorized activity flagged as suspicious. Whitelist candidates.

Alert Volume Health Check:

<500 alerts/day: Possibly insufficient monitoring coverage or over-tuned rules
500-2000 alerts/day: Healthy for medium-sized organizations
2000-5000 alerts/day: Manageable with automation and proper staffing
>5000 alerts/day: Risk of alert fatigue, aggressive tuning needed
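
The disposition rates and the volume health check above can be computed directly from triage counts. This is a minimal sketch with hypothetical function names; the counts are made up to reproduce the sample percentages, and the band boundaries follow the list above, with 2000 and 5000 assigned to the lower band as an assumption.

```python
def disposition_rates(tp: int, fp: int, btp: int) -> dict:
    """Percentage of triaged alerts in each disposition."""
    total = tp + fp + btp
    if total == 0:
        raise ValueError("no triaged alerts")
    return {
        "TP": 100 * tp / total,
        "FP": 100 * fp / total,
        "BTP": 100 * btp / total,
    }


def volume_band(alerts_per_day: int) -> str:
    """Map daily alert volume to the health-check bands."""
    if alerts_per_day < 500:
        return "possibly insufficient coverage or over-tuned rules"
    if alerts_per_day <= 2000:
        return "healthy for medium-sized organizations"
    if alerts_per_day <= 5000:
        return "manageable with automation and proper staffing"
    return "risk of alert fatigue, aggressive tuning needed"


# 1,000 triaged alerts split to match the sample rates of 4.2%, 87.3%, and 8.5%.
print(disposition_rates(tp=42, fp=873, btp=85))
print(volume_band(1247))
```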

4. Incident Metrics

Incidents per Month

23

Total confirmed security incidents requiring response

Critical Incidents

2

High-impact incidents requiring management notification

Incident Backlog

5

Open incidents not yet resolved. Should trend toward zero.

Recurring Incidents

18%

Percentage of repeat incidents. Indicates root cause not addressed.

5. Coverage and Visibility Metrics

Log Source Coverage

94% of critical assets sending logs to SIEM

Target: >95% for critical systems, >80% for all systems

EDR Deployment

98% of endpoints with EDR agent installed

Target: >95% for workstations, 100% for servers

MITRE ATT&CK Coverage

78% of techniques have detection coverage

Target: >70% overall, >90% for high-priority techniques

Detection Rule Health

87% of rules firing in last 30 days

Dead rules should be reviewed and retired/improved
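
Detection rule health can be derived from the last time each rule fired. The sketch below assumes that timestamp can be exported from the SIEM as a mapping of rule name to datetime (or None for rules that never fired); the rule names and function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def dead_rules(last_fired: Dict[str, Optional[datetime]], now: datetime,
               window_days: int = 30) -> List[str]:
    """Rules that have not fired within the window, or have never fired."""
    cutoff = now - timedelta(days=window_days)
    return sorted(name for name, fired in last_fired.items()
                  if fired is None or fired < cutoff)


def rule_health(last_fired: Dict[str, Optional[datetime]], now: datetime,
                window_days: int = 30) -> float:
    """Percentage of rules that fired at least once within the window."""
    if not last_fired:
        raise ValueError("no rules supplied")
    dead = dead_rules(last_fired, now, window_days)
    return 100 * (len(last_fired) - len(dead)) / len(last_fired)


now = datetime(2025, 12, 21)
rules = {
    "kerberoasting": datetime(2025, 12, 20),
    "brute_force_vpn": datetime(2025, 12, 1),
    "legacy_smbv1": datetime(2025, 10, 1),
    "never_fired": None,
}
print(dead_rules(rules, now))   # ['legacy_smbv1', 'never_fired']
print(rule_health(rules, now))  # 50.0
```

A rule that never fires is not necessarily broken; it may cover a rare technique, which is why the output is a review list and not a deletion list.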

6. Analyst Performance Metrics

Use with Caution: Analyst metrics can be helpful for training but should never be weaponized. Focusing too heavily on individual metrics creates perverse incentives (e.g., closing tickets quickly without proper investigation).

Alerts Triaged per Shift: Productivity indicator (typical: 50-100 depending on complexity)
Triage Accuracy: Percentage of triage decisions upheld by Tier 2 review
Escalation Rate: Percentage of alerts escalated (typical: 5-15%)
Documentation Quality: Completeness of ticket notes (subjective, peer-reviewed)
SLA Compliance: Percentage of alerts triaged within SLA time (e.g., 15 min)
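
The ratio-based metrics in this list can be computed from per-shift counts. This is a minimal sketch with hypothetical field names and made-up numbers; in keeping with the caution above, it is meant for coaching and trend review, not for ranking analysts.

```python
from dataclasses import dataclass


def percentage(part: int, whole: int) -> float:
    if whole <= 0:
        raise ValueError("denominator must be positive")
    return 100 * part / whole


@dataclass
class ShiftStats:
    triaged: int            # alerts triaged during the shift
    escalated: int          # alerts escalated to Tier 2 or above
    within_sla: int         # alerts triaged within the SLA time
    reviewed_by_tier2: int  # triage decisions sampled for Tier 2 review
    upheld_by_tier2: int    # reviewed decisions that Tier 2 agreed with

    def triage_accuracy(self) -> float:
        return percentage(self.upheld_by_tier2, self.reviewed_by_tier2)

    def escalation_rate(self) -> float:
        return percentage(self.escalated, self.triaged)

    def sla_compliance(self) -> float:
        return percentage(self.within_sla, self.triaged)


stats = ShiftStats(triaged=80, escalated=8, within_sla=72,
                   reviewed_by_tier2=20, upheld_by_tier2=19)
print(stats.triage_accuracy(), stats.escalation_rate(), stats.sla_compliance())
```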

How to Use Metrics Effectively

Trend Over Time: Don't obsess over single data points. Look for trends over weeks/months.
Context Matters: Spike in alerts may be due to new detection rule, not worsening security.
Drive Action: Metrics should lead to concrete improvements (tuning, training, tool changes).
Communicate Value: Use metrics to show leadership the SOC's impact and justify resources.
Balance Leading and Lagging: MTTD/MTTR are lagging (measure past). Coverage is leading (predict future).
Avoid Vanity Metrics: "Blocked 1 million threats" sounds impressive but lacks context. What's the trend? The impact?

Dashboard Example

SOC WEEKLY DASHBOARD - Week of 2025-12-15

┌─────────────────────────────────────────────────────────────┐
│ DETECTION & RESPONSE                                        │
├─────────────────────────────────────────────────────────────┤
│ Mean Time to Detect (MTTD):   11 min    [↓ -2 min]    ✓     │
│ Mean Time to Respond (MTTR):  38 min    [↓ -7 min]    ✓     │
│ Incidents Created:            18        [↑ +3]              │
│ Critical Incidents:           1         [→ same]            │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ ALERT METRICS                                               │
├─────────────────────────────────────────────────────────────┤
│ Total Alerts:                 8,734     [↑ +12%]            │
│ True Positive Rate:           3.8%      [→ same]            │
│ False Positive Rate:          88.1%     [↑ +3%]             │
│ Avg Triage Time:              7 min     [→ same]            │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│ COVERAGE                                                    │
├─────────────────────────────────────────────────────────────┤
│ Log Source Coverage:          94%       [→ same]            │
│ EDR Agent Deployment:         97%       [↑ +1%]       ✓     │
│ Detection Rules Active:       247       [↑ +5]              │
└─────────────────────────────────────────────────────────────┘

KEY FINDINGS:
  ✓ Response times improved due to new SOAR playbooks
  - Alert volume spike from new cloud monitoring (tuning in progress)
  - FP rate increase due to new rule deployment (tuning scheduled)

SOC Runbooks and Playbooks

Runbooks and playbooks are essential SOC documentation that standardizes incident response, reduces decision fatigue, and ensures consistent handling of common scenarios. They're especially critical for Tier 1 analysts who may encounter unfamiliar situations.

Runbook vs. Playbook

Runbook

Step-by-Step Investigation Guide

Purpose: Provide detailed instructions for investigating a specific alert type or scenario

Scope: Single alert type or specific detection rule

Audience: Primarily Tier 1 analysts

Example: "Runbook for Investigating Brute Force Login Alerts"

Contents: Triage steps, data to collect, decision tree, escalation criteria

Playbook

End-to-End Response Workflow

Purpose: Orchestrate complete incident response from detection through recovery

Scope: Entire incident category or attack type

Audience: All SOC tiers, IR team, IT, management

Example: "Ransomware Incident Response Playbook"

Contents: Response phases, roles & responsibilities, communication plan, containment/eradication steps

Sample Runbook: Malware Detection Alert

RUNBOOK: EDR Malware Detection Alert
Version: 2.1 | Last Updated: 2025-12-01 | Owner: SOC Team

═══════════════════════════════════════════════════════════════

1. INITIAL TRIAGE (5 minutes)

□ Verify alert details in EDR console (CrowdStrike/SentinelOne)
  - Alert timestamp, severity, detection method
  - Affected hostname and IP address
  - Username and user department
  - Malware family/type identified
  - Current host status (online, isolated, offline)

□ Check VirusTotal/hybrid-analysis.com for file hash
  - Search by hash (do NOT upload the file itself)
  - Document detection ratio (e.g., 45/70)
  - Note malware family classification

□ Determine if system is already isolated
  - If YES: Proceed to investigation
  - If NO: Consider immediate isolation if high severity

DECISION POINT:
  - 0-9 vendors detect: Likely FP → Verify with vendor, document, close
  - 10-29 vendors detect: Investigate further (proceed to step 2)
  - 30+ vendors detect: Confirmed malware → Isolate immediately, escalate

═══════════════════════════════════════════════════════════════

2. CONTEXT GATHERING (10 minutes)

□ User Information
  - Check Active Directory: User role, department, privileges
  - Contact user: Ask about recent downloads, email clicks
  - Recent access: Check if user accessed sensitive systems

□ Host Information
  - Asset criticality: Production server? Executive workstation?
  - Check CMDB for system purpose and data classification
  - Recent changes: Software installs, patches applied?

□ Malware Execution Status
  - Was file executed or just downloaded?
  - Check EDR process tree for indicators of execution
  - Look for persistence mechanisms (registry, scheduled tasks)

□ Network Activity
  - Check firewall/proxy logs for C2 communication
  - Look for data exfiltration (large uploads, unusual protocols)
  - Identify any lateral movement attempts

DECISION POINT:
  - File quarantined before execution: Lower priority, likely can handle at Tier 1
  - File executed with C2 communication: ESCALATE TO TIER 2 IMMEDIATELY
  - Multiple hosts affected: ESCALATE TO TIER 2 IMMEDIATELY

═══════════════════════════════════════════════════════════════

3. CONTAINMENT (If Not Already Isolated)

□ For confirmed malware with execution:
  - Isolate host via EDR (prevents network communication)
  - Disable user account in AD
  - Block C2 IPs/domains at firewall
  - Alert other teams (IT, IR) via Slack/email

□ Document containment actions with timestamps

═══════════════════════════════════════════════════════════════

4. TRIAGE CLASSIFICATION

FALSE POSITIVE - Close ticket if:
  ✓ Low VirusTotal detection (<10)
  ✓ File is known legitimate software
  ✓ Vendor confirms as FP
  → Action: Document reason, submit FP to vendor, close

TRUE POSITIVE (Tier 1 Handles) - Continue if:
  ✓ Malware quarantined before execution
  ✓ No C2 communication detected
  ✓ Single host affected
  ✓ Not a critical system or privileged user
  → Action: Proceed to step 5

TRUE POSITIVE (Escalate to Tier 2) - Escalate if:
  ✗ Malware executed successfully
  ✗ C2 communication detected
  ✗ Multiple hosts affected
  ✗ Critical system or executive user
  ✗ Ransomware indicators
  → Action: Create high-priority escalation ticket with all context

═══════════════════════════════════════════════════════════════

5. REMEDIATION (Tier 1 - Simple Cases Only)

□ Verify malware quarantined by EDR
□ Run full endpoint scan to ensure no other malware
□ Check for scheduled tasks, registry run keys (persistence)
□ User education: Send phishing awareness reminder
□ Document incident in ticket with full timeline
□ Un-isolate host once verified clean
□ Re-enable user account
□ Monitor for 24 hours for re-infection

═══════════════════════════════════════════════════════════════

6. DOCUMENTATION

Required fields in ticket:
  - Malware family and hash
  - VirusTotal detection ratio
  - Execution status (yes/no)
  - C2 communication (yes/no)
  - User notification (yes/no)
  - Containment actions taken
  - Final disposition (TP = true positive, FP = false positive, BTP = benign true positive)

═══════════════════════════════════════════════════════════════

ESCALATION CRITERIA:
  → Malware executed or C2 communication detected
  → Ransomware indicators
  → Multiple hosts infected
  → Executive or privileged user, or critical system
  → Unusual/unknown malware family
  → Analyst unsure of next steps

CONTACTS:
  Tier 2 Escalation: tier2@company.com / Slack #soc-tier2
  EDR Support: edr-support@company.com
  User Support: helpdesk@company.com
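
The required ticket fields in step 6 lend themselves to an automated completeness check before a ticket is closed. The following is a minimal sketch, not part of any real ticketing product: the `MalwareTicket` class, its field names, and the `missing_fields` helper are hypothetical names chosen to mirror the runbook's list. It also assumes that containment must be recorded explicitly, so an analyst who took no action would enter "none" rather than leave the field empty.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Disposition codes taken from the runbook's "Final disposition" field.
VALID_DISPOSITIONS = {"TP", "FP", "BTP"}


@dataclass
class MalwareTicket:
    """Hypothetical ticket record mirroring the runbook's required fields."""
    malware_family: str = ""
    file_hash: str = ""
    vt_detection_ratio: str = ""             # free text, e.g. "<flagged>/<total>"
    executed: Optional[bool] = None          # None means "not yet answered"
    c2_communication: Optional[bool] = None
    user_notified: Optional[bool] = None
    containment_actions: List[str] = field(default_factory=list)
    disposition: str = ""


def missing_fields(ticket: MalwareTicket) -> List[str]:
    """Return the names of required fields that are still unanswered."""
    missing = []
    for name in ("malware_family", "file_hash", "vt_detection_ratio"):
        if not getattr(ticket, name).strip():
            missing.append(name)
    for name in ("executed", "c2_communication", "user_notified"):
        if getattr(ticket, name) is None:
            missing.append(name)
    # Assumption: at least one entry is required, even if it is just "none".
    if not ticket.containment_actions:
        missing.append("containment_actions")
    if ticket.disposition not in VALID_DISPOSITIONS:
        missing.append("disposition")
    return missing
```

A ticket is ready to close when `missing_fields(ticket)` returns an empty list; otherwise the returned names tell the analyst exactly what is still outstanding.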
            

Key Elements of Effective Runbooks

Clear Steps

Use checkboxes, numbered steps, and action verbs. Avoid ambiguity like "check for suspicious activity"—specify what to check and where.

Time Estimates

Give an expected duration for each step. Estimates help analysts manage their time and recognize when they are going down a rabbit hole.

Decision Points

Clearly defined "if this, then that" logic helps analysts make confident triage decisions.
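
As an illustration, the triage classification in step 4 of the malware runbook can be written as explicit decision logic. This is a sketch under stated assumptions rather than a definitive implementation: the `AlertFacts` fields and the `triage` function are hypothetical names, the false-positive check is assumed to run first, and either "known legitimate software" or "vendor confirms FP" is assumed sufficient on its own (the runbook does not say whether all FP criteria must hold).

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    FALSE_POSITIVE = "close as false positive"
    TIER1_HANDLES = "Tier 1 remediates"
    ESCALATE_TIER2 = "escalate to Tier 2"


@dataclass
class AlertFacts:
    """Facts gathered during investigation (hypothetical field names)."""
    executed: bool
    c2_detected: bool
    hosts_affected: int
    critical_or_privileged: bool   # critical system, privileged or executive user
    ransomware_indicators: bool
    known_legitimate: bool = False
    vendor_confirmed_fp: bool = False


def triage(facts: AlertFacts) -> Disposition:
    # Assumption: either FP criterion alone is enough to close the ticket.
    if facts.known_legitimate or facts.vendor_confirmed_fp:
        return Disposition.FALSE_POSITIVE
    # Any single escalation criterion from step 4 sends the case to Tier 2.
    must_escalate = (
        facts.executed
        or facts.c2_detected
        or facts.hosts_affected > 1
        or facts.critical_or_privileged
        or facts.ransomware_indicators
    )
    if must_escalate:
        return Disposition.ESCALATE_TIER2
    # Quarantined before execution, no C2, single non-critical host.
    return Disposition.TIER1_HANDLES
```

Writing the rule this way forces the runbook author to decide how criteria combine, which is exactly the ambiguity a good decision point removes.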

Examples

Include screenshots, sample logs, and example scenarios to illustrate concepts.

Contact Info

Who to escalate to, who to call for help, and where to find additional resources.

Version Control

Date, version number, and owner. Outdated runbooks are worse than no runbooks.

Common Runbook Topics

Brute Force / Password Spray Attacks
Phishing Email Investigations
Malware Detection (EDR/AV alerts)
Suspicious PowerShell Activity
Unusual Login Location / Impossible Travel
DDoS Attack Response
Data Exfiltration Indicators
Privilege Escalation Attempts
Web Application Attacks (SQLi, XSS)
Insider Threat Indicators

Runbook Maintenance Best Practices

Review and update runbooks quarterly or after major incidents
Incorporate lessons learned from post-incident reviews
Get feedback from analysts who actually use them
Test runbooks during tabletop exercises
Retire outdated runbooks rather than letting them accumulate
Make runbooks easily searchable (wiki, Confluence, SharePoint)
Include runbooks in new analyst onboarding training
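
The quarterly review cadence and the version metadata described earlier (date, version number, owner) can be checked mechanically. The sketch below assumes runbook metadata is available as simple records; the `Runbook` type, the `review_findings` function, and the 90-day reading of "quarterly" are all illustrative assumptions.

```python
from datetime import date, timedelta
from typing import List, NamedTuple, Tuple

# Assumption: "quarterly" is approximated as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


class Runbook(NamedTuple):
    """Hypothetical metadata record for one runbook."""
    title: str
    version: str
    owner: str
    last_reviewed: date


def review_findings(runbooks: List[Runbook], today: date) -> List[Tuple[str, str]]:
    """Return (title, problem) pairs for runbooks that need attention."""
    findings = []
    for rb in runbooks:
        if not rb.owner.strip():
            findings.append((rb.title, "no owner assigned"))
        if today - rb.last_reviewed > REVIEW_INTERVAL:
            findings.append((rb.title, "review overdue"))
    return findings
```

Running a check like this on a schedule turns "review quarterly" from an intention into a list of named, overdue documents.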

Runbook Anti-Patterns

Too Vague: "Investigate the alert" without specifics
Too Rigid: No room for analyst judgment or unusual scenarios
Outdated: References tools or processes no longer in use
Overly Complex: 50-page document when 2 pages would suffice
No Ownership: No one accountable for keeping it current

Analyst Well-Being and Burnout Prevention

SOC analyst burnout is a critical issue in cybersecurity. The combination of high stress, shift work, alert fatigue, and constant exposure to threats takes a toll. Sustainable SOC operations require proactive attention to analyst well-being.

The SOC Burnout Crisis

Commonly cited industry figures and observations (exact numbers vary by survey):

Average SOC analyst tenure: 18-24 months (high turnover)
70% of SOC analysts report high stress levels
Alert fatigue cited as top reason for leaving SOC roles
24/7 shift work disrupts sleep and personal life
Constant exposure to threats can lead to anxiety and cynicism

Primary Burnout Contributors

1. Alert Fatigue

The Problem: Hundreds or thousands of alerts daily, most of which are false positives. Analysts become desensitized and may miss real threats.

Solutions:

Aggressive false positive tuning and alert reduction programs
Automation of low-value triage tasks via SOAR
Alert prioritization and risk-based routing
Regular "alert health" reviews to retire noisy, low-value rules
Give analysts permission to question and challenge unhelpful alerts
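
An "alert health" review can start from a simple calculation: for each detection rule, how many alerts did it produce and what share were closed as false positives? The sketch below assumes closed alerts can be exported as (rule name, disposition) pairs using the TP/FP/BTP codes; the `noisy_rules` function and its thresholds (`min_alerts`, `fp_threshold`) are hypothetical and should be tuned to the environment.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisy_rules(
    closed_alerts: Iterable[Tuple[str, str]],
    min_alerts: int = 50,
    fp_threshold: float = 0.9,
) -> List[Tuple[str, int, float]]:
    """Return (rule, alert_count, fp_rate) for rules that look noisy.

    closed_alerts yields (rule_name, disposition) pairs, where
    disposition is "TP", "FP", or "BTP".
    """
    totals: Counter = Counter()
    false_positives: Counter = Counter()
    for rule, disposition in closed_alerts:
        totals[rule] += 1
        if disposition == "FP":
            false_positives[rule] += 1

    candidates = []
    for rule, total in totals.items():
        fp_rate = false_positives[rule] / total
        # Skip low-volume rules: too few alerts to judge, and little fatigue caused.
        if total >= min_alerts and fp_rate >= fp_threshold:
            candidates.append((rule, total, fp_rate))
    # Highest-volume offenders first, since they cost analysts the most time.
    return sorted(candidates, key=lambda item: item[1], reverse=True)
```

The output is a candidate list for tuning or retirement, not an automatic decision: a rule with a high false-positive rate may still be worth keeping if its rare true positives are severe.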

2. Shift Work Challenges

The Problem: 24/7 coverage requires night shifts, weekend work, and rotating schedules that disrupt circadian rhythms and personal life.

Solutions:

Limit consecutive night shifts (no more than 3-4 in a row)
Allow analyst input on scheduling preferences when possible
Provide adequate shift differential pay for nights/weekends
Consider "follow-the-sun" model with geographically distributed teams
Ensure a minimum rest period between shifts (8-12 hours)
Provide quiet break rooms and encourage regular breaks
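
Two of the limits above, the cap on consecutive night shifts and the minimum rest between shifts, are easy to validate against a draft schedule. The following sketch assumes one analyst's shifts are available as start/end times; the `Shift` type, the `schedule_violations` function, and the chosen constants (4 nights, 8 hours rest, and a 24-hour gap counting as a day off that resets the night count) are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple

MAX_CONSECUTIVE_NIGHTS = 4          # upper end of the 3-4 guideline
MIN_REST = timedelta(hours=8)       # lower end of the 8-12 hour guideline
DAY_OFF_GAP = timedelta(hours=24)   # assumption: a gap this long breaks a night run


class Shift(NamedTuple):
    start: datetime
    end: datetime
    is_night: bool


def schedule_violations(shifts: List[Shift]) -> List[str]:
    """Check one analyst's shifts against the rest and night-shift limits."""
    violations = []
    night_run = 0
    previous = None
    for shift in sorted(shifts, key=lambda s: s.start):
        if previous is not None:
            rest = shift.start - previous.end
            if rest < MIN_REST:
                hours = rest.total_seconds() / 3600
                violations.append(
                    f"{shift.start:%Y-%m-%d}: only {hours:.1f}h rest before shift"
                )
            if rest >= DAY_OFF_GAP:
                night_run = 0
        night_run = night_run + 1 if shift.is_night else 0
        if night_run > MAX_CONSECUTIVE_NIGHTS:
            violations.append(
                f"{shift.start:%Y-%m-%d}: {night_run} consecutive night shifts "
                f"(limit {MAX_CONSECUTIVE_NIGHTS})"
            )
        previous = shift
    return violations
```

Running this per analyst before a schedule is published catches problems while they are still cheap to fix.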

3. Lack of Career Growth

The Problem: Analysts feel stuck in reactive triage work with no clear path forward, leading to frustration and attrition.

Solutions:

Clear career progression paths (Tier 1 → Tier 2 → Tier 3 → Management/Architecture)
Training budgets for certifications (CySA+, GCIH, GCIA)
Rotation opportunities (threat intel, detection engineering, incident response)
Mentorship programs pairing junior analysts with senior staff
Recognition programs for excellent work and continuous improvement
Time allocated for skill development during work hours

4. Repetitive, Low-Value Work

The Problem: Analysts spend 80% of their time on simple triage and closing obvious false positives, which leaves them feeling like "alert button-clickers."

Solutions:

Automate obvious false positives via SOAR playbooks
Allocate time for analysts to work on improvement projects
Encourage detection engineering and rule creation
Threat hunting rotations to break up monotony
Involve analysts in tool selection and process improvement

Organizational Strategies for Preventing Burnout

Realistic Workload

Don't expect analysts to triage 200 alerts per shift. Quality over quantity. Allow time for thorough investigation.

Right Tools

Invest in SIEM, SOAR, and automation. Analysts burn out faster when forced to use clunky, ineffective tools.

Continuous Learning

Provide training, conference attendance, and time to research new attack techniques and defenses.

Psychological Safety

Create a culture where analysts can ask questions, admit mistakes, and escalate without fear of blame.

Recognition

Celebrate wins: incident detection, process improvements, and continuous learning. Analysts need to feel valued.

Work-Life Balance

Enforce PTO usage, respect off-hours, and don't glorify overwork. Burnout helps no one.

Individual Analyst Self-Care

What you can do as an analyst:

Take Breaks: Step away from screens regularly, especially during high-stress incidents
Maintain Sleep Hygiene: Especially critical for shift workers—blackout curtains, consistent sleep schedule
Exercise: Physical activity is proven to reduce stress and improve focus
Build Community: Connect with other SOC analysts, share experiences, learn from each other
Set Boundaries: Don't check work email/Slack on days off unless on-call
Pursue Interests: Hobbies and activities outside cybersecurity provide mental recovery
Seek Support: Don't hesitate to use EAP (Employee Assistance Programs) or talk to a counselor
Know When to Move On: If a role is consistently damaging your health, it's okay to find a different position

Signs of Burnout to Watch For

Emotional Exhaustion: Feeling drained and depleted, with little energy left for work
Reduced Performance: Difficulty concentrating, making mistakes, missing details
Physical Symptoms: Headaches, insomnia, digestive issues, frequent illness
Cynicism: "Nothing matters," "all alerts are false positives," detachment from impact of work
Irritability: Short temper, conflicts with colleagues, negative attitude
Absenteeism: Calling in sick more often, dreading going to work

If you notice these signs in yourself or colleagues, speak up and seek support.

Remember: You're Protecting People

SOC work can feel thankless—most of what you do prevents incidents that never happen. But your work matters enormously. You protect:

Customer data and privacy
Employee personal information
Business operations and revenue
Your organization's reputation
Jobs and livelihoods of your colleagues

Your vigilance, even when handling the 100th false positive of the day, keeps the organization safe. That is valuable work, and it deserves both respect and sustainable working conditions.