Basic Regular Expressions

Pattern matching for log analysis and threat hunting

Week 4 TOPIC

Regular Expressions in Security

Regular expressions (regex) are powerful pattern matching tools essential for SOC analysts. They are used in SIEM queries (Splunk, Sentinel, Elastic), log analysis, threat hunting, IDS/IPS rules (Snort, Suricata), DLP policies, and parsing security data. A SOC analyst who cannot write regex is limited to pre-built queries -- mastering regex unlocks the ability to hunt for any pattern in any log source.

Where Regex Appears in SOC Work

SIEM Queries

Splunk's rex command, Sentinel's extract(), Elastic's grok patterns -- all use regex to parse unstructured log data into searchable fields. Example: extracting IP addresses from freeform alert descriptions.

IDS/IPS Rules

Snort and Suricata use PCRE (Perl-Compatible Regular Expressions) in detection rules to match malicious patterns in network traffic, such as SQL injection, XSS, and command injection payloads.

DLP Policies

Data Loss Prevention tools use regex to identify sensitive data: SSN patterns, credit card numbers (with Luhn validation), passport numbers, medical record IDs.

Threat Hunting

Threat hunters use regex to search for IOCs across terabytes of logs: matching domain patterns (DGA detection), extracting hashes, finding encoded PowerShell commands, and identifying base64-encoded payloads.
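
As a rough illustration of hunting for encoded payloads, the sketch below finds Base64-looking tokens in a log line and tries to decode them. The function name decode_candidates and the sample command lines are hypothetical. It assumes the decoded payload is plain ASCII text, and it tries UTF-16LE first because that is the encoding PowerShell expects for -EncodedCommand.

```python
import base64
import re

# Base64 candidate: 20+ alphabet characters plus optional padding,
# the same shape this lesson uses for Base64 strings.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")


def decode_candidates(line: str) -> list[str]:
    """Return ASCII text decoded from Base64-looking tokens in a log line."""
    decoded = []
    for token in B64_CANDIDATE.findall(line):
        try:
            raw = base64.b64decode(token, validate=True)
        except ValueError:
            continue  # looked like Base64 but has bad length or padding
        # Try UTF-16LE (PowerShell -EncodedCommand) first, then UTF-8.
        for encoding in ("utf-16-le", "utf-8"):
            try:
                text = raw.decode(encoding)
            except UnicodeDecodeError:
                continue
            # Assumption: real commands are printable ASCII; anything else
            # is treated as a wrong guess at the encoding.
            if text.isascii() and text.isprintable():
                decoded.append(text)
                break
    return decoded


# Hypothetical command line built for the demo.
payload = base64.b64encode("Invoke-Expression".encode("utf-16-le")).decode()
print(decode_candidates(f"powershell.exe -NoP -enc {payload}"))
```

A long run of ordinary letters can also look like Base64, so treat the output as leads to review rather than confirmed detections.
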

Common Patterns Every Analyst Needs

Use Case            Pattern                               What It Matches
IPv4 Address        \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}    192.168.1.100, 10.0.0.1
Domain Name         [a-zA-Z0-9.-]+\.[a-zA-Z]{2,}          evil.com, sub.domain.co.uk
Email Address       [\w.-]+@[\w.-]+\.\w+                  user@company.com
Windows File Path   [A-Z]:\\[\w\\.-]+                     C:\Windows\System32\cmd.exe
MD5 Hash            [a-fA-F0-9]{32}                       d41d8cd98f00b204e9800998ecf8427e
SHA-256 Hash        [a-fA-F0-9]{64}                       e3b0c44298fc1c149afbf4c8996fb924...
Base64 String       [A-Za-z0-9+/]{20,}={0,2}              SW52b2tlLUV4cHJlc3Npb24=
SSN (US)            \d{3}-\d{2}-\d{4}                     123-45-6789
Source: Script House > CLH > Advanced Grep Open Grep Training
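
To show how the patterns in the table above might be applied together, here is a minimal Python sketch that pulls several indicator types out of one log line. The dictionary IOC_PATTERNS, the function extract_iocs, and the sample log line are hypothetical, and \b word boundaries have been added to the IP and hash patterns to reduce partial matches.

```python
import re

# Patterns taken from the table, with \b added to the IP and hash patterns.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "email": re.compile(r"[\w.-]+@[\w.-]+\.\w+"),
}


def extract_iocs(text: str) -> dict[str, list[str]]:
    """Run every pattern over the text and collect the matches by type."""
    return {name: pattern.findall(text) for name, pattern in IOC_PATTERNS.items()}


# Hypothetical log line for the demo.
line = (
    "Blocked mail from attacker@evil.com (10.0.0.1) "
    "attachment md5=d41d8cd98f00b204e9800998ecf8427e"
)
for ioc_type, values in extract_iocs(line).items():
    print(ioc_type, values)
```
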

Regex Syntax Reference

Basic Metacharacters

Character   Meaning                               Example    Matches
.           Any single character                  a.c        "abc", "a1c", "a-c"
^           Start of line                         ^Error     Lines starting with "Error"
$           End of line                           failed$    Lines ending with "failed"
*           Zero or more                          ab*c       "ac", "abc", "abbc"
+           One or more                           ab+c       "abc", "abbc" (not "ac")
?           Zero or one                           colou?r    "color" and "colour"
[]          Character class                       [aeiou]    Any single vowel
[^]         Negated class                         [^0-9]     Any non-digit character
\d          Any digit (0-9)                       \d{3}      "123", "456", "789"
\w          Word character (a-z, A-Z, 0-9, _)     \w+        "hello", "test123"
\s          Whitespace (space, tab, newline)      \s+        One or more spaces/tabs
\b          Word boundary                         \bcat\b    "cat" (not "category")
|           Alternation (OR)                      cat|dog    "cat" or "dog"
()          Grouping / capture                    (ab)+      "ab", "abab", "ababab"

Quantifiers

{n}                      Exactly n times          \d{4} matches "2024"
{n,}                     n or more times          \d{2,} matches "12", "123", "1234"
{n,m}                    Between n and m times    \d{1,3} matches "1", "12", "123"
? (after a quantifier)   Lazy (non-greedy)        .*? matches as little as possible
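
The difference between greedy and lazy matching is easiest to see side by side. The following sketch uses a hypothetical key="value" log line and Python's re module; other engines behave the same way for these constructs.

```python
import re

# Hypothetical log line with several quoted values.
line = 'user="alice" action="login" status="failed"'

# Greedy: .* grabs as much as possible, so the match runs to the LAST quote.
greedy = re.search(r'".*"', line).group()

# Lazy: .*? grabs as little as possible, so the match stops at the NEXT quote.
lazy = re.search(r'".*?"', line).group()

print(greedy)  # "alice" action="login" status="failed"
print(lazy)    # "alice"

# Counted quantifiers: {2,} needs at least two digits, so the lone "7" is skipped.
print(re.findall(r"\d{2,}", "7 42 2024"))  # ['42', '2024']
```

When extracting a single field from a log line, the lazy form (or a negated class such as "[^"]*") is usually the safer choice.
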

Escaping Special Characters

When you want to match a literal special character, precede it with a backslash:

To Match                 Escape As   Context
Literal dot (.)          \.          IP addresses: \d+\.\d+\.\d+\.\d+
Literal backslash (\)    \\          Windows paths: C:\\Windows\\
Literal dollar ($)       \$          Variable names: \$HOME
Literal bracket ([)      \[          Log formats: \[ERROR\]
Literal pipe (|)         \|          CSV-like logs with pipe delimiters
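
When the text to match comes from data (an IOC, a file path, a log tag), escaping by hand is error-prone. In Python, re.escape adds the backslashes for you. The sketch below is one way to use it; the helper name literal_pattern and the sample log lines are hypothetical.

```python
import re


def literal_pattern(indicator: str) -> re.Pattern:
    """Compile a pattern that matches the indicator text exactly."""
    return re.compile(re.escape(indicator))


# Unescaped, [ERROR] is a character class matching ONE of E, R, or O.
unescaped = re.compile("[ERROR]")
escaped = literal_pattern("[ERROR]")

print(bool(unescaped.search("status OK")))                   # True (matches the "O")
print(bool(escaped.search("status OK")))                     # False
print(bool(escaped.search("2024-01-01 [ERROR] disk full")))  # True

# re.escape produces the same escapes listed in the table above.
print(re.escape("$HOME"))          # \$HOME
print(re.escape("C:\\Windows\\"))  # C:\\Windows\\
```
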
SOC Analyst Tip

SIEM-Specific Syntax: Each SIEM supports a slightly different regex dialect. Splunk uses PCRE in | rex commands. Sentinel KQL uses extract() or matches regex. Elastic uses Lucene regex (no lookaheads). Always test your regex on sample data before deploying it as a detection rule -- a bad regex can make the query fail or time out, or silently return zero results.

Security Analysis Patterns

SOC Regex Patterns for Common Threats

# IPv4 Address (with word boundaries to avoid partial matches)
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

# Private IPv4 Ranges (RFC 1918)
\b(10\.\d{1,3}\.\d{1,3}\.\d{1,3}|172\.(1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}|192\.168\.\d{1,3}\.\d{1,3})\b

# Base64 Encoded Strings (potential obfuscation)
[A-Za-z0-9+/]{20,}={0,2}

# Windows Executable Path
# (the hyphen sits last in the class so it is a literal, not a range)
[A-Za-z]:\\[\w\s.\\-]+\.(exe|dll|bat|ps1|vbs|cmd|msi|hta)

# SQL Injection Attempt (in web logs)
('|--|;|/\*|\*/|xp_|union\s+select|or\s+1\s*=\s*1)

# Suspicious PowerShell (encoded commands, download cradles)
(Invoke-|IEX|downloadstring|encodedcommand|-enc\s|-e\s|Net\.WebClient|FromBase64String)

# Potential DGA Domain (high consonant ratio, long subdomain)
[bcdfghjklmnpqrstvwxyz]{4,}\.[a-z]{2,6}\b

# Credit Card Number (basic PAN detection for DLP)
\b[3-6]\d{3}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b

# US Social Security Number
\b\d{3}-\d{2}-\d{4}\b
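
A regex can only check the shape of a credit card number. DLP tools typically pair it with a Luhn checksum to cut false positives. The sketch below combines the PAN pattern from this section with a Luhn check. The function names and the sample text are hypothetical, and 4111 1111 1111 1111 is a widely used test number, not a real card.

```python
import re

# Basic PAN pattern from this section: 16 digits in groups of four.
PAN_PATTERN = re.compile(r"\b[3-6]\d{3}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex matches that also pass the Luhn check."""
    return [m.group() for m in PAN_PATTERN.finditer(text) if luhn_valid(m.group())]


# Hypothetical text: the first number passes Luhn, the second does not.
sample = "order 4111-1111-1111-1111 ref 4111-1111-1111-1112"
print(find_card_numbers(sample))
```
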

SIEM Query Examples

Platform: Splunk
Query: index=auth action=failure | regex src_ip!="^(10\.|172\.(1[6-9]|2\d|3[01])\.|192\.168\.)"
Purpose: Failed logins from external (non-RFC1918) IPs

Platform: Splunk
Query: index=proxy | rex field=url "(?<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})" | stats count by domain
Purpose: Extract and count domains from proxy URLs

Platform: Sentinel
Query: SecurityEvent | where EventID == 4688 | where CommandLine matches regex @"(?i)(invoke-|iex|downloadstring|frombase64)"
Purpose: Detect suspicious PowerShell execution

Platform: grep
Query: grep -E '(;|&&|\|\|)\s*(cat|ls|id|whoami|wget|curl)' access.log
Purpose: Find command injection attempts in web logs

Platform: grep
Query: grep -oP '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' auth.log | sort | uniq -c | sort -rn | head
Purpose: Top source IPs in authentication logs
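
For analysts working outside a SIEM, the "extract and count domains" query can be approximated in a few lines of Python. This is a sketch rather than a drop-in replacement: the function count_domains and the sample URLs are hypothetical. Note that Python's re module writes named groups as (?P<name>...) instead of the (?<name>...) form used in the Splunk query.

```python
import re
from collections import Counter

# Same domain pattern as the Splunk rex example, in Python's named-group syntax.
DOMAIN = re.compile(r"(?P<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})")


def count_domains(urls):
    """Count the first domain-like match in each URL (like stats count by domain)."""
    counts = Counter()
    for url in urls:
        match = DOMAIN.search(url)
        if match:
            counts[match.group("domain")] += 1
    return counts


# Hypothetical proxy log URLs.
urls = [
    "http://evil.com/payload.exe",
    "https://sub.domain.co.uk/login",
    "http://evil.com/index.html",
]
for domain, count in count_domains(urls).most_common():
    print(count, domain)
```
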

Regex Anti-Patterns (Common Mistakes)

Catastrophic Backtracking

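Pattern (a+)+$ causes exponential processing time on a long string of "a"s followed by a character that prevents the match (for example "aaaaaaaaaaaaaaaaaaaa!"), because the engine retries every way of splitting the "a"s between the inner and outer quantifier before giving up. This can stall your SIEM search or hang your workstation. Avoid nested quantifiers, and always test regex performance on large datasets before deploying.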
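
The effect can be observed safely with short inputs. The sketch below times the nested pattern against a flat rewrite, a+$, which accepts the same strings. The helper time_search is hypothetical and the timings depend on the machine, so no numbers are quoted here. Keep n small: each extra "a" roughly doubles the work for the nested pattern in a backtracking engine such as Python's re.

```python
import re
import time

NESTED = re.compile(r"(a+)+$")  # nested quantifiers: prone to backtracking
FLAT = re.compile(r"a+$")       # accepts the same strings without nesting


def time_search(pattern, text):
    """Return (match, seconds) for a single search call."""
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start


for n in (8, 12, 16):
    # The trailing "b" guarantees failure, which forces the backtracking.
    text = "a" * n + "b"
    _, nested_seconds = time_search(NESTED, text)
    _, flat_seconds = time_search(FLAT, text)
    print(f"n={n:2d}  nested={nested_seconds:.6f}s  flat={flat_seconds:.6f}s")
```
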

Overly Broad Matching

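Pattern .* matches everything. Using \d+\.\d+\.\d+\.\d+ matches "1234.5678.9012.3456", which is not a valid IP. Limit each octet to \d{1,3} and wrap the pattern in \b word boundaries for precision. Even then the pattern accepts out-of-range octets such as "999.999.999.999", so validate the value after extraction when accuracy matters.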
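
One way to close the remaining gap is to let regex find candidates and let a parser decide whether they are real addresses. The sketch below uses Python's standard ipaddress module for that second step. The function extract_valid_ipv4 and the sample text are hypothetical.

```python
import ipaddress
import re

# Bounded octets plus word boundaries: finds IP-shaped candidates only.
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def extract_valid_ipv4(text: str) -> list[ipaddress.IPv4Address]:
    """Return only the candidates that parse as real IPv4 addresses."""
    valid = []
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            valid.append(ipaddress.IPv4Address(candidate))
        except ipaddress.AddressValueError:
            continue  # right shape, but an octet is out of range
    return valid


# Hypothetical log fragment with good, bad, and not-even-close values.
text = "src=192.168.1.100 dst=8.8.8.8 bogus=999.1.1.1 ver=1234.5678.9012.3456"
for ip in extract_valid_ipv4(text):
    print(ip, "private" if ip.is_private else "public")
```

The parsed objects also expose is_private, which can stand in for the RFC 1918 regex when the data is already in Python.
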

Forgetting to Escape

Pattern 192.168.1.1 matches "192x168y1z1" because unescaped dots match any character. Always escape literal dots: 192\.168\.1\.1

SOC Analyst Tip

Sigma Rules: Sigma is a generic signature format for SIEM detections. Sigma rules can use regex (through the re modifier) alongside simpler wildcard matching, and they can be converted to Splunk SPL, Sentinel KQL, Elastic queries, and more. Regex skills learned once carry over to detection rules written for any of these platforms. See github.com/SigmaHQ/sigma for the community rule repository.

Knowledge Check

1. What does the regex metacharacter \d match?

2. Which pattern matches "color" and "colour"?

3. What does ^ mean at the start of a regex?

4. Which quantifier means "exactly 3 times"?

5. To match a literal dot in regex, you should use:

6. What Splunk command uses regex to extract fields from log data?

7. What is "catastrophic backtracking" in regex?