Basic Regular Expressions

Pattern matching for log analysis and threat hunting

Week 4 TOPIC

Regular Expressions in Security

Regular expressions (regex) are powerful pattern matching tools essential for SOC analysts. They are used in SIEM queries (Splunk, Sentinel, Elastic), log analysis, threat hunting, IDS/IPS rules (Snort, Suricata), DLP policies, and parsing security data. A SOC analyst who cannot write regex is limited to pre-built queries -- mastering regex unlocks the ability to hunt for any pattern in any log source.

Where Regex Appears in SOC Work

SIEM Queries

Splunk's rex command, Sentinel's extract(), Elastic's grok patterns -- all use regex to parse unstructured log data into searchable fields. Example: extracting IP addresses from freeform alert descriptions.
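Field extraction of this kind can be sketched in Python with named capture groups, which play roughly the same role as extracted fields in a SIEM. The alert text, the field names (`src_ip`, `dest_ip`, `dest_port`), and the `extract_fields` helper are assumptions made up for illustration, not the output of any particular product.

```python
import re

# Hypothetical freeform alert text; the layout is an assumption for illustration.
ALERT = "Blocked outbound connection from 10.1.2.3 to 203.0.113.45 on port 443"

# Each named group becomes one extracted field.
PATTERN = re.compile(
    r"from (?P<src_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"to (?P<dest_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"on port (?P<dest_port>\d{1,5})"
)

def extract_fields(text: str) -> dict:
    """Return the named fields found in text, or an empty dict if nothing matches."""
    match = PATTERN.search(text)
    return match.groupdict() if match else {}

print(extract_fields(ALERT))
```

The same named-group syntax, written as `(?<name>...)`, is what the Splunk rex example later in this lesson uses.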

IDS/IPS Rules

Snort and Suricata use PCRE (Perl-Compatible Regular Expressions) in detection rules to match malicious patterns in network traffic, such as SQL injection, XSS, and command injection payloads.

DLP Policies

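Data Loss Prevention tools use regex to identify sensitive data: SSN patterns, credit card numbers, passport numbers, medical record IDs. Regex only finds candidates by shape. For credit card numbers, DLP tools typically pair the pattern with a Luhn checksum to cut false positives.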
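A regex cannot compute a checksum, so card-number detection is usually a two-step process: a pattern finds candidates, then code validates them. The sketch below assumes 16-digit numbers in groups of four. The helper names `luhn_valid` and `find_card_numbers` are hypothetical, and real DLP products implement this in their own way.

```python
import re

# Candidate 16-digit PANs, allowing a space or hyphen between groups of four.
PAN_CANDIDATE = re.compile(r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b")

def luhn_valid(number: str) -> bool:
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Return regex candidates that also pass the Luhn check."""
    return [m.group() for m in PAN_CANDIDATE.finditer(text) if luhn_valid(m.group())]

# 4111 1111 1111 1111 is a widely used test number; the second value fails Luhn.
sample = "order 4111 1111 1111 1111 ref 1234-5678-9012-3456"
print(find_card_numbers(sample))  # ['4111 1111 1111 1111']
```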

Threat Hunting

Hunt for IOCs across terabytes of logs: match domain patterns (DGA detection), extract hashes, find encoded PowerShell commands, and identify Base64-encoded payloads.
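Once a pattern has found a Base64-looking string, the next step is usually to decode it and see what it hides. The following is a minimal sketch: the `decode_candidates` helper and the sample command line are assumptions for illustration, and the "printable text" check is a simple heuristic rather than a reliable classifier.

```python
import base64
import binascii
import re

BASE64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")

def decode_candidates(text: str) -> list[str]:
    """Decode Base64-looking substrings and keep those that yield printable text."""
    decoded = []
    for match in BASE64_CANDIDATE.finditer(text):
        try:
            raw = base64.b64decode(match.group(), validate=True)
        except (binascii.Error, ValueError):
            continue  # wrong length or padding, so not real Base64
        # PowerShell -EncodedCommand uses UTF-16LE; other payloads are often UTF-8.
        for encoding in ("utf-8", "utf-16-le"):
            try:
                candidate = raw.decode(encoding)
            except UnicodeDecodeError:
                continue
            if candidate.isprintable():
                decoded.append(candidate)
                break
    return decoded

# Hypothetical log fragment containing a Base64 string.
line = "cmd=powershell -enc SW52b2tlLUV4cHJlc3Npb24="
print(decode_candidates(line))  # ['Invoke-Expression']
```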

Common Patterns Every Analyst Needs

Use Case	Pattern	What It Matches
IPv4 Address	`\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}`	192.168.1.100, 10.0.0.1
Domain Name	`[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`	evil.com, sub.domain.co.uk
Email Address	`[\w.-]+@[\w.-]+\.\w+`	user@company.com
Windows File Path	`[A-Z]:\\[\w\\.-]+`	C:\Windows\System32\cmd.exe
MD5 Hash	`[a-fA-F0-9]{32}`	d41d8cd98f00b204e9800998ecf8427e
SHA-256 Hash	`[a-fA-F0-9]{64}`	e3b0c44298fc1c149afbf4c8996fb924...
Base64 String	`[A-Za-z0-9+/]{20,}={0,2}`	SW52b2tlLUV4cHJlc3Npb24=
SSN (US)	`\d{3}-\d{2}-\d{4}`	123-45-6789

Source: Script House > CLH > Advanced Grep Open Grep Training
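As a rough illustration of how the table's patterns combine, the sketch below runs several of them over one log line. The log line, the `IOC_PATTERNS` name, and the `extract_iocs` helper are hypothetical. The `\b` boundaries around the IP and hash patterns are an addition that keeps the MD5 pattern from matching inside a longer hash.

```python
import re

# Patterns from the table, with \b added where partial matches would be misleading.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "email": re.compile(r"[\w.-]+@[\w.-]+\.\w+"),
}

def extract_iocs(text: str) -> dict:
    """Return every match for each pattern, keyed by pattern name."""
    return {name: pattern.findall(text) for name, pattern in IOC_PATTERNS.items()}

# Hypothetical log line built from the table's example values.
line = ("2024-05-01 alert src=192.168.1.100 user=user@company.com "
        "hash=d41d8cd98f00b204e9800998ecf8427e")

for name, matches in extract_iocs(line).items():
    print(name, matches)
```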

Regex Syntax Reference

Basic Metacharacters

Character	Meaning	Example	Matches
`.`	Any single character	`a.c`	"abc", "a1c", "a-c"
`^`	Start of line	`^Error`	Lines starting with "Error"
`$`	End of line	`failed$`	Lines ending with "failed"
`*`	Zero or more	`ab*c`	"ac", "abc", "abbc"
`+`	One or more	`ab+c`	"abc", "abbc" (not "ac")
`?`	Zero or one	`colou?r`	"color" and "colour"
`[]`	Character class	`[aeiou]`	Any single vowel
`[^]`	Negated class	`[^0-9]`	Any non-digit character
`\d`	Any digit (0-9)	`\d{3}`	"123", "456", "789"
`\w`	Word character (a-z, A-Z, 0-9, _)	`\w+`	"hello", "test123"
`\s`	Whitespace (space, tab, newline)	`\s+`	One or more spaces/tabs
`\b`	Word boundary	`\bcat\b`	"cat" (not "category")
`|`	Alternation (OR)	`cat|dog`	"cat" or "dog"
`()`	Grouping / capture	`(ab)+`	"ab", "abab", "ababab"

Quantifiers

Quantifier	Meaning	Example
`{n}`	Exactly n times	`\d{4}` matches "2024"
`{n,}`	n or more times	`\d{2,}` matches "12", "123", "1234"
`{n,m}`	Between n and m times	`\d{1,3}` matches "1", "12", "123"
`?`	Lazy (non-greedy)	`.*?` matches as little as possible
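The difference between greedy and lazy matching is easiest to see side by side. This short sketch uses Python's `re` module on a made-up key/value line; the sample strings are assumptions for illustration.

```python
import re

line = 'user="alice" role="admin"'

# Greedy: .* runs to the last quote on the line.
print(re.findall(r'".*"', line))    # ['"alice" role="admin"']

# Lazy: .*? stops at the first closing quote.
print(re.findall(r'".*?"', line))   # ['"alice"', '"admin"']

# {n} with word boundaries: exactly four digits, no more and no fewer.
print(re.findall(r"\b\d{4}\b", "port 443 year 2024 pid 12345"))  # ['2024']
```

When extracting quoted or delimited values from logs, the lazy form is usually the one you want.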

Escaping Special Characters

When you want to match a literal special character, precede it with a backslash:

To Match	Escape As	Context
Literal dot (.)	`\.`	IP addresses: `\d+\.\d+\.\d+\.\d+`
Literal backslash (\)	`\\`	Windows paths: `C:\\Windows\\`
Literal dollar ($)	`\$`	Variable names: `\$HOME`
Literal bracket ([)	`\[`	Log formats: `\[ERROR\]`
Literal pipe (|)	`\|`	CSV-like logs with pipe delimiters
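When the text to match comes from an indicator list rather than from your own hand, escaping by hand is error-prone. In Python, `re.escape` does it for you. The indicator below is a hypothetical value used only for illustration.

```python
import re

indicator = "evil.example.com"  # hypothetical indicator, for illustration only

unescaped = re.compile(indicator)            # dots match any character
escaped = re.compile(re.escape(indicator))   # dots match only literal dots

lookalike = "evilXexampleYcom"
print(bool(unescaped.search(lookalike)))  # True  (false positive)
print(bool(escaped.search(lookalike)))    # False
print(bool(escaped.search("GET http://evil.example.com/payload")))  # True

# re.escape also handles brackets and other metacharacters.
print(re.escape("[ERROR]"))  # \[ERROR\]
```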

SOC Analyst Tip

SIEM-Specific Syntax: Each SIEM has slightly different regex syntax. Splunk uses PCRE in | rex commands. Sentinel KQL uses extract() or matches regex. Elastic uses Lucene regex (no lookaheads). Always test your regex on sample data before deploying as a detection rule -- a bad regex can crash your SIEM query or return zero results.

Security Analysis Patterns

SOC Regex Patterns for Common Threats

# IPv4 Address (with word boundaries to avoid partial matches)
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

# Private IPv4 Ranges (RFC 1918)
\b(10\.\d{1,3}\.\d{1,3}\.\d{1,3}|172\.(1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}|192\.168\.\d{1,3}\.\d{1,3})\b

# Base64 Encoded Strings (potential obfuscation)
[A-Za-z0-9+/]{20,}={0,2}

# Windows Executable Path
[A-Za-z]:\\[\w\s.\\-]+\.(exe|dll|bat|ps1|vbs|cmd|msi|hta)

# SQL Injection Attempt (in web logs)
('|--|;|/\*|\*/|xp_|union\s+select|or\s+1\s*=\s*1)

# Suspicious PowerShell (encoded commands, download cradles)
(Invoke-|IEX|downloadstring|encodedcommand|-enc\s|-e\s|Net\.WebClient|FromBase64String)

# Potential DGA Domain (high consonant ratio, long subdomain)
[bcdfghjklmnpqrstvwxyz]{4,}\.[a-z]{2,6}\b

# Credit Card Number (basic PAN detection for DLP)
\b[3-6]\d{3}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b

# US Social Security Number
\b\d{3}-\d{2}-\d{4}\b
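One way to put patterns like these to work is to tag each log line with the names of the patterns it matches. The sketch below uses a subset of the patterns above and leaves out the noisiest SQL injection alternatives (single quote, `--`, `;`). The `THREAT_PATTERNS` name, the `tag_line` helper, and the sample lines are all hypothetical.

```python
import re

# A subset of the patterns above, compiled once and keyed by a short label.
THREAT_PATTERNS = {
    "sqli": re.compile(r"union\s+select|or\s+1\s*=\s*1|/\*|\*/|xp_", re.IGNORECASE),
    "powershell": re.compile(
        r"Invoke-|IEX|downloadstring|encodedcommand|-enc\s|Net\.WebClient|FromBase64String",
        re.IGNORECASE,
    ),
    "private_ip": re.compile(
        r"\b(?:10\.\d{1,3}|172\.(?:1[6-9]|2\d|3[01])|192\.168)\.\d{1,3}\.\d{1,3}\b"
    ),
}

def tag_line(line: str) -> list[str]:
    """Return the sorted labels of every pattern that matches the line."""
    return sorted(name for name, pattern in THREAT_PATTERNS.items() if pattern.search(line))

# Hypothetical log lines for illustration.
samples = [
    "GET /item?id=1 UNION SELECT password FROM users",
    "powershell.exe -enc SQBFAFgA from 10.0.0.5",
    "connection from 8.8.8.8 accepted",
]
for line in samples:
    print(tag_line(line), line)
```

Short case-insensitive tokens such as `IEX` will also match inside unrelated words, so expect to tune patterns like these against your own data.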

SIEM Query Examples

Platform	Query	Purpose
Splunk	`index=auth action=failure | regex src_ip!="^(10\.|172\.(1[6-9]|2\d|3[01])\.|192\.168\.)"`	Failed logins from external (non-RFC1918) IPs
Splunk	`index=proxy | rex field=url "(?<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})" | stats count by domain`	Extract and count domains from proxy URLs
Sentinel	`SecurityEvent | where EventID == 4688 | where CommandLine matches regex @"(?i)(invoke-|iex|downloadstring|frombase64)"`	Detect suspicious PowerShell execution
grep	`grep -E '(;|&&|\|\|)\s*(cat|ls|id|whoami|wget|curl)' access.log`	Find command injection attempts in web logs
grep	`grep -oP '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' auth.log | sort | uniq -c | sort -rn | head`	Top source IPs in authentication logs
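The last grep pipeline (extract, sort, count, rank) has a close equivalent in Python using `collections.Counter`. This is a sketch only: the `top_source_ips` helper and the sample lines are hypothetical, and in practice you would pass it an open log file instead of a list.

```python
import re
from collections import Counter

IPV4 = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

def top_source_ips(lines, limit=10):
    """Count every IPv4-looking string in the lines and return the most common."""
    counts = Counter()
    for line in lines:
        counts.update(IPV4.findall(line))
    return counts.most_common(limit)

# Hypothetical auth log lines for illustration.
sample_log = [
    "Failed password for root from 203.0.113.7 port 22 ssh2",
    "Failed password for admin from 203.0.113.7 port 22 ssh2",
    "Accepted password for alice from 198.51.100.20 port 22 ssh2",
]
print(top_source_ips(sample_log))  # [('203.0.113.7', 2), ('198.51.100.20', 1)]
```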

Regex Anti-Patterns (Common Mistakes)

Catastrophic Backtracking

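Pattern (a+)+$ causes exponential processing time when it runs against a long string of "a"s followed by a character that prevents a match (for example, "aaaaaaaaaaaaaaaaaaaaaaaa!"). In a backtracking engine such as PCRE, each additional "a" roughly doubles the work. This can time out your SIEM query or lag your workstation. Avoid nested quantifiers, and always test regex performance on large datasets before deploying.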
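The effect is easy to reproduce on a small scale. The sketch below times the nested pattern against a flat rewrite, `a+$`, that accepts the same strings. Input lengths are kept short on purpose, and the timings depend on your machine, so no expected values are shown. The `time_search` helper is a hypothetical name.

```python
import re
import time

EVIL = re.compile(r"(a+)+$")  # nested quantifier: many ways to split the same run of "a"s
SAFE = re.compile(r"a+$")     # one quantifier, accepts the same strings

def time_search(pattern, text):
    """Return the seconds taken by a single pattern.search call."""
    start = time.perf_counter()
    pattern.search(text)
    return time.perf_counter() - start

# The trailing "!" prevents a match, which forces the engine to try every split.
for n in (10, 14, 18):
    text = "a" * n + "!"
    print(f"n={n:2d}  nested={time_search(EVIL, text):.5f}s  flat={time_search(SAFE, text):.5f}s")
```

Raising n by a few characters at a time shows the nested pattern's time climbing steeply while the flat pattern stays near zero.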

Overly Broad Matching

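Pattern .* matches everything. Pattern \d+\.\d+\.\d+\.\d+ matches "1234.5678.9012.3456", which is not a valid IP. Use bounded quantifiers together with \b word boundaries for precision: \b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b. This still accepts out-of-range octets such as 999, so validate the values separately when it matters.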
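A common way to get precise IP matching is to let a bounded regex find candidates and then validate each one with real parsing. The sketch below uses Python's standard `ipaddress` module for the second step. The sample text and the `valid_ipv4s` helper are hypothetical.

```python
import ipaddress
import re

IPV4_CANDIDATE = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

def valid_ipv4s(text: str) -> list:
    """Return IPv4Address objects for candidates whose octets are all in range."""
    results = []
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            results.append(ipaddress.IPv4Address(candidate))
        except ipaddress.AddressValueError:
            continue  # shaped like an IP but not a valid address, e.g. 999.1.1.1
    return results

# Hypothetical text with two real addresses and two look-alikes.
text = "src=192.168.1.100 dst=8.8.8.8 bogus=999.1.1.1 version=1234.5678.9012.3456"
for ip in valid_ipv4s(text):
    print(ip, "private" if ip.is_private else "public")
```

The bounded pattern rejects "1234.5678.9012.3456" on its own, and the parsing step rejects "999.1.1.1".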

Forgetting to Escape

Pattern 192.168.1.1 matches "192x168y1z1" because unescaped dots match any character. Always escape literal dots: 192\.168\.1\.1

SOC Analyst Tip

Sigma Rules: Sigma is a generic signature format for SIEM detections. Sigma rules rely mostly on string matching with wildcards and modifiers, support regex through the re modifier, and can be converted to Splunk SPL, Sentinel KQL, Elastic queries, and more. Learning regex once helps you write detection rules that carry across platforms, although regex support varies by backend. See github.com/SigmaHQ/sigma for the community rule repository.

