Pattern matching for log analysis and threat hunting
Regular expressions (regex) are powerful pattern matching tools essential for SOC analysts. They are used in SIEM queries (Splunk, Sentinel, Elastic), log analysis, threat hunting, IDS/IPS rules (Snort, Suricata), DLP policies, and parsing security data. A SOC analyst who cannot write regex is limited to pre-built queries -- mastering regex unlocks the ability to hunt for any pattern in any log source.
Splunk's rex command, Sentinel's extract(), Elastic's grok patterns -- all use regex to parse unstructured log data into searchable fields. Example: extracting IP addresses from freeform alert descriptions.
Snort and Suricata use PCRE (Perl-Compatible Regular Expressions) in detection rules to match malicious patterns in network traffic, such as SQL injection, XSS, and command injection payloads.
Data Loss Prevention tools use regex to identify sensitive data: SSN patterns, credit card numbers (regex finds the candidates, a Luhn checksum then confirms them), passport numbers, and medical record IDs.
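A regex can only find strings shaped like card numbers; the Luhn checksum has to be computed as a separate step. The following is a minimal sketch of that two-stage check, not how any particular DLP product works. The candidate shape (13 to 19 digits with optional single spaces or hyphens) and the names `CARD_CANDIDATE`, `luhn_ok`, and `find_card_numbers` are assumptions made for illustration.

```python
import re

# Candidate card numbers: 13-19 digits, optionally separated by single
# spaces or hyphens. This shape is an assumption for illustration;
# real DLP policies usually use stricter, per-issuer patterns.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk the digits right to left, doubling every second one.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return regex candidates that also pass the Luhn check, digits only."""
    hits = []
    for m in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits
```

Running the regex first keeps the arithmetic off the bulk of the data, and the checksum then discards most digit runs that only look like card numbers.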
Threat hunting means searching for IOCs across terabytes of logs. Regex lets you match domain patterns (DGA detection), extract hashes, find encoded PowerShell commands, and identify base64-encoded payloads.
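As one illustration of hunting for encoded PowerShell, the sketch below looks for an encoded-command switch followed by a base64 blob and decodes it as UTF-16LE, the encoding PowerShell expects for `-EncodedCommand`. The switch spellings covered (`-e`, `-ec`, `-enc`, `-encodedcommand`) are an assumed subset of what PowerShell accepts, and `ENCODED_PS` and `decode_encoded_commands` are hypothetical names.

```python
import base64
import re

# Encoded-command switch followed by a base64 blob of at least 20 characters.
# Only a few common spellings of the switch are covered (an assumption).
ENCODED_PS = re.compile(
    r"-(?:encodedcommand|enc|ec|e)\s+([A-Za-z0-9+/]{20,}={0,2})",
    re.IGNORECASE,
)


def decode_encoded_commands(command_line: str) -> list[str]:
    """Return the decoded text of every encoded command found in the line."""
    decoded = []
    for m in ENCODED_PS.finditer(command_line):
        try:
            raw = base64.b64decode(m.group(1), validate=True)
            decoded.append(raw.decode("utf-16-le"))
        except (ValueError, UnicodeDecodeError):
            # Not valid base64 or not UTF-16LE text: skip this candidate.
            continue
    return decoded
```

The decoded text can then be searched with a second pattern, for example for download or execution keywords.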
| Use Case | Pattern | What It Matches |
|---|---|---|
| IPv4 Address | \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} | 192.168.1.100, 10.0.0.1 |
| Domain Name | [a-zA-Z0-9.-]+\.[a-zA-Z]{2,} | evil.com, sub.domain.co.uk |
| Email Address | [\w.-]+@[\w.-]+\.\w+ | user@company.com |
| Windows File Path | [A-Z]:\\[\w\\.-]+ | C:\Windows\System32\cmd.exe |
| MD5 Hash | [a-fA-F0-9]{32} | d41d8cd98f00b204e9800998ecf8427e |
| SHA-256 Hash | [a-fA-F0-9]{64} | e3b0c44298fc1c149afbf4c8996fb924... |
| Base64 String | [A-Za-z0-9+/]{20,}={0,2} | SW52b2tlLUV4cHJlc3Npb24= |
| SSN (US) | \d{3}-\d{2}-\d{4} | 123-45-6789 |
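The patterns in the table above can be combined into a small extractor. This is a minimal sketch using Python's `re` module; the `\b` word boundaries around the IP and hash patterns are an addition to the table's patterns (so a 64-character hash is not also reported as an MD5), and `IOC_PATTERNS` and `extract_iocs` are illustrative names.

```python
import re

# Patterns taken from the table; \b boundaries added for the IP and hash rows.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "email": re.compile(r"[\w.-]+@[\w.-]+\.\w+"),
}


def extract_iocs(text: str) -> dict[str, list[str]]:
    """Return every match for each IOC type found in the text."""
    return {name: pattern.findall(text) for name, pattern in IOC_PATTERNS.items()}


# Example with a made-up alert description.
alert = "Alert: 192.168.1.100 sent d41d8cd98f00b204e9800998ecf8427e to user@company.com"
print(extract_iocs(alert))
```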
| Character | Meaning | Example | Matches |
|---|---|---|---|
| . | Any single character | a.c | "abc", "a1c", "a-c" |
| ^ | Start of line | ^Error | Lines starting with "Error" |
| $ | End of line | failed$ | Lines ending with "failed" |
| * | Zero or more | ab*c | "ac", "abc", "abbc" |
| + | One or more | ab+c | "abc", "abbc" (not "ac") |
| ? | Zero or one | colou?r | "color" and "colour" |
| [] | Character class | [aeiou] | Any single vowel |
| [^] | Negated class | [^0-9] | Any non-digit character |
| \d | Any digit (0-9) | \d{3} | "123", "456", "789" |
| \w | Word character (a-z, A-Z, 0-9, _) | \w+ | "hello", "test123" |
| \s | Whitespace (space, tab, newline) | \s+ | One or more spaces/tabs |
| \b | Word boundary | \bcat\b | "cat" (not "category") |
| | | Alternation (OR) | cat|dog | "cat" or "dog" |
| () | Grouping / capture | (ab)+ | "ab", "abab", "ababab" |
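A few rows of the table can be checked directly in Python. This is a small sketch; `found` is a hypothetical helper that lists every match of a pattern in a string.

```python
import re


def found(pattern: str, text: str) -> list[str]:
    """Return every non-overlapping match of pattern in text."""
    return [m.group() for m in re.finditer(pattern, text)]


print(found(r"colou?r", "color colour colr"))    # ? makes the "u" optional
print(found(r"ab+c", "ac abc abbc"))             # + needs at least one "b"
print(found(r"ab*c", "ac abc abbc"))             # * also allows zero "b"s
print(found(r"\bcat\b", "cat category concat"))  # \b keeps "cat" a whole word
print(found(r"[^0-9]+", "abc123def"))            # negated class skips the digits
print(found(r"cat|dog", "cat, dog, bird"))       # alternation matches either word
```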
When you want to match a literal special character, precede it with a backslash:
| To Match | Escape As | Context |
|---|---|---|
| Literal dot (.) | \. | IP addresses: \d+\.\d+\.\d+\.\d+ |
| Literal backslash (\) | \\ | Windows paths: C:\\Windows\\ |
| Literal dollar ($) | \$ | Variable names: \$HOME |
| Literal bracket ([) | \[ | Log formats: \[ERROR\] |
| Literal pipe (|) | \| | CSV-like logs with pipe delimiters |
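When the literal text comes from data, such as an IOC list, escaping by hand is error-prone. In Python, `re.escape` adds the backslashes for you. The sketch below builds one pattern from several literal strings; `literal_pattern` is a hypothetical helper name and the sample indicators are made up.

```python
import re


def literal_pattern(indicators: list[str]) -> re.Pattern:
    """Build one regex that matches any of the given strings literally."""
    # re.escape backslash-escapes regex metacharacters such as . \ $ [ and |
    escaped = [re.escape(item) for item in indicators]
    return re.compile("|".join(escaped))


pattern = literal_pattern(["192.168.1.1", r"C:\Windows\Temp", "[ERROR]"])
print(re.escape("192.168.1.1"))             # the escaped form of the IP
print(bool(pattern.search("192x168y1z1")))  # the dots are now literal
```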
SIEM-Specific Syntax: Each SIEM has slightly different regex syntax. Splunk uses PCRE in | rex commands. Sentinel KQL uses extract() or matches regex. Elastic uses Lucene regex (no lookaheads). Always test your regex on sample data before deploying as a detection rule -- a bad regex can crash your SIEM query or return zero results.
| Platform | Query | Purpose |
|---|---|---|
| Splunk | index=auth action=failure | regex src_ip!="^(10\.|172\.(1[6-9]|2\d|3[01])\.|192\.168\.)" | Failed logins from external (non-RFC1918) IPs |
| Splunk | index=proxy | rex field=url "(?<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})" | stats count by domain | Extract and count domains from proxy URLs |
| Sentinel | SecurityEvent | where EventID == 4688 | where CommandLine matches regex @"(?i)(invoke-|iex|downloadstring|frombase64)" | Detect suspicious PowerShell execution |
| grep | grep -E '(;|&&|\|\|)\s*(cat|ls|id|whoami|wget|curl)' access.log | Find command injection attempts in web logs |
| grep | grep -oP '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' auth.log | sort | uniq -c | sort -rn | head | Top source IPs in authentication logs |
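For ad hoc analysis outside a SIEM, the same ideas carry over to a short script. The sketch below combines the last grep row (count source IPs) with the RFC 1918 prefix pattern from the first Splunk row to count only external addresses. The function name `top_external_ips` and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

IPV4 = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")
# Same RFC 1918 prefix test as the first Splunk row.
PRIVATE = re.compile(r"^(10\.|172\.(1[6-9]|2\d|3[01])\.|192\.168\.)")


def top_external_ips(lines, limit=10):
    """Count non-private IPv4 addresses across log lines, most frequent first."""
    counts = Counter()
    for line in lines:
        for ip in IPV4.findall(line):
            if not PRIVATE.match(ip):
                counts[ip] += 1
    return counts.most_common(limit)


# Hypothetical auth log lines, for demonstration only.
SAMPLE_LINES = [
    "Failed password for root from 203.0.113.7 port 22",
    "Failed password for admin from 203.0.113.7 port 22",
    "Failed password for bob from 192.168.1.20 port 22",
    "Failed password for eve from 198.51.100.9 port 22",
    "Accepted password for al from 172.16.4.2 port 22",
]
print(top_external_ips(SAMPLE_LINES))
```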
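Pattern (a+)+$ on a long string of "a"s followed by a character that breaks the match (for example "aaaaaaaaaaaaaaaaaaaa!") causes exponential processing time in backtracking engines, because the engine tries every way of splitting the "a"s between the inner and outer quantifier before giving up. This can crash your SIEM or lag your workstation. Always test regex performance on large datasets before deploying.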
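The effect is easy to observe with small inputs. The sketch below times the nested pattern against an equivalent flat one in Python's backtracking `re` engine; the input sizes are kept small on purpose, and the exact timings will vary by machine. `NESTED`, `FLAT`, and `time_search` are illustrative names.

```python
import re
import time

NESTED = re.compile(r"(a+)+$")  # nested quantifiers: prone to backtracking
FLAT = re.compile(r"a+$")       # accepts the same strings without nesting


def time_search(pattern, text):
    """Return (match result, elapsed seconds) for one search."""
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start


for n in (14, 16, 18, 20):
    text = "a" * n + "!"  # the trailing "!" forces the match to fail
    _, slow = time_search(NESTED, text)
    _, fast = time_search(FLAT, text)
    print(f"n={n:2d}  nested={slow:.4f}s  flat={fast:.6f}s")
```

The nested timing should grow by roughly a constant factor with each step in `n`, while the flat pattern stays fast. Do not raise `n` much further without a timeout.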
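Pattern .* matches everything. Likewise, \d+\.\d+\.\d+\.\d+ matches "1234.5678.9012.3456", which is not a valid IP, and adding \b alone does not fix that because \d+ is unbounded. Limit each octet and anchor both ends with word boundaries, for example \b\d{1,3}(\.\d{1,3}){3}\b. Even this still accepts out-of-range octets such as 999.1.1.1, so validate the range separately when precision matters.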
Pattern 192.168.1.1 matches "192x168y1z1" because unescaped dots match any character. Always escape literal dots: 192\.168\.1\.1
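Both IP pitfalls can be demonstrated in a few lines. The sketch below is one possible approach: a bounded, anchored regex finds candidates, and Python's standard `ipaddress` module rejects anything that is not a real IPv4 address. `IPV4_CANDIDATE` and `valid_ipv4s` are hypothetical names.

```python
import ipaddress
import re

# Unescaped dots match any character; escaped dots match only ".".
print(bool(re.search(r"192.168.1.1", "192x168y1z1")))     # True
print(bool(re.search(r"192\.168\.1\.1", "192x168y1z1")))  # False

# Bounded octets plus word boundaries give reasonable candidates.
IPV4_CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def valid_ipv4s(text: str) -> list[str]:
    """Return regex candidates that ipaddress accepts as IPv4 addresses."""
    hits = []
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # for example, an octet above 255
        hits.append(candidate)
    return hits
```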
Sigma Rules: Sigma is a generic signature format for SIEM detections. Sigma rules can use regex (through the re modifier) alongside simpler string matching, and can be converted to Splunk SPL, Sentinel KQL, Elastic queries, and more. Learning regex once helps you write detection rules that carry across platforms. See github.com/SigmaHQ/sigma for the community rule repository.
1. What does the regex metacharacter \d match?
2. Which pattern matches "color" and "colour"?
3. What does ^ mean at the start of a regex?
4. Which quantifier means "exactly 3 times"?
5. To match a literal dot in regex, you should use:
6. What Splunk command uses regex to extract fields from log data?
7. What is "catastrophic backtracking" in regex?