RegEx Primer
Finding Patterns in Logs (or other files)
What is Regex?
Regular Expressions (regex) are sequences of characters that form search patterns. They are used for matching, searching, and manipulating text, making them incredibly useful for analyzing data, detecting patterns, and automating tasks. In cybersecurity, regex can help identify sensitive information, extract useful data from logs, and detect anomalies.
Basic Concepts of Regex
- Literal Characters: Match exactly what you type (e.g., `abc` matches "abc").
- Metacharacters: Special characters with unique functions:
  - `.`: Matches any character except a newline.
  - `^`: Anchors the match to the start of a line.
  - `$`: Anchors the match to the end of a line.
  - `\`: Escapes a metacharacter to treat it as a literal.
- Character Classes: Define a set of characters:
  - `[0-9]` or `\d`: Matches any digit.
  - `[a-zA-Z]`: Matches any letter (uppercase or lowercase).
- Quantifiers: Define how many times an element must appear:
  - `*`: Matches 0 or more times.
  - `+`: Matches 1 or more times.
  - `?`: Matches 0 or 1 time.
  - `{n,m}`: Matches between `n` (minimum) and `m` (maximum) times.
- Grouping and Capturing: Parentheses `()` group patterns and capture matched text.
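As a minimal sketch of these building blocks, the following uses Python's `re` module on a made-up log line. The line and the variable names are illustrative assumptions, not taken from a real system.

```python
import re

# Hypothetical log line used only for illustration.
line = "2024-01-15 ERROR code=404 user=alice"

# Literal characters: match exactly what is typed.
has_error = re.search(r"ERROR", line) is not None

# Anchor, character class and quantifier: four digits at the start of the line.
year = re.search(r"^\d{4}", line).group()

# Grouping and capturing: parentheses capture the digits after "code=".
code = re.search(r"code=(\d+)", line).group(1)

# "+" with a letter class, anchored to the end of the line with "$".
user = re.search(r"user=([a-zA-Z]+)$", line).group(1)

print(has_error, year, code, user)
```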
Why Use Regex in Cybersecurity?
Regex is essential in cybersecurity for tasks such as:
- Log Analysis: Quickly search and filter through logs to find specific events, IP addresses, error codes, or patterns.
- Data Extraction: Extract sensitive information like credit card numbers, email addresses, or phone numbers.
- Intrusion Detection: Identify patterns indicative of malicious activity, like SQL injection attempts, XSS payloads, or anomalous user behavior.
- Data Sanitization: Validate and sanitize inputs to prevent injection attacks.
Common Regex Patterns for Cybersecurity
1. Credit Card Numbers
Caveat: The all-inclusive pattern for credit card numbers (`\b(?:\d[ -]*?){13,16}\b`) can produce false positives because it matches any run of 13 to 16 digits, optionally separated by spaces or hyphens, including numbers that are not card numbers. Validate matches contextually or refine the regex for specific card types to minimize false positives.
Regex: `\b(?:\d[ -]*?){13,16}\b`
- Scenario: Identify and mask credit card numbers in application logs to prevent sensitive data exposure.
- Commands:
  - Using grep (the pattern relies on `\d`, `(?:...)` and a lazy quantifier, which need Perl-compatible mode, `-P` in GNU grep, rather than `-E`): `grep -P '\b(?:\d[ -]*?){13,16}\b' /path/to/logfile.log`
  - Using awk (a simplified pattern that matches only unseparated digit runs): `awk '/[0-9]{13,16}/' /path/to/logfile.log`
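One common way to validate candidates, which the text above does not name, is the Luhn checksum. The sketch below assumes that check is acceptable for your data and masks only candidates that pass it. The function names `luhn_ok` and `mask_cards` are hypothetical.

```python
import re

# Same broad pattern as above; it over-matches by design.
CARD_RE = re.compile(r"\b(?:\d[ -]*?){13,16}\b")

def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:        # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def mask_cards(text: str) -> str:
    """Mask Luhn-valid candidates, keeping only the last four digits."""
    def repl(m: re.Match) -> str:
        digits = re.sub(r"\D", "", m.group())
        if not luhn_ok(digits):
            return m.group()  # likely a false positive; leave it untouched
        return "*" * (len(digits) - 4) + digits[-4:]
    return CARD_RE.sub(repl, text)

print(mask_cards("card=4111 1111 1111 1111 ok"))
```

A checksum pass does not prove a number is a real card; it only removes candidates that cannot be one.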
Individual Credit Card Patterns
The patterns below use the non-capturing group `(?:...)`, so the grep commands use `-P` (Perl-compatible mode in GNU grep) rather than `-E`.
- Visa
`\b4[0-9]{12}(?:[0-9]{3})?\b`
`grep -P '\b4[0-9]{12}(?:[0-9]{3})?\b' /path/to/logfile.log`
- MasterCard (prefixes 51-55 and 2221-2720)
`\b(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}\b`
`grep -P '\b(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}\b' /path/to/logfile.log`
- American Express (Amex)
`\b3[47][0-9]{13}\b`
`grep -P '\b3[47][0-9]{13}\b' /path/to/logfile.log`
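To show how brand-specific patterns might be combined, here is a small sketch that classifies a digits-only candidate. The `card_brand` helper is hypothetical, and the MasterCard prefix is written on the assumption that the 2-series range is 2221-2720.

```python
import re

# Brand patterns, applied to a whole digits-only candidate via fullmatch().
BRANDS = {
    "visa": re.compile(r"4[0-9]{12}(?:[0-9]{3})?"),
    "mastercard": re.compile(
        r"(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)"
        r"[0-9]{12}"
    ),
    "amex": re.compile(r"3[47][0-9]{13}"),
}

def card_brand(number: str):
    """Return the brand name for a digits-only string, or None."""
    for brand, pattern in BRANDS.items():
        if pattern.fullmatch(number):
            return brand
    return None

print(card_brand("4111111111111111"))
```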
2. Email Addresses
Regex: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- Scenario: Extract email addresses from phishing emails to identify potential targets.
- Caveat: This regex does not match every valid email format, such as addresses with a quoted local part or an internationalized TLD in punycode form (for example `xn--p1ai`), and it can match strings that are not deliverable addresses.
- Commands:
  - Using grep: `grep -E '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /path/to/logfile.log`
  - Using awk: `awk '{ for(i=1; i<=NF; i++) if ($i ~ /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/) print $i }' /path/to/logfile.log`
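For the extraction scenario, a short sketch that pulls addresses out of text and counts them per domain may help. The sample text and the `email_domains` helper are made up for illustration.

```python
import re
from collections import Counter

EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def email_domains(text: str) -> Counter:
    """Count matched addresses per lower-cased domain."""
    domains = (
        m.group().rsplit("@", 1)[1].lower() for m in EMAIL_RE.finditer(text)
    )
    return Counter(domains)

# Hypothetical message header text.
sample = "To: Bob@Example.com, carol@example.com; reply-to evil@phish.example.net."
print(email_domains(sample))
```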
3. IPv4 Addresses
Refined Regex: To match IPv4 addresses while ensuring that each octet is within the valid range of 0-255, use the following pattern:
`\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b`
- Scenario: Identify source IP addresses from logs to trace back unauthorized access attempts.
- Commands:
  - Using grep: `grep -Eo '\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b' /path/to/logfile.log`
  - Using awk (awk reads `\b` as a backspace rather than a word boundary, so this version anchors the pattern to the whole field with `^` and `$`): `awk '{ for(i=1; i<=NF; i++) if ($i ~ /^((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$/) print $i }' /path/to/logfile.log`
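The tracing scenario often comes down to counting which sources appear most. The sketch below assumes log lines shaped like the hypothetical samples shown; `top_sources` is an illustrative name.

```python
import re
from collections import Counter

# One octet in the range 0-255, written as a non-capturing group.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
IPV4_RE = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")

def top_sources(lines, n=3):
    """Return the n most frequent IPv4 addresses across all lines."""
    counts = Counter(ip for line in lines for ip in IPV4_RE.findall(line))
    return counts.most_common(n)

# Hypothetical log lines for illustration only.
sample = [
    "Failed password for root from 203.0.113.7 port 22",
    "Failed password for admin from 203.0.113.7 port 22",
    "Accepted password for alice from 198.51.100.23 port 22",
    "bad address 999.1.1.1 ignored",
]
print(top_sources(sample))
```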
4. IPv6 Addresses
Refined Regex: To capture most valid IPv6 formats, including compressed and mixed (embedded IPv4) forms, use the pattern below. Because of the surrounding `\b` anchors, an address that starts or ends with `::` (such as `::1`) matches only when a word character sits directly next to the colons, so validate candidates with a dedicated parser where accuracy matters.
`\b(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9]))\b`
- Scenario: Analyze traffic logs to monitor IPv6 usage and detect unusual patterns.
- Commands:
  - Using grep: `grep -E '\b(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9]))\b' /path/to/logfile.log`
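A pattern this long is hard to audit, so an alternative worth considering is a loose tokenizer followed by a real parser. The sketch below swaps the regex above for Python's standard `ipaddress` module. It assumes addresses are delimited by whitespace, brackets, commas, semicolons or `=`, which may not hold for every log format. `find_ipv6` is a hypothetical helper.

```python
import ipaddress
import re

# Split on whitespace and a few common delimiters (an assumption about
# how addresses are set off in the log lines).
SPLIT_RE = re.compile(r"[\s\[\](),;=]+")

def find_ipv6(text: str) -> list:
    """Return valid IPv6 addresses found in text, in compressed form."""
    found = []
    for token in SPLIT_RE.split(text):
        if token.count(":") < 2:      # cheap pre-filter before parsing
            continue
        try:
            addr = ipaddress.IPv6Address(token)
        except ValueError:
            continue                  # e.g. a timestamp such as 12:30:45
        found.append(str(addr))
    return found

print(find_ipv6("src=2001:0DB8:0000:0000:0000:0000:0000:0001 dst=[::1] at 12:30:45"))
```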
5. URLs
Regex: `https?:\/\/(?:www\.)?[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\/?`
This pattern matches the scheme and hostname only; any path or query string after the first `/` is not captured.
- Scenario: Scrape URLs from a database of reported phishing websites to update a blacklist.
- Commands:
  - Using grep (the non-capturing group `(?:...)` needs `-P` in GNU grep rather than `-E`): `grep -Po 'https?:\/\/(?:www\.)?[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\/?' /path/to/logfile.log`
  - Using awk: `awk '{ for(i=1; i<=NF; i++) if ($i ~ /https?:\/\/[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\/?/) print $i }' /path/to/logfile.log`
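For the blacklist scenario, hostnames are usually what gets stored. The following sketch extracts matches and reduces them to unique lower-cased hostnames with `urllib.parse.urlsplit`. The `blocklist_hosts` name and the sample text are illustrative assumptions.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://(?:www\.)?[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/?")

def blocklist_hosts(text: str) -> list:
    """Return the sorted, de-duplicated hostnames of all matched URLs."""
    hosts = {urlsplit(m.group()).hostname for m in URL_RE.finditer(text)}
    return sorted(h for h in hosts if h)

# Hypothetical report text.
sample = (
    "Reported: http://login.example.com/reset and https://WWW.Example.org, "
    "plus ftp://files.example.net"
)
print(blocklist_hosts(sample))
```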
6. MAC Addresses
Regex: `\b([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})\b`
- Scenario: Identify MAC addresses in network logs to pinpoint specific devices involved in suspicious activity.
- Commands:
  - Using grep: `grep -Eo '\b([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})\b' /path/to/logfile.log`
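Logs often mix hyphen and colon separators and upper and lower case, which makes the same device look like several. A small sketch of normalizing matches, with a hypothetical `normalize_macs` helper and made-up sample text:

```python
import re

MAC_RE = re.compile(r"\b([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})\b")

def normalize_macs(text: str) -> list:
    """Return matched MAC addresses as lower-case, colon-separated strings."""
    return [m.group().lower().replace("-", ":") for m in MAC_RE.finditer(text)]

sample = "dhcp lease 00-1A-2B-3C-4D-5E to host; arp aa:bb:cc:dd:ee:ff seen"
print(normalize_macs(sample))
```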
7. SSNs (U.S. Social Security Numbers)
Caveat: The SSN pattern matches any hyphenated digit sequence with the same 3-2-4 layout, so it can flag identifiers that are not SSNs. Be cautious and consider contextual validation when using this pattern.
Regex: `\b\d{3}-\d{2}-\d{4}\b`
- Scenario: Detect exposed SSNs in data dumps to assess the scope of a data breach.
- Commands:
  - Using grep (`\d` needs `-P` in GNU grep; with `-E`, write `[0-9]` instead): `grep -P '\b\d{3}-\d{2}-\d{4}\b' /path/to/logfile.log`
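One way to reduce false positives is to reject values that were never issued. The sketch below layers lookaheads onto the pattern above to exclude area numbers 000, 666 and 900-999, group 00, and serial 0000, then masks what remains. This stricter variant and the `mask_ssns` helper are additions for illustration, not part of the original pattern.

```python
import re

# Stricter variant: negative lookaheads reject never-issued values.
SSN_RE = re.compile(r"\b(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")

def mask_ssns(text: str) -> str:
    """Replace all but the last four digits of each matched SSN."""
    return SSN_RE.sub(lambda m: "***-**-" + m.group()[-4:], text)

print(mask_ssns("ssn 123-45-6789 and 000-12-3456 and 666-12-3456"))
```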
8. Phone Numbers (U.S. format)
Regex: `\b(?:\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}\b`
- Scenario: Extract phone numbers from scam messages to build a database for investigation.
- Caveat: This regex strictly targets U.S. phone numbers; other formats may require different patterns.
- Commands:
  - Using grep (`\d`, `\s` and `(?:...)` need `-P` in GNU grep rather than `-E`): `grep -Po '\b(?:\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}\b' /path/to/logfile.log`
  - Using awk (POSIX awk has no `\d` or `\s`, and because fields are split on whitespace, only numbers written without spaces are found): `awk '{ for(i=1; i<=NF; i++) if ($i ~ /\(?[0-9]{3}\)?[-.]?[0-9]{3}[-.]?[0-9]{4}/) print $i }' /path/to/logfile.log`
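When building a database, it helps to store numbers in one canonical form. This sketch adds capture groups to the pattern above (an assumption made for illustration) and rewrites each match as digits joined by hyphens. `normalize_phones` is a hypothetical helper.

```python
import re

# The pattern above with capture groups for area code, prefix and line number.
PHONE_RE = re.compile(r"\b(?:\(?(\d{3})\)?[-.\s]?)?(\d{3})[-.\s]?(\d{4})\b")

def normalize_phones(text: str) -> list:
    """Return matches as NNN-NNN-NNNN, or NNN-NNNN when no area code is present."""
    out = []
    for m in PHONE_RE.finditer(text):
        area, prefix, line = m.groups()
        out.append(f"{area}-{prefix}-{line}" if area else f"{prefix}-{line}")
    return out

print(normalize_phones("Call (555) 123-4567 or 555.987.6543, ext 867-5309"))
```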
Practical Tips for Using Regex
- Testing and Debugging: Use regex testers like regex101 to test your patterns with sample data before deployment.
- Performance Considerations: Complex regex patterns can be slow, especially on large datasets; optimize patterns where possible.
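- Sanitization and Validation: When a regex validates untrusted input, anchor the pattern and bound the input length so that a crafted string cannot slip past the check or trigger excessive backtracking.
- Escaping Characters: Escape metacharacters (such as `.`, `?` or `[`) when they should match literally, and quote patterns in shell scripts so the shell does not interpret them.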
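The escaping tip is easy to get wrong in practice. As a small sketch, the following compares searching for a literal string with and without `re.escape`, and compiles the pattern once for reuse across lines. The request lines and the `count_literal` helper are made up for illustration.

```python
import re

def count_literal(needle: str, lines) -> int:
    """Count lines that contain needle as literal text, not as a pattern."""
    pattern = re.compile(re.escape(needle))   # compile once, reuse per line
    return sum(1 for line in lines if pattern.search(line))

# Hypothetical request lines.
lines = ["GET /index.php?id=1", "GET /indexXphp?id=1", "GET /about"]

# Unescaped, "." means any character and "?" makes the preceding "p" optional,
# so the pattern no longer describes the literal text.
unescaped = sum(1 for line in lines if re.search("index.php?id", line))
escaped = count_literal("index.php?id", lines)
print(unescaped, escaped)
```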