RegEx Primer

Finding Patterns in Logs (or other files)

What is Regex?

Regular Expressions (regex) are sequences of characters that form search patterns. They are used for matching, searching, and manipulating text, making them incredibly useful for analyzing data, detecting patterns, and automating tasks. In cybersecurity, regex can help identify sensitive information, extract useful data from logs, and detect anomalies.

Basic Concepts of Regex

  • Literal Characters: Match exactly what you type (e.g., abc matches "abc").
  • Metacharacters: Special characters with unique functions:
      • .: Matches any character except a newline.
      • ^: Anchors the match to the start of a line.
      • $: Anchors the match to the end of a line.
      • \: Escapes a metacharacter to treat it as a literal.
  • Character Classes: Define a set of characters:
      • [0-9] or \d: Matches any digit. (\d is a Perl-style shorthand; POSIX tools such as grep -E and awk need [0-9] or [[:digit:]] instead.)
      • [a-zA-Z]: Matches any ASCII letter (uppercase or lowercase).
  • Quantifiers: Define how many times an element must appear:
      • *: Matches 0 or more times.
      • +: Matches 1 or more times.
      • ?: Matches 0 or 1 time.
      • {n,m}: Matches between n (minimum) and m (maximum) times.
  • Grouping and Capturing: Parentheses () group patterns and capture matched text.
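The concepts above can be tried out on a few lines of text. The following is a minimal sketch using grep -E; the sample log lines and the function names are invented for illustration and do not come from any real log.

```shell
# Hypothetical sample lines, invented for illustration only.
sample_lines() {
  printf '%s\n' \
    'ERROR disk full' \
    'WARN  retry 3 of 5' \
    'login ok user=alice' \
    'ERROR code=503'
}

# ^ anchors the match to the start of the line.
starts_with_error() { sample_lines | grep -E '^ERROR'; }

# [0-9]{3}$ requires three digits immediately before the end of the line.
ends_with_three_digits() { sample_lines | grep -E '[0-9]{3}$'; }

# () groups an alternation; -o prints only the matched text, not the whole line.
levels() { sample_lines | grep -Eo '^(ERROR|WARN)'; }

starts_with_error
ends_with_three_digits
levels
```

Single-quoting each pattern keeps the shell from interpreting $ and \ before grep sees them.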

Why Use Regex in Cybersecurity?

Regex is essential in cybersecurity for tasks such as:

  • Log Analysis: Quickly search and filter through logs to find specific events, IP addresses, error codes, or patterns.
  • Data Extraction: Extract sensitive information like credit card numbers, email addresses, or phone numbers.
  • Intrusion Detection: Identify patterns indicative of malicious activity, like SQL injection attempts, XSS payloads, or anomalous user behavior.
  • Data Sanitization: Validate and sanitize inputs to prevent injection attacks.

Common Regex Patterns for Cybersecurity

1. Credit Card Numbers

Caveat: The all-inclusive regex pattern for credit card numbers (\b(?:\d[ -]*?){13,16}\b) can produce false positives because it matches any sequence of 13 to 16 digits, optionally separated by spaces or hyphens, including numbers that are not card numbers (13-digit millisecond timestamps, for example). Validate matches contextually or refine the regex for specific card types to minimize false positives.

Regex:

\b(?:\d[ -]*?){13,16}\b

  • Scenario: Identify and mask credit card numbers in application logs to prevent sensitive data exposure.

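  • Commands:

  • Using grep (GNU grep; the pattern uses Perl-compatible syntax, so it needs -P. \d, (?:...) and the lazy *? are not valid with -E):
    grep -P '\b(?:\d[ -]*?){13,16}\b' /path/to/logfile.log
    
  • Using awk (matches unseparated digit runs only):
    awk '/[0-9]{13,16}/' /path/to/logfile.log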
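The scenario also calls for masking, which the commands above do not do. Here is a minimal sketch of one way to approach it with sed -E (available in both GNU and BSD sed). The function name mask_cards and the [REDACTED-PAN] placeholder are hypothetical, and the pattern is a simplified POSIX variant that allows at most one space or hyphen between digits.

```shell
# mask_cards: reads log lines on stdin and replaces each run of 13 to 16
# digits (digits may be separated by a single space or hyphen) with a
# placeholder. The characters on either side of the run are kept via \1 and \3,
# so longer digit runs (17+ digits) are left untouched.
mask_cards() {
  sed -E 's/(^|[^0-9])([0-9][ -]?){12,15}[0-9]([^0-9]|$)/\1[REDACTED-PAN]\3/g'
}

# Hypothetical log lines for demonstration.
printf '%s\n' \
  'payment ok card=4111 1111 1111 1111 amount=10' \
  'order 12345 shipped' | mask_cards
```

Like the detection pattern, this masks any 13 to 16 digit number, so it errs on the side of redacting too much rather than too little.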
    

Individual Credit Card Patterns

- Visa

\b4[0-9]{12}(?:[0-9]{3})?\b

  • Use Case: Detect Visa card numbers in user-submitted forms to ensure data compliance.
  • Command (GNU grep; -P is needed for the non-capturing group):
    grep -P '\b4[0-9]{12}(?:[0-9]{3})?\b' /path/to/logfile.log

- MasterCard

\b(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}\b

  • Use Case: Filter MasterCard numbers from transaction data for audit purposes.
  • Note: The pattern covers the 51-55 prefixes and the 2221-2720 range.
  • Command (GNU grep; -P is needed for the non-capturing group):
    grep -P '\b(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|27[01][0-9]|2720)[0-9]{12}\b' /path/to/logfile.log

- American Express (Amex)

\b3[47][0-9]{13}\b

  • Use Case: Capture Amex numbers in breach data to notify affected users.
  • Command:
    grep -E '\b3[47][0-9]{13}\b' /path/to/logfile.log
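Brand-specific patterns still match digit strings that merely look like card numbers. One common way to cut false positives further is the Luhn checksum used by payment card numbers. It is not part of the patterns above and is brought in here only as an illustration. The sketch below assumes a POSIX awk; the function names luhn_ok and luhn_filter are hypothetical.

```shell
# luhn_ok: exit status 0 if the argument passes the Luhn checksum, 1 otherwise.
luhn_ok() {
  printf '%s\n' "$1" | awk '
    {
      gsub(/[^0-9]/, "")              # drop spaces, hyphens, anything non-digit
      n = length($0); sum = 0
      for (i = n; i >= 1; i--) {
        d = substr($0, i, 1) + 0
        if ((n - i) % 2 == 1) {       # every second digit, counting from the right
          d *= 2
          if (d > 9) d -= 9
        }
        sum += d
      }
      exit ((n > 0 && sum % 10 == 0) ? 0 : 1)
    }'
}

# luhn_filter: reads candidate numbers (one per line) on stdin and prints
# only those that pass the checksum.
luhn_filter() {
  while IFS= read -r candidate; do
    if luhn_ok "$candidate"; then printf '%s\n' "$candidate"; fi
  done
}
```

A possible use is to pipe the output of a grep -o command into luhn_filter, so that only checksum-valid candidates are reported.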

2. Email Addresses

Regex:

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}

  • Scenario: Extract email addresses from phishing emails to identify potential targets.
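  • Caveat: This regex does not match every valid address: it misses internationalized domains (non-ASCII or punycode TLDs such as xn--p1ai) and quoted local parts. It also accepts some invalid strings, such as addresses with consecutive dots.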

  • Commands:

  • Using grep:
    grep -E '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /path/to/logfile.log
    
  • Using awk:
    awk '{ for(i=1; i<=NF; i++) if ($i ~ /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/) print $i }' /path/to/logfile.log
    

3. IPv4 Addresses

Refined Regex: To accurately match IPv4 addresses and ensure that each octet is within the valid range of 0-255, use the following pattern:

\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b
  • Scenario: Identify source IP addresses from logs to trace back unauthorized access attempts.

  • Commands:

  • Using grep:
    grep -Eo '\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b' /path/to/logfile.log
    
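  • Using awk (in awk regexes \b is the backspace character, not a word boundary, so the pattern is anchored to the whole field with ^ and $ instead):
    awk '{ for(i=1; i<=NF; i++) if ($i ~ /^((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$/) print $i }' /path/to/logfile.log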
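To trace unauthorized access attempts, it often helps to rank addresses by how often they appear. The following is a minimal sketch that assumes GNU grep (for \b with -E); the function name top_ips and the sample lines are hypothetical.

```shell
# IPv4 pattern with every octet limited to 0-255.
ipv4_re='\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b'

# top_ips: reads log lines on stdin and prints "count address",
# most frequent address first.
top_ips() {
  grep -Eo "$ipv4_re" | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Hypothetical auth-log style lines for demonstration.
printf '%s\n' \
  'Failed password for root from 10.0.0.5 port 22' \
  'Failed password for admin from 10.0.0.5 port 22' \
  'Accepted password for bob from 192.168.1.20 port 22' \
  'Invalid address 999.1.1.1 ignored' | top_ips
```

The out-of-range 999.1.1.1 yields no match at all, which is the point of constraining each octet instead of using [0-9]{1,3}.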
    

4. IPv6 Addresses

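Refined Regex: To capture the common IPv6 formats, including compressed and mixed (embedded IPv4) forms, use the comprehensive pattern below. Because of the surrounding \b anchors, an address that begins or ends with "::" (such as ::1) is matched only when a word character sits directly next to it.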

\b(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9]))\b
  • Scenario: Analyze traffic logs to monitor IPv6 usage and detect unusual patterns.

  • Commands:

  • Using grep:
    grep -E '\b(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]|)[0-9]))\b' /path/to/logfile.log
    

5. URLs

Regex:

https?:\/\/(?:www\.)?[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\/?

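  • Scenario: Scrape URLs from a database of reported phishing websites to update a blacklist.
  • Caveat: The pattern stops after the host name (plus an optional trailing slash), so paths and query strings are not captured.

  • Commands:

  • Using grep (GNU grep; -P is needed for the non-capturing group (?:...)):
    grep -Po 'https?:\/\/(?:www\.)?[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\/?' /path/to/logfile.log
    
  • Using awk (prints the whole whitespace-delimited field that contains a URL):
    awk '{ for(i=1; i<=NF; i++) if ($i ~ /https?:\/\/[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\/?/) print $i }' /path/to/logfile.log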
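For the blacklist scenario, the host names matter more than the full URLs. Below is a minimal sketch of one way to reduce matches to a deduplicated, lowercased host list using only POSIX ERE features. The function name url_hosts and the sample lines are hypothetical.

```shell
# url_hosts: reads text on stdin and prints each distinct host name once.
url_hosts() {
  grep -Eo 'https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' \
    | sed -E 's#^https?://(www\.)?##' \
    | tr '[:upper:]' '[:lower:]' \
    | sort -u
}

# Hypothetical report lines for demonstration.
printf '%s\n' \
  'GET http://www.Example.com/login' \
  'ref=https://evil.example.net/a?b=1' \
  'seen again: https://example.com' | url_hosts
```

Stripping the scheme and a leading www. before sorting means http://www.Example.com and https://example.com collapse into one entry.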
    

6. MAC Addresses

Regex:

\b([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})\b

  • Scenario: Identify MAC addresses in network logs to pinpoint specific devices involved in suspicious activity.

  • Commands:

  • Using grep:
    grep -Eo '\b([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})\b' /path/to/logfile.log
    

7. SSNs (U.S. Social Security Numbers)

Caveat: The SSN pattern matches any number with the same 3-2-4 digit layout, so hits may be other identifiers rather than SSNs. Be cautious and consider contextual validation when using this pattern.

Regex:

\b\d{3}-\d{2}-\d{4}\b

  • Scenario: Detect exposed SSNs in data dumps to assess the scope of a data breach.

  • Commands:

  • Using grep:
    grep -E '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b' /path/to/logfile.log
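One form of validation is to discard candidates that cannot be real SSNs. The sketch below assumes GNU grep and a POSIX awk, and applies the rule that area numbers 000, 666 and 900-999, group 00 and serial 0000 are never issued. That rule is an addition for illustration, not something the pattern above encodes. The function name plausible_ssns is hypothetical.

```shell
# plausible_ssns: reads text on stdin, extracts NNN-NN-NNNN candidates and
# prints only those whose area, group and serial parts are in issuable ranges.
plausible_ssns() {
  grep -Eo '\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b' | awk -F- '
    $1 != "000" && $1 != "666" && $1 < 900 && $2 != "00" && $3 != "0000"'
}

# Hypothetical sample lines for demonstration.
printf '%s\n' \
  'ssn=123-45-6789' \
  'id=000-12-3456' \
  'ref 912-34-5678' | plausible_ssns
```

This narrows the candidate list but does not prove that a surviving match is an SSN, so contextual review is still needed.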
    

8. Phone Numbers (U.S. format)

Regex:

\b(?:\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}\b

  • Scenario: Extract phone numbers from scam messages to build a database for investigation.
  • Caveat: This regex strictly targets U.S. phone numbers; other formats may require different patterns.

  • Commands:

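  • Using grep (GNU grep; -P is needed for \d, \s and (?:...)):
    grep -Po '\b(?:\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}\b' /path/to/logfile.log
    
  • Using awk (awk does not support \d or \s, so bracket expressions are used; match() scans the whole line so that numbers containing spaces are not split across fields, and only the first hit per line is printed):
    awk 'match($0, /\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}/) { print substr($0, RSTART, RLENGTH) }' /path/to/logfile.log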
    

Practical Tips for Using Regex

  1. Testing and Debugging: Use regex testers like regex101 to test your patterns with sample data before deployment.
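  2. Performance Considerations: Complex regex patterns can be slow on large datasets, especially nested quantifiers such as (a+)+ in backtracking engines. Prefer specific, anchored patterns, and use fixed-string search (grep -F) when no metacharacters are needed.
  3. Sanitization and Validation: Ensure regex patterns do not create vulnerabilities of their own. Limit input length before matching, and anchor validation patterns with ^ and $ so that a partial match is not mistaken for a valid input.
  4. Escaping Characters: Escape metacharacters that should match literally (for example, \. for the dots in an IP address), and single-quote patterns in the shell so that backslashes and $ reach the regex engine unchanged.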
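Testing does not have to happen only in a web-based tool. The following is a minimal sketch of a command-line check that runs a pattern against strings that should and should not match. The helper name check_pattern is hypothetical, and the MAC pattern assumes GNU grep for \b.

```shell
# check_pattern PATTERN EXPECT TEXT
# EXPECT is "match" or "nomatch". Prints PASS or FAIL and returns non-zero on FAIL.
check_pattern() {
  pattern=$1; expect=$2; text=$3
  if printf '%s\n' "$text" | grep -Eq -- "$pattern"; then
    actual=match
  else
    actual=nomatch
  fi
  if [ "$actual" = "$expect" ]; then
    printf 'PASS  %-8s %s\n' "$expect" "$text"
  else
    printf 'FAIL  expected %s, got %s: %s\n' "$expect" "$actual" "$text"
    return 1
  fi
}

# Example: positive and negative cases for a MAC address pattern.
mac_re='\b([0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b'
check_pattern "$mac_re" match   'src=00:1A:2b:3C:4d:5E'
check_pattern "$mac_re" match   'aa-bb-cc-dd-ee-ff'
check_pattern "$mac_re" nomatch '00:1A:2b:3C:4d'
check_pattern "$mac_re" nomatch 'zz:zz:zz:zz:zz:zz'
```

Keeping a handful of negative cases next to the positive ones makes it easier to notice when a later tweak to a pattern starts matching too much.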