RegEx Primer

Finding Patterns in Logs (or Other Files)

What is Regex?

Regular Expressions (regex) are sequences of characters that form search patterns. They are used for matching, searching, and manipulating text, making them incredibly useful for analyzing data, detecting patterns, and automating tasks. In cybersecurity, regex can help identify sensitive information, extract useful data from logs, and detect anomalies.

Different Regex Formats

Regex patterns come in several different formats, each suited to specific use cases:

  • Basic Regular Expressions (BRE): The default syntax of grep and sed. Characters such as +, ?, |, (, and ) are literal unless escaped. Use BRE for simple patterns where portability matters.
  • Extended Regular Expressions (ERE): Treats +, ?, |, (), and {} as operators without backslashes, which keeps moderately complex patterns readable.
  • Perl-Compatible Regular Expressions (PCRE): Highly versatile, supporting lookaheads, lookbehinds, and more. Ideal for complex patterns and advanced searches.
  • POSIX Regular Expressions: BRE and ERE are both defined by POSIX. POSIX tools like awk use ERE and support bracket classes like [[:alnum:]]. Choose for cross-platform consistency.

Basic Concepts of Regex

Literal Characters

Match exactly what you type (e.g., abc matches "abc").

Metacharacters

Special characters with unique functions:

  • .: Matches any character except a newline.
  • ^: Anchors the match to the start of a line.
  • $: Anchors the match to the end of a line.
  • \: Escapes a metacharacter to treat it as a literal.
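
The metacharacters above can be tried on a few lines of text. The following is a minimal sketch; the sample lines and variable names are made up for illustration.

```bash
# Hypothetical sample log lines, for illustration only.
sample='ERROR disk full
WARN retry ERROR
login ok.
login ok!'

# ^ anchors to the start of a line: only lines that begin with ERROR.
starts_with_error=$(printf '%s\n' "$sample" | grep '^ERROR')

# \. matches a literal dot, and $ anchors to the end of the line.
ends_with_dot=$(printf '%s\n' "$sample" | grep 'ok\.$')

# An unescaped . matches any single character, so both "ok" lines match.
any_char_count=$(printf '%s\n' "$sample" | grep -c 'ok.$')

echo "$starts_with_error"   # ERROR disk full
echo "$ends_with_dot"       # login ok.
echo "$any_char_count"      # 2
```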

Character Classes

Define a set of characters:

  • [0-9] or \d: Matches any digit. \d is a PCRE shorthand (grep -P); in BRE and ERE, use [0-9] or [[:digit:]].
  • [a-zA-Z]: Matches any letter (uppercase or lowercase).
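
As a rough illustration of these classes, the sketch below filters a few invented input lines. It sticks to bracket classes, which work without PCRE support.

```bash
# Hypothetical input lines, for illustration only.
sample='user42
admin
2024
_tmp'

# [0-9] matches one digit anywhere in the line.
with_digit=$(printf '%s\n' "$sample" | grep '[0-9]')

# Anchored letter class: lines made up of letters only.
letters_only=$(printf '%s\n' "$sample" | grep -E '^[a-zA-Z]+$')

# [[:digit:]] is the POSIX spelling of [0-9] and does not need -P.
digits_only=$(printf '%s\n' "$sample" | grep -E '^[[:digit:]]+$')

echo "$with_digit"     # user42 and 2024
echo "$letters_only"   # admin
echo "$digits_only"    # 2024
```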

Quantifiers

Define how many times an element must appear:

  • *: Matches 0 or more times.
  • +: Matches 1 or more times.
  • ?: Matches 0 or 1 time.
  • {n,m}: Matches between n (minimum) and m (maximum) times.
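
Each quantifier can be compared on the same input. This is a small sketch using ERE (grep -E) and invented sample words; every pattern is anchored so the quantifier alone decides which lines match.

```bash
# Hypothetical sample values, for illustration only.
sample='colr
color
colour
colouur
id=7
id=12345'

# u? : zero or one "u".
optional_u=$(printf '%s\n' "$sample" | grep -E '^colou?r$')

# u+ : one or more "u".
one_or_more=$(printf '%s\n' "$sample" | grep -E '^colou+r$')

# u* : zero or more "u" (count of matching lines).
zero_or_more=$(printf '%s\n' "$sample" | grep -Ec '^colou*r$')

# {2,5} : between two and five digits.
bounded=$(printf '%s\n' "$sample" | grep -E '^id=[0-9]{2,5}$')

echo "$optional_u"    # color and colour
echo "$one_or_more"   # colour and colouur
echo "$zero_or_more"  # 3
echo "$bounded"       # id=12345
```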

Grouping and Capturing

Parentheses () group patterns and capture matched text.
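
One way to see capturing in action is a sed substitution that reuses the captured text. The sketch below assumes an sshd-style log line invented for this example; \1 and \2 refer to the first and second groups.

```bash
# Hypothetical sshd-style log line, for illustration only.
line='Failed password for root from 203.0.113.7 port 22'

# Two capture groups: \1 is the user name, \2 is the source address.
pair=$(printf '%s\n' "$line" |
  sed -E 's/.*for ([a-z]+) from ([0-9.]+) port.*/\1@\2/')

# Grouping also lets a quantifier apply to a whole sequence.
repeated=$(printf 'abab\nabba\n' | grep -E '^(ab)+$')

echo "$pair"      # root@203.0.113.7
echo "$repeated"  # abab
```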

Why Use Regex in Cybersecurity?

  • Log Analysis: Quickly search and filter through logs to find specific events, IP addresses, error codes, or patterns.
  • Data Extraction: Extract sensitive information like credit card numbers, email addresses, or phone numbers.
  • Intrusion Detection: Identify patterns indicative of malicious activity, like SQL injection attempts, XSS payloads, or anomalous user behavior.
  • Data Sanitization: Validate and sanitize inputs to prevent injection attacks.

Choosing a Regex Format

Basic Regular Expressions (BRE)

When to Use:

Use BRE when working with simple patterns and in cases where compatibility with various systems is a factor.

Example:

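grep -Gil '\(secret\|confidential\|sensitive\)' /path/to/file.txt

This command uses BRE (-G, which is grep's default) to search for "secret," "confidential," or "sensitive" in the file. In BRE, grouping is written \( \), and the \| alternation is a GNU extension; for strict portability, pass each word with its own -e option instead. The -i flag ignores case, and -l prints only the names of matching files.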

Extended Regular Expressions (ERE)

When to Use:

Choose ERE for moderately complex patterns where you need features like + for "one or more," ? for "zero or one," or {} for specifying occurrences.

Example:

grep -Eil '(secret|confidential|sensitive)' /path/to/file.txt

The -E flag enables ERE, so |, (), +, ?, and {} act as operators without backslashes, as in (secret|confidential|sensitive).
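
To see why the mode matters, the same pattern string can be run in both modes. This sketch uses made-up sample text, and its second line deliberately contains the pattern itself as literal characters.

```bash
# Hypothetical sample text; line 2 is the pattern written out literally.
sample='top secret plan
(secret|confidential|sensitive)
nothing here'

# Default (BRE) mode: ( | ) are ordinary characters, so only line 2 matches.
bre_count=$(printf '%s\n' "$sample" |
  grep -c '(secret|confidential|sensitive)')

# ERE mode: ( | ) are operators, so any line containing one of the words matches.
ere_count=$(printf '%s\n' "$sample" |
  grep -Ec '(secret|confidential|sensitive)')

# Portable alternative without alternation syntax: one -e per word.
multi_e_count=$(printf '%s\n' "$sample" |
  grep -c -e 'secret' -e 'confidential' -e 'sensitive')

echo "$bre_count"      # 1
echo "$ere_count"      # 2
echo "$multi_e_count"  # 2
```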

Perl-Compatible Regular Expressions (PCRE)

When to Use:

PCRE is best suited for complex patterns, especially when needing advanced features like lookaheads, lookbehinds, or shorthand character classes like \d and \w.

Example:

grep -Pail '(?i)(secret|confidential|sensitive)' /path/to/file.txt

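The -P flag enables PCRE. The inline (?i) modifier makes the pattern case-insensitive; it is redundant here because -i already does that, but it is useful when only part of a pattern should ignore case. The -a flag treats binary files as text, and -l prints only the names of matching files. -P is a GNU grep feature and may be missing from other grep implementations.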
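
The lookaround features mentioned above might be used as follows. This is a sketch that assumes GNU grep built with PCRE support and a key=value log format invented for the example.

```bash
# Hypothetical key=value log lines, for illustration only.
sample='ts=1 user=alice action=login
ts=2 user=bob action=logout'

# (?<=user=) is a lookbehind: "user=" must precede the match
# but is not part of the output. \w+ then matches the name.
users=$(printf '%s\n' "$sample" | grep -oP '(?<=user=)\w+')

# Two lookaheads at the start of the line require both conditions,
# in any order, without consuming text.
alice_logins=$(printf '%s\n' "$sample" |
  grep -cP '^(?=.*action=login)(?=.*user=alice)')

echo "$users"         # alice and bob
echo "$alice_logins"  # 1
```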

POSIX Regular Expressions

When to Use:

Use POSIX-compliant tools like awk for cross-platform consistency. POSIX character classes such as [:alnum:] and [:space:] must appear inside a bracket expression, as in [[:alnum:]] or [[:space:]].

Example:

awk '/secret|confidential|sensitive/' /path/to/file.txt

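awk matches with POSIX ERE, so this works across Unix-like systems for simpler matches. Unlike the grep -l examples above, it prints each matching line rather than the file name, and the match is case-sensitive.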
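
Because awk has no -i flag, one portable way to ignore case is to lowercase the line before matching. The sketch below assumes a few invented log lines and prints the line number next to each hit.

```bash
# Hypothetical sample lines, for illustration only.
sample='Secret key rotated
user login ok
CONFIDENTIAL report sent'

# tolower($0) lowercases the whole line before the ERE is applied;
# NR is awk's built-in line counter.
hits=$(printf '%s\n' "$sample" |
  awk 'tolower($0) ~ /secret|confidential|sensitive/ { print NR ": " $0 }')

echo "$hits"
```

With this input, lines 1 and 3 are reported and the original casing of each line is preserved in the output.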

grep Location Scope

When you pass * as the file argument to grep, as shown below, the shell expands it to the non-hidden entries of the current directory before grep runs. grep then searches the contents of those files. Hidden files are NOT included, the search is not recursive, and the pattern is not matched against file names.

```bash
grep 'pattern' *
```

To search the contents of subdirectories as well, use -r or -R.

```bash
grep -r 'pattern' *
```
  • -r: Searches recursively through files and directories.
  • -R: Works like -r but also follows symbolic links.
  • *: As a location, does not include hidden files.
  • .: As a location, refers to the current directory as a whole, including hidden files and directories, but still requires -r or -R to search recursively.
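
The difference between these locations can be checked in a scratch directory. This is a minimal sketch; the file names and the "token" pattern are invented for illustration, and -s only silences the "Is a directory" message.

```bash
# Build a throwaway directory with a visible file, a hidden file,
# and a file inside a subdirectory (all names are hypothetical).
workdir=$(mktemp -d)
mkdir "$workdir/sub"
echo 'token=abc' > "$workdir/visible.log"
echo 'token=def' > "$workdir/.hidden.log"
echo 'token=ghi' > "$workdir/sub/nested.log"

# * without -r: only the visible file in the current directory.
star_hits=$(cd "$workdir" && grep -ls 'token' * | LC_ALL=C sort)

# * with -r: descends into sub, but the hidden top-level file is still skipped.
star_recursive=$(cd "$workdir" && grep -rl 'token' * | LC_ALL=C sort)

# . with -r: everything under the current directory, hidden files included.
dot_recursive=$(cd "$workdir" && grep -rl 'token' . | LC_ALL=C sort)

echo "$star_hits"
echo "$star_recursive"
echo "$dot_recursive"

rm -rf "$workdir"
```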

Common Regex Patterns for Cybersecurity

1. Credit Card Numbers

To detect credit card numbers, we can use a regex pattern that matches typical formats.

\b(?:\d[ -]*?){13,16}\b
Usage Example:
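grep -P '\b(?:\d[ -]*?){13,16}\b' /path/to/logfile.log

This pattern needs PCRE (-P): the non-capturing group (?:...), the \d shorthand, and the lazy quantifier *? are not part of ERE, so grep -E will not interpret it as intended. It matches any run of 13 to 16 digits with optional spaces or hyphens between them, so treat hits as candidates to verify rather than confirmed card numbers.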
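
The pattern can be tried on a few sample lines before running it against real logs. This sketch assumes GNU grep with PCRE support; the order lines are invented, and 4111 1111 1111 1111 is a widely used test number, not a real card.

```bash
# Hypothetical order log, for illustration only.
sample='order=1001 card=4111 1111 1111 1111 status=ok
order=1002 card=4111-1111-1111-1111 status=ok
order=1003 ref=12345 status=ok'

# -o prints only the matched text; -P enables the PCRE syntax the pattern uses.
card_hits=$(printf '%s\n' "$sample" |
  grep -oP '\b(?:\d[ -]*?){13,16}\b')

echo "$card_hits"
```

Short numbers such as 1001 or 12345 are skipped because they contain fewer than 13 digits.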

2. Email Addresses

A standard pattern for email addresses can be effective for capturing emails in logs.

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
Usage Example:
grep -E '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' /path/to/logfile.log

Use Extended (ERE) for straightforward matching or PCRE if you need advanced email validation.
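
For log review, it is often more useful to list each distinct address once. The following sketch combines -o with sort -u; the log lines and addresses are invented for illustration.

```bash
# Hypothetical mail and auth log lines, for illustration only.
sample='login ok user=alice@example.com ip=198.51.100.4
bounce from bob.smith+news@mail.example.org
retry user=alice@example.com
no address here'

# -o prints only the matched address; sort -u removes duplicates.
emails=$(printf '%s\n' "$sample" |
  grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' |
  LC_ALL=C sort -u)

echo "$emails"
```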

3. IPv4 Addresses

IPv4 addresses require precise regex to match valid octet ranges.

\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b
Usage Example:
grep -Eo '\b((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\b' /path/to/logfile.log

Extended (ERE) is often enough, but PCRE provides flexibility with shorthand for digits (\d) and complex patterns if needed.
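
The octet ranges can be checked against a few edge cases. This is a sketch with invented log lines; it assumes a grep that supports \b in ERE, as GNU grep does.

```bash
# Hypothetical log lines; 192.168.1.300 and 999.1.1.1 are deliberately invalid.
sample='10.0.0.5 GET /index.html
192.168.1.300 GET /admin
connect from 203.0.113.7 to 10.0.0.5
version 999.1.1.1'

# Keep the pattern in a variable so the pipeline stays readable.
octet='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'
ipv4="\\b(${octet}\\.){3}${octet}\\b"

# -o extracts each address; sort -u lists every distinct one once.
addresses=$(printf '%s\n' "$sample" | grep -Eo "$ipv4" | LC_ALL=C sort -u)

echo "$addresses"
```

Only 10.0.0.5 and 203.0.113.7 are reported; the out-of-range octets and the word boundaries keep the two invalid strings from matching.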


Practical Tips for Using Regex

  1. Testing and Debugging: Use online regex testers like regex101 to test patterns with sample data.
  2. Performance Considerations: Complex patterns can be slow, especially with PCRE, whose backtracking engine can take a long time on nested quantifiers. Prefer ERE or BRE if advanced features aren’t required.
  3. Choose the Right Format: Use BRE for simplicity, ERE for moderate complexity, and PCRE for advanced needs.
  4. Sanitization and Validation: Anchor patterns with ^, $, or \b and bound repetition with {n,m} so they do not match unintended data, and always validate input lengths and formats before trusting a match.